Pandoc: GFM to HTML creates incompatible anchors vs. Github/Gitlab anchor logic

Created on 11 Nov 2018 · 10 Comments · Source: jgm/pandoc

Pandoc: 2.4-1 / Debian 9 (github DEB download)

Use: pandoc -s -f gfm+backtick_code_blocks -t html -o file.html file.md

The logic Pandoc uses to generate anchors doesn't match the logic used by Github/Gitlab rendering. After searching around, I _think_ this is the routine they use; the list of STRIPPED chars seems to match what I am seeing:

In general, Pandoc keeps more punctuation (slashes, parens, periods, etc.) in the generated anchor than Github/Gitlab does. Here is a sample of headings from my Markdown for which Pandoc generates a different anchor:

## Contents

  - [net.ifnames Naming](#netifnames-naming)
  - [/etc/hostname](#etchostname)
  - [Instanced Units (.service, .socket, etc...)](#instanced-units-service-socket-etc)
  - [Mount Units (.mount)](#mount-units-mount)
  - [Example Bind Mount - /var/tmp to /tmp](#example-bind-mount-vartmp-to-tmp)

## net.ifnames Naming

 - pandoc: `net.ifnames-naming`
 - gitlab / github: `netifnames-naming`

## /etc/hostname

 - pandoc: `/etc/hostname`
 - gitlab / github: `etchostname`

## Instanced Units (.service, .socket, etc...)

 - pandoc: `instanced-units-(.service,-.socket,-etc...)`
 - gitlab / github: `instanced-units-service-socket-etc`

## Mount Units (.mount)

 - pandoc: `mount-units-(.mount)`
 - gitlab / github: `mount-units-mount`

## Example Bind Mount - /var/tmp to /tmp

 - pandoc: `example-bind-mount---/var/tmp-to-/tmp`
 - gitlab / github: `example-bind-mount---vartmp-to-tmp`

Gists of each platform's rendering show their anchor generation; it matches what you see rendered when the MD file is saved into the repository view (same rendering engine):

The functional problem is that manually maintained TOC lists which work correctly when using the Markdown files linked directly from Github/Gitlab do not work when Pandoc processes them to create HTML Pages out of the content. With this kind of technical writing it's hard to not use these kinds of markup in Heading elements here and there, especially when referring to filenames or keywords which can't be reworded. Thanks!

Related issues I found: #2821 #3388

All 10 comments

GitHub doesn't use redcarpet any more for rendering. They use a variant of cmark.
It may be that they still use this list of characters, however.

Yeah, I wasn't 100% sure. Chasing this down ended in several dead ends, and I can't put my finger on exactly which code holds the routine. The only reason it "felt right" was that the list of stripped characters matched what I was seeing... (not having luck with Google finding the right source code)

I notice that the gfm_auto_identifiers extension is not documented in MANUAL.txt. It should be (including the algorithm).

Here's the relevant function (toIdent from Text.Pandoc.Readers.CommonMark):

toIdent :: ReaderOptions -> [Inline] -> String
toIdent opts = map (\c -> if isSpace c then '-' else c)
               . filterer
               . map toLower . stringify
  where filterer = if isEnabled Ext_ascii_identifiers opts
                   then mapMaybe toAsciiChar
                   else filter (\c -> isLetter c || isAlphaNum c || isSpace c ||
                                      c == '_' || c == '-')

@kivikakk might be able to help us locate the exact algorithm GitHub uses to create the automatic header identifiers, so we can match it better.

Ah, it looks like the ascii_identifiers extension (which is enabled by default for gfm) is interfering:

% pandoc -f gfm+gfm_auto_identifiers
## Mount Units (.mount)
<h2 id="mount-units-(.mount)">Mount Units (.mount)</h2>

% pandoc -f gfm+gfm_auto_identifiers-ascii_identifiers
## Mount Units (.mount)
<h2 id="mount-units-mount">Mount Units (.mount)</h2>

This should be trivial to fix.
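The interaction is visible in the toIdent function quoted earlier: when ascii_identifiers is enabled, the punctuation filter in the else branch never runs. Here is a minimal standalone sketch of that shape. The names `toIdentSketch` and `toAsciiCharSketch` are hypothetical, the extension check is reduced to a `Bool`, the input is a plain `String` rather than `[Inline]`, and the stand-in for toAsciiChar is a simplification (it only passes ASCII through and drops everything else, which is an assumption about the real function).

```haskell
import Data.Char (isAlphaNum, isAscii, isLetter, isSpace, toLower)
import Data.Maybe (mapMaybe)

-- Hypothetical, simplified stand-in for pandoc's toAsciiChar (assumption):
-- ASCII characters pass through unchanged, everything else is dropped.
toAsciiCharSketch :: Char -> Maybe Char
toAsciiCharSketch c
  | isAscii c = Just c
  | otherwise = Nothing

-- Same shape as the quoted toIdent: lowercase, filter, then turn
-- whitespace into hyphens. The Bool replaces the Ext_ascii_identifiers check.
toIdentSketch :: Bool -> String -> String
toIdentSketch asciiIdentifiers =
  map (\c -> if isSpace c then '-' else c) . filterer . map toLower
  where
    filterer
      -- ASCII branch: punctuation such as '(' and '.' is ASCII, so it survives.
      | asciiIdentifiers = mapMaybe toAsciiCharSketch
      -- GFM branch: only letters, digits, whitespace, '_' and '-' survive.
      | otherwise =
          filter (\c -> isLetter c || isAlphaNum c || isSpace c || c == '_' || c == '-')
```

In GHCi, `toIdentSketch True "Mount Units (.mount)"` gives `"mount-units-(.mount)"` and `toIdentSketch False "Mount Units (.mount)"` gives `"mount-units-mount"`, matching the two pandoc runs above.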

Roger that! I manage this via CI/CD, testing a pipeline build now to verify.... _insert hold music_

Every example above works great (the actual in-place content as well as a bunch more), 100% fixes everything right up when adding +gfm_auto_identifiers-ascii_identifiers to work around it. Thank you! :)

Great, I'm going to fix the code so that ascii_identifiers works better with gfm_auto_identifiers. (While I'm at it, I'll make gfm_auto_identifiers work also with other formats, and I'll document it.)

> @kivikakk might be able to help us locate the exact algorithm GitHub uses to create the automatic header identifiers, so we can match it better.

Happy to help. We:

  • take the textual content of the heading (essentially the innerText of the DOM node)
  • convert it to lowercase in a Unicode-aware manner
  • remove all characters except for hyphen, space, and members of the Unicode general categories Letter, Mark, Number, and Connector_Punctuation
  • convert all spaces to hyphens

That leaves us with the ID. If we've already created a heading with an identical ID, we append -1 to it. If the ID suffixed with -1 has been taken, we try -2, and so on.
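The steps described above can be sketched in a few lines of Haskell. This is only an illustration of the description, not GitHub's code: `githubSlug` and `uniqueSlugs` are hypothetical names, the input is assumed to be the heading's plain text already, "space" is assumed to mean U+0020 only, and `toLower` from `Data.Char` lowercases one character at a time, which may differ from GitHub's lowercasing in edge cases.

```haskell
import Data.Char (GeneralCategory (ConnectorPunctuation), generalCategory,
                  isLetter, isMark, isNumber, toLower)
import Data.List (mapAccumL)

-- Characters that survive: hyphen, space, and the Unicode general
-- categories Letter, Mark, Number, and Connector_Punctuation.
keep :: Char -> Bool
keep c =
  c == '-' || c == ' '
    || isLetter c || isMark c || isNumber c
    || generalCategory c == ConnectorPunctuation

-- Lowercase, drop everything else, then turn spaces into hyphens.
githubSlug :: String -> String
githubSlug = map (\c -> if c == ' ' then '-' else c) . filter keep . map toLower

-- Assign IDs in document order; on a collision try -1, -2, and so on.
uniqueSlugs :: [String] -> [String]
uniqueSlugs = snd . mapAccumL step []
  where
    step seen heading =
      let base = githubSlug heading
          candidates = base : [base ++ "-" ++ show n | n <- [1 :: Int ..]]
          chosen = case dropWhile (`elem` seen) candidates of
            (x : _) -> x
            [] -> base -- unreachable: candidates is infinite
       in (chosen : seen, chosen)
```

For the headings in the report, `githubSlug "Example Bind Mount - /var/tmp to /tmp"` gives `"example-bind-mount---vartmp-to-tmp"`, and `uniqueSlugs ["Notes", "Notes", "Notes"]` gives `["notes", "notes-1", "notes-2"]`.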

@kivikakk thanks, we were close but not exactly there. This helps!
