Pandoc: GFM to HTML conversion adding extra paragraph markup around sub-list elements

Created on 4 Aug 2020  ·  11 Comments  ·  Source: jgm/pandoc

I use the latest pandoc (downloaded with curl) in a CI/CD pipeline to convert Markdown (GFM) into HTML. Since I only edit these files once every few months, I don't know _exactly_ when this started happening, but my guess is between versions 2.9.2.1 and 2.10, based on the last time I edited an MD file and the regenerated HTML was fine. This report comes from editing a file and running the pipeline today, which used version 2.10.1 compiled for Debian, downloaded from GitHub Releases.

The Markdown looks like this:

## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)

The processing used in a loop of all files is creating the HTML like so:

  pandoc -s \
    -f gfm+gfm_auto_identifiers-ascii_identifiers \
    -t html \
    --template="./style/pandoc_html.tpl" \
    --include-in-header="./style/header_include.html" \
    --include-before-body="./style/body_before.html" \
    --include-after-body="./style/body_after.html" \
    --metadata pagetitle="$_TITLE" \
    --css="${_CSS}" \
    -o "${_HTML}" "$file"

The resulting HTML has extra <p> elements wrapping the sub-list items, but inconsistently: on some Markdown pages where only a top-level list exists (a TOC with no leaves) it injects <p> inside the list elements, but in this case it's "mix and match" within the list and sub-list, like so:

<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><p><a href="#apache-iptables-ports">Apache iptables Ports</a></p></li>
<li><p><a href="#apache-default-template">Apache Default Template</a></p></li>
<li><p><a href="#apache-80-template">Apache 80 Template</a></p></li>
<li><p><a href="#apache-443-template">Apache 443 Template</a></p></li>
</ul></li>
</ul>
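
One way to catch this symptom in a pipeline is to count paragraph-wrapped list items in the generated HTML. The following is only a minimal sketch and not part of the original scripts: the function name `count_loose_items` is hypothetical, and it would also flag lists that are intentionally loose, so treat a non-zero count as a hint rather than proof of the bug.

```shell
#!/usr/bin/env bash
# Count list items whose content is wrapped in <p>, reading HTML from stdin.
# count_loose_items is a hypothetical helper name, not a pandoc feature.
count_loose_items() {
  # grep -o prints every match on its own line; wc -l counts those lines.
  grep -o '<li><p>' | wc -l
}

# Example use in a generation loop (commented out, assumes $_HTML is set):
#   if [ "$(count_loose_items < "$_HTML")" -gt 0 ]; then
#     echo "warning: paragraph-wrapped list items in $_HTML" >&2
#   fi
```

Reading from stdin keeps the helper independent of how the HTML files are named or where they are written.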

As far as I can recall, the <p> elements were never injected by the older version (I would have noticed, as I now have huge extra line spacing between elements); now the generated TOC output is a mashup, with <p> appearing only "sometimes" and causing odd visual formatting. Once CSS is applied, the above ends up looking like this:

[Screenshot at 2020-08-04 12-17-54]

The result is semi-random (I'm sure there's a pattern hiding in there): the placement of <p> elements seems to depend on how the TOC is constructed (how many elements and sub-list elements). Pandoc definitely did not do this before; it's something new. The last time I ran my CI/CD it used a pandoc feature gfm+backtick_code_blocks+... which was deprecated in the latest code (my CI/CD failed and I had to fix the script to remove it). If that helps tell when it was last working correctly: that feature was still accepted then.

Thanks!

more-info-needed

All 11 comments

Pandoc switched to a new library for parsing gfm in 2.10.1.
We should be able to fix this, but for now you could try reverting to an earlier version.
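
For a pipeline that always downloads the latest release (like the reporter's download script shown later in this thread), reverting could be done by pinning the version. The sketch below is an assumption-laden illustration: `PANDOC_VERSION` and `pandoc_deb_url` are hypothetical names, the URL pattern is copied from that script, and 2.10 is chosen only because a later comment reports it producing the expected HTML. It also assumes the `-1-amd64.deb` asset naming holds for the pinned release.

```shell
#!/usr/bin/env bash
# Hypothetical pin; override by exporting PANDOC_VERSION before running.
PANDOC_VERSION="${PANDOC_VERSION:-2.10}"

# Build the release asset URL for a given version, using the same
# pattern as the download script quoted in this thread.
pandoc_deb_url() {
  local ver="$1"
  printf 'https://github.com/jgm/pandoc/releases/download/%s/pandoc-%s-1-amd64.deb\n' "$ver" "$ver"
}

# Usage (commented out so the sketch has no side effects):
#   curl -sLo "pandoc-${PANDOC_VERSION}-1-amd64.deb" "$(pandoc_deb_url "$PANDOC_VERSION")"
#   apt-get -qq -y install "./pandoc-${PANDOC_VERSION}-1-amd64.deb"
```

Pinning trades automatic upgrades for reproducible output, which may be preferable when the generated HTML is published directly.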

Might be a bug in commonmark-hs... should we migrate it there?

Well this is puzzling, because I can't reproduce it.

 % pandoc -f gfm+gfm_auto_identifiers-ascii_identifiers -t html 
## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)
^D
<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><a href="#apache-iptables-ports">Apache iptables Ports</a></li>
<li><a href="#apache-default-template">Apache Default Template</a></li>
<li><a href="#apache-80-template">Apache 80 Template</a></li>
<li><a href="#apache-443-template">Apache 443 Template</a></li>
</ul></li>
</ul>

While I can't recreate it using the tool above, a little testing tells me that it broke between 2.10 and 2.10.1; I was able to test 2.9.2.1 and 2.10 and receive the expected HTML sans <p> elements:

<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><a href="#apache-iptables-ports">Apache iptables Ports</a></li>
<li><a href="#apache-default-template">Apache Default Template</a></li>
<li><a href="#apache-80-template">Apache 80 Template</a></li>
<li><a href="#apache-443-template">Apache 443 Template</a></li>
</ul></li>
</ul>

[Screenshot: pandoc_2 9 2 1]

When I bump the script back up to the latest 2.10.1, the extra paragraphs inside lists return. Verbose mode didn't reveal anything interesting, and nothing changed other than the release version.

CI/CD pipeline (.gitlab-ci.yml)

image: debian:latest

before_script:
  - bash ./bin/debian_pandoc.sh >/dev/null

test:
  stage: test
  script:
  - bash ./bin/generate_html.sh
  only:
  - branches
  - tags

pages:
  stage: deploy
  script:
  - bash ./bin/generate_html.sh
  artifacts:
    paths:
    - public
  only:
  - master

debian_pandoc.sh

export DEBCONF_NOWARNINGS="yes"
echo 'debconf debconf/frontend select Noninteractive' | debconf-set-selections
apt-get -qq update
apt-get -qq -y install curl pandoc

_VER=$(curl -s "https://api.github.com/repos/jgm/pandoc/releases/latest" | grep -Po '"tag_name": "\K.*?(?=")')

curl -sLo "pandoc-${_VER}-1-amd64.deb" "https://github.com/jgm/pandoc/releases/download/${_VER}/pandoc-${_VER}-1-amd64.deb"

apt-get -qq -y install "./pandoc-${_VER}-1-amd64.deb"
apt-get -qq -y autoremove

generate_html.sh

# gitlab-ci looks here
[[ ! -d ./public ]] && mkdir ./public

# copy our CSS - <link rel="stylesheet" href="${_CSS}" />
_CSS="mdhtml.css"
cp "./style/${_CSS}" ./public/

# icon
_FAV="favicon.ico"
cp "./style/${_FAV}" ./public/

# ./src/foo.md -> ./public/foo.html
for file in ./src/*.md; do
  _FILE="${file##*/}"
  _HTML="./public/${_FILE%.*}.html"
  echo "Processing $file to $_HTML"

  # metadata for pandoc
  _TITLE=$(grep -m1 "^# " "$file" | sed -r 's/# //')

  # [foo](foo.md) -> [foo](foo.html)
  #  sed -i -r 's/(\[.*?\])\((.*?)\.md\)/\1(\2.html)/' "$file"
  # sed does not support non-greedy (.*?) like perl, we have to hack it
  sed -i -r \
    -e ':loop' \
    -e 's/(\[.*\])\((.*)\.md\)/\1(\2.html)/g' \
    -e 't loop' "$file"

  pandoc -s \
    -f gfm+gfm_auto_identifiers-ascii_identifiers \
    -t html \
    --template="./style/pandoc_html.tpl" \
    --include-in-header="./style/header_include.html" \
    --include-before-body="./style/body_before.html" \
    --include-after-body="./style/body_after.html" \
    --metadata pagetitle="$_TITLE" \
    --css="${_CSS}" \
    -o "${_HTML}" "$file"
done
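
As a side note on the "non-greedy" comment in the script above: a common alternative to looping is to use negated character classes, which stop at the first closing bracket or parenthesis. This is only a sketch of that alternative, not the reporter's method; `rewrite_md_links` is a hypothetical name, and it assumes link text contains no `]` and link targets contain no `)`.

```shell
#!/usr/bin/env bash
# Rewrite [text](target.md) to [text](target.html) on stdin, without a loop.
# [^]]* matches link text up to the first ']', and [^)]* matches the target
# up to the first ')', so each link on a line is matched on its own.
rewrite_md_links() {
  sed -E 's/(\[[^]]*\])\(([^)]*)\.md\)/\1(\2.html)/g'
}

# Example (commented out):
#   rewrite_md_links < ./src/foo.md > ./tmp/foo.md
```

Like the original expression, this leaves links with a fragment after the extension (for example `foo.md#section`) untouched.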

The Markdown is written in plain generic GFM using the "4 spaces indent" style for sub-list items. The idea is that these documents display the same in GitLab/GitHub rendering as they do when converted to HTML with some CSS applied (we fixed a pandoc issue about a year ago related to these TOC entries not matching; I'm the same guy).

The template is almost the same as the default pandoc one. I had to use a custom one to override some of the CSS/HTML embedded in the internal template (it's been so long I forget what exactly; something in the header is hard-coded?), but just in case, here's the TPL file referenced above:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="$lang$" xml:lang="$lang$"$if(dir)$ dir="$dir$"$endif$>
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
$for(author-meta)$
  <meta name="author" content="$author-meta$" />
$endfor$
$if(date-meta)$
  <meta name="dcterms.date" content="$date-meta$" />
$endif$
$if(keywords)$
  <meta name="keywords" content="$for(keywords)$$keywords$$sep$, $endfor$" />
$endif$
  <title>$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$</title>
$if(quotes)$
  <style type="text/css">
      q { quotes: "“" "”" "‘" "’"; }
  </style>
$endif$
$if(highlighting-css)$
  <style type="text/css">
$highlighting-css$
  </style>
$endif$
$for(css)$
  <link rel="stylesheet" href="$css$" />
$endfor$
$if(math)$
  $math$
$endif$
$for(header-includes)$
  $header-includes$
$endfor$
</head>
<body>
$for(include-before)$
$include-before$
$endfor$
$if(title)$
<header id="title-block-header">
<h1 class="title">$title$</h1>
$if(subtitle)$
<p class="subtitle">$subtitle$</p>
$endif$
$for(author)$
<p class="author">$author$</p>
$endfor$
$if(date)$
<p class="date">$date$</p>
$endif$
</header>
$endif$
$if(toc)$
<nav id="$idprefix$TOC">
$table-of-contents$
</nav>
$endif$
$body$
$for(include-after)$
$include-after$
$endfor$
</body>
</html>

I create the TOCs manually so they display inside GitLab/GitHub; to be clear, pandoc is not generating the TOC in this example. A different page, which has no sub-lists in its TOC, produces HTML like this:

<h2 id="contents">Contents</h2>
<ul>
<li><p><a href="#overview">Overview</a></p></li>
<li><p><a href="#process">Process</a></p></li>
</ul>

Another document has the mix-and-match going on:

<h2 id="contents">Contents</h2>
<ul>
<li><p><a href="#prerequisites">Prerequisites</a></p>
<ul>
<li><a href="#ad-setup-information">AD Setup Information</a></li>
</ul></li>
<li><p><a href="#implementation">Implementation</a></p>
<ul>
<li><a href="#install-rpms">Install RPMs</a></li>
<li><a href="#dns-configuration">DNS Configuration</a></li>
<li><a href="#configure-kerberos">Configure Kerberos</a>
<ul>
<li><a href="#get-a-kerberos-ticket">Get a Kerberos ticket</a></li>
<li><a href="#list-the-ticket-provided">List the ticket provided</a></li>
<li><a href="#destroy-the-ticket">Destroy the ticket</a></li>
</ul></li>
<li><a href="#samba-configuration">Samba Configuration</a>
<ul>
<li><a href="#join-the-domain">Join the domain</a></li>
<li><a href="#configure-winbind-authentication">Configure winbind authentication</a></li>
</ul></li>
<li><a href="#pam-configuration">PAM Configuration</a>
<ul>
<li><a href="#rhel5-and-rhel6">RHE5 and RHEL6</a></li>
<li><a href="#rhel6-only">RHEL6 Only</a></li>
</ul></li>
<li><a href="#parent-home-directory">Parent Home Directory</a></li>
</ul></li>
<li><p><a href="#testing">Testing</a></p></li>
<li><p><a href="#cached-logins">Cached Logins</a></p></li>
<li><p><a href="#user-crontabs">User crontabs</a></p></li>
<li><p><a href="#references">References</a></p></li>
</ul>

...this last one is interesting because sub-sub-lists (??) are missing it - so a sub-list without sub-sub has extra <p> but a sub-list with sub-sub-lists has no <p> added in the list elements. This one deserves another screenshot.

[Screenshot: mixed]

Something strange is afoot at the Circle-K...

Puzzling indeed, I can reproduce with the pandoc from homebrew:

pandoc 2.10.1
Compiled with pandoc-types 1.21, texmath 0.12.0.2, skylighting 0.8.5
Default user data directory: /Users/maurobieg/.local/share/pandoc or /Users/maurobieg/.pandoc
Copyright (C) 2006-2020 John MacFarlane
Web:  https://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.
~ pandoc -f gfm
## Contents

  - [Server Installation](#server-installation)
  - [Server User Setup](#server-user-setup)
      - [Disable root Login](#disable-root-login)
  - [Server Hardening](#server-hardening)
      - [fail2ban Setup](#fail2ban-setup)
  - [Apache Webserver](#apache-webserver)
      - [Apache iptables Ports](#apache-iptables-ports)
      - [Apache Default Template](#apache-default-template)
      - [Apache 80 Template](#apache-80-template)
      - [Apache 443 Template](#apache-443-template)







<h2 id="contents">Contents</h2>
<ul>
<li><a href="#server-installation">Server Installation</a></li>
<li><a href="#server-user-setup">Server User Setup</a>
<ul>
<li><a href="#disable-root-login">Disable root Login</a></li>
</ul></li>
<li><a href="#server-hardening">Server Hardening</a>
<ul>
<li><a href="#fail2ban-setup">fail2ban Setup</a></li>
</ul></li>
<li><a href="#apache-webserver">Apache Webserver</a>
<ul>
<li><p><a href="#apache-iptables-ports">Apache iptables Ports</a></p></li>
<li><p><a href="#apache-default-template">Apache Default Template</a></p></li>
<li><p><a href="#apache-80-template">Apache 80 Template</a></p></li>
<li><p><a href="#apache-443-template">Apache 443 Template</a></p></li>
</ul></li>
</ul>

I don't have master built on this machine, though... and are we sure try.pandoc is running 2.10.1?

> are we sure try.pandoc is running 2.10.1?

It puts the version (generated by the library) on the bottom of the page when you convert -- so yes.
Maybe the homebrew pandoc is not right?

Can you put something into your pipeline that runs pandoc --version so we can confirm that the correct version is being run?
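
A minimal sketch of such a check follows. It is an illustration only: `report_version` is a hypothetical helper, and printing the resolved path is an added assumption that may help here because the download script quoted earlier installs pandoc twice (once from the Debian repositories, once from the downloaded .deb).

```shell
#!/usr/bin/env bash
# Print which executable a tool name resolves to and its first version line.
# report_version is a hypothetical helper name.
report_version() {
  local tool="$1"
  local path
  if ! path=$(command -v "$tool"); then
    echo "$tool: not found on PATH" >&2
    return 1
  fi
  echo "$tool resolved to: $path"
  # Most tools, pandoc included, print the version on the first line.
  "$tool" --version | head -n 1
}

# In the pipeline, before the conversion loop (commented out):
#   report_version pandoc
```

Logging this once per job makes it easy to match a given HTML output to the binary that produced it.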

And, I can reproduce this with the commonmark cli tool from commonmark-hs.
So, yes, this is an issue in commonmark-hs.
I'll open a new issue there.
https://github.com/jgm/commonmark-hs/issues/56

EDIT: you can work around this in your pipeline by stripping excess blank lines before passing to pandoc.
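
The workaround could look something like the sketch below. The thread does not spell out exactly which blank lines count as "excess"; the reproduction earlier has several trailing blank lines after the list, so this version squeezes runs of blank lines to one and drops blank lines at the start and end of the input. `strip_excess_blank_lines` is a hypothetical name, and note that it will also squeeze blank lines inside code blocks, which may not be acceptable for every document.

```shell
#!/usr/bin/env bash
# Squeeze runs of blank lines to a single blank line and drop leading and
# trailing blank lines. Reads stdin, writes stdout.
strip_excess_blank_lines() {
  awk '
    # Remember that at least one blank line was seen, but print nothing yet.
    /^[[:space:]]*$/ { pending = 1; next }
    {
      # Emit one blank line only between two non-blank lines.
      if (pending && seen) print ""
      pending = 0
      seen = 1
      print
    }
  '
}

# Example use with the pipeline shown earlier (commented out):
#   strip_excess_blank_lines < "$file" | pandoc -f gfm -t html -o "$_HTML"
```

Because blank lines are only emitted when another non-blank line follows, trailing blank lines never reach pandoc.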

Excellent, thank you for diagnosing and fixing so quickly, much appreciated.
