Pandoc: Reproducible Markdown to PDF conversion

Created on 16 Jul 2020 · 15 Comments · Source: jgm/pandoc

I see that running pandoc's gfm converter multiple times over the same input produces a different binary PDF every time. The PDF probably contains the date when the document was generated. However, I'd like to avoid generating a new PDF if the source hasn't changed.

The use case is that I maintain my docs in a repository, and I don't want to commit PDFs just because they were regenerated as part of a full rebuild.

LaTeX

abitrolly

Most helpful comment

This isn't really Pandoc's problem. Pandoc itself doesn't write PDFs at all; it outsources that to other engines, and virtually no PDF engines produce reproducible PDF builds. Fixing that would be up to the engines, not Pandoc.

I use Pandoc in workflows for a book publishing company and ran into this as well. I don't care about commits because I'm not keeping generated binary artifacts in source repositories (and you probably shouldn't either, there are other ways to release artifacts), but blindly regenerating them does take lots of CPU time. The solution is to use a build system that keeps track of what source files are used to generate what products and how to update them if something changes. My solution uses GNU Make for this, but there are lots of less esoteric build systems as well that accomplish the same thing.

alerque on 16 Jul 2020

👍3 ❤1
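As a rough illustration of the dependency tracking described above, here is a minimal shell sketch that rebuilds a PDF only when its source is newer than the existing output. It is not the Make setup mentioned in the comment: the helper names `needs_rebuild` and `build_if_stale` are hypothetical, the xelatex engine is an assumption, and the check compares timestamps only, which is also Make's default behaviour.

```shell
#!/bin/sh
# Sketch: rebuild a PDF only when its source is newer (timestamp-based).
# needs_rebuild and build_if_stale are hypothetical helper names.

# Succeeds (exit 0) when the target is missing or older than the source.
needs_rebuild() {
    src=$1
    target=$2
    if [ ! -e "$target" ]; then
        return 0
    fi
    # find prints the source path only if it is newer than the target.
    [ -n "$(find "$src" -newer "$target")" ]
}

# Runs pandoc only when needed; the engine choice is an assumption.
build_if_stale() {
    if needs_rebuild "$1" "$2"; then
        pandoc "$1" --pdf-engine xelatex -o "$2"
    else
        echo "$2 is up to date"
    fi
}
```

A call such as `build_if_stale docs/manual.md docs/manual.pdf` (hypothetical paths) would then skip the expensive PDF engine run whenever the source has not been touched since the last build.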

All 15 comments

I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.

If I am using --pdf-engine xelatex and the engine already supports reproducible builds, is there a way to pass the required parameters to it?

I am looking at https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va right now. The SOURCE_DATE_EPOCH variable mentioned there may do the trick.

abitrolly on 16 Jul 2020

❤1

One last tip: while PDF files are usually not binary reproducible, they often _are_ visually deterministic. You can use a PDF diff tool such as diff-pdf to check whether anything changed visually between PDF rendering passes. Armed with that knowledge, you can choose to discard the newer one as unchanged if you like. This is not quite as good as knowing when you need to rebuild in the first place, since it wastes the time spent building, but it is still a useful trick.

alerque on 16 Jul 2020
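The discard-if-unchanged idea could be sketched roughly as follows. The helper name `keep_if_visually_same` and the `PDF_COMPARE` override are hypothetical; the sketch relies on diff-pdf exiting with status 0 when it finds no differences between the two files, so check that against the version you have installed.

```shell
#!/bin/sh
# Sketch: keep the existing PDF when a fresh build looks identical.
# keep_if_visually_same is a hypothetical helper. PDF_COMPARE defaults to
# diff-pdf, assumed to exit 0 when the two PDFs show no differences.
keep_if_visually_same() {
    old=$1
    new=$2
    if ${PDF_COMPARE:-diff-pdf} "$old" "$new" >/dev/null 2>&1; then
        rm -f "$new"           # unchanged: discard the fresh build
        echo "unchanged"
    else
        mv -f "$new" "$old"    # changed: the fresh build replaces the old one
        echo "updated"
    fi
}
```

With this in place, a rebuild script would write the new PDF to a temporary name and call `keep_if_visually_same doc.pdf doc.new.pdf`, so the tracked file only changes when the rendering does.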

> I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.

You don't need "a stateful build system" for this (at least not in the sense you mean), but you also aren't working totally from scratch, because you have the _last generated_ artifact plus the current sources. Since you mentioned these being part of a VCS, you also have the history and can determine whether any of the sources have been updated more recently than the PDF build.

alerque on 16 Jul 2020

The diffoscope tool (https://diffoscope.org/) shows that only the date is different. Going to try SOURCE_DATE_EPOCH.

$ podman run --rm -t -w "$PWD" -v "$PWD:$PWD:Z,ro" \
      registry.salsa.debian.org/reproducible-builds/diffoscope 001cv.pdf anatoli.cv.pdf --text-color=always
--- 001cv.pdf
+++ anatoli.cv.pdf
│   --- 001cv.pdf
├── +++ anatoli.cv.pdf
│┄ Document info
│ @@ -1,3 +1,3 @@
│ -CreationDate: "D:20200716112355+03'00'"
│ +CreationDate: "D:20200716120342+03'00'"
│  Creator: 'LaTeX via pandoc'
│  Producer: 'xdvipdfmx (20190225)'

abitrolly on 16 Jul 2020

@alerque repository timestamps don't prove that the PDF was generated from the sources that were committed. Only build state can determine that, such as the file hashes that SCons records.

abitrolly on 16 Jul 2020
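The hash-based build state mentioned above can be approximated without SCons. This is a minimal sketch, assuming GNU coreutils `sha256sum` is available; `record_sources` and `sources_unchanged` are hypothetical helper names, and the manifest would have to be committed alongside the PDF for a from-scratch CI run to use it.

```shell
#!/bin/sh
# Sketch: record source hashes when the PDF is built and verify them later.
# record_sources and sources_unchanged are hypothetical helpers.

# Writes a checksum manifest for the given source files.
# Usage: record_sources MANIFEST SOURCE...
record_sources() {
    manifest=$1
    shift
    sha256sum "$@" > "$manifest"
}

# Succeeds only when every file listed in the manifest still has the same hash.
sources_unchanged() {
    sha256sum -c "$1" >/dev/null 2>&1
}
```

A CI job could then fail when `sources_unchanged doc.sha256` returns non-zero, which signals that the committed PDF was built from different sources than the ones in the commit.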

> it outsources that to other engines and virtually no PDF engines produce reproducible PDF builds

FYI, I know that ReportLab, for one, provides this as an option. You can see the option in the code here:

https://github.com/MrBitBucket/reportlab-mirror/blob/67281aea11a81a7768c386d353334e328840b129/src/reportlab/rl_settings.py#L83

cjerdonek on 16 Jul 2020

👍2

Setting SOURCE_DATE_EPOCH helped for xelatex.

export SOURCE_DATE_EPOCH=2461633620

But it is not enough.

--- 001cv.pdf
+++ anatoli.cv.pdf
├── dumppdf -adt {}
│ @@ -357,16 +357,16 @@
│  <value><literal>XRef</literal></value>
│  <key>Root</key>
│  <value><ref id="1" /></value>
│  <key>Info</key>
│  <value><ref id="2" /></value>
│  <key>ID</key>
│  <value><list size="2">
│ -<string size="16">&#158;$&#240;}&#135;F&#239;sV]&#9;&#209;2&#19;&#183;&#169;</string>
│ -<string size="16">&#158;$&#240;}&#135;F&#239;sV]&#9;&#209;2&#19;&#183;&#169;</string>
│ +<string size="16">&#39;&#1;M%&#243;&#244;&#218;&#231;&#195;1&#161;&#212;&#245;&#12;&#31;&#166;</string>
│ +<string size="16">&#39;&#1;M%&#243;&#244;&#218;&#231;&#195;1&#161;&#212;&#245;&#12;&#31;&#166;</string>
│  </list></value>
...

abitrolly on 16 Jul 2020
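To find out whether a given combination of settings is enough, one option is to run the same build twice and compare the results byte for byte. The sketch below assumes the build command accepts the output path as its last argument; `is_reproducible` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: run a build command twice and compare the outputs byte for byte.
# is_reproducible is a hypothetical helper. The build command is called with
# the output path appended as its last argument.
is_reproducible() {
    dir=$(mktemp -d)
    if "$@" "$dir/first.pdf" && "$@" "$dir/second.pdf" &&
        cmp -s "$dir/first.pdf" "$dir/second.pdf"; then
        status=0
    else
        status=1
    fi
    rm -rf "$dir"
    return "$status"
}
```

For example, after defining a wrapper such as `build() { SOURCE_DATE_EPOCH=2461633620 pandoc doc.md --pdf-engine xelatex -o "$1"; }` (with `doc.md` as a placeholder input), `is_reproducible build` reports through its exit status whether two consecutive runs produced identical files.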

> is there a way to pass the required parameters to it?

Yes, see https://pandoc.org/MANUAL.html#option--pdf-engine-opt

mb21 on 16 Jul 2020

👍1

Hey @mb21 if I am reading that last diff posted by @abitrolly correctly then my initial comment was wrong and this is at least partially Pandoc's problem. That looks like cross reference IDs are changing between successive runs. If that's actually the case (and the testing isn't flawed) that's something that should be fixed. XRef content should be deterministic. I'd keep this open at least until it's determined whether Pandoc's output is deterministic. How to make PDF engines follow suit is another story of course.

alerque on 16 Jul 2020

❤1

I managed to do this. It is not very user friendly, because it relies on the shell, needs an external file, and makes the file specific to the engine used. There is no command-line option that wraps these steps.

The working recipe for the xelatex engine is the following.

[x] Set the environment variable SOURCE_DATE_EPOCH to some fixed value in build scripts, like SOURCE_DATE_EPOCH=2461633620
[x] Create a separate tex file with the xelatex option for reproducible builds (I named mine anatoli.head.tex)

\special {pdf:trailerid [
    <00112233445566778899aabbccddeeff>
    <00112233445566778899aabbccddeeff>
]}

[x] Reference the variable, the engine, and the file on the pandoc command line. Mine is a bash-specific command.

SOURCE_DATE_EPOCH=2461633620 pandoc \
  --from markdown_github+yaml_metadata_block \
  --pdf-engine xelatex \
  --include-in-header anatoli.head.tex \
  anatoli.cover.md -o anatoli.cover.pdf

abitrolly on 16 Jul 2020

👍3
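The three steps of the recipe could be wrapped in a small script so that the header file does not have to be maintained by hand. This is only a sketch for the xelatex engine: `write_trailer_header` and `reproducible_pdf` are hypothetical names, the default epoch and trailer ID are the fixed example values from the recipe, and the `--from` option is left out, so add it back if your input needs it.

```shell
#!/bin/sh
# Sketch: wrap the xelatex recipe in two functions.
# write_trailer_header and reproducible_pdf are hypothetical names.

# Writes the header snippet that pins the PDF trailer ID.
# Usage: write_trailer_header FILE [HEX_ID]
write_trailer_header() {
    header=$1
    id=${2:-00112233445566778899aabbccddeeff}
    printf '\\special{pdf:trailerid [\n    <%s>\n    <%s>\n]}\n' "$id" "$id" > "$header"
}

# Builds INPUT into OUTPUT with a fixed date and a fixed trailer ID.
# Usage: reproducible_pdf INPUT OUTPUT
reproducible_pdf() {
    input=$1
    output=$2
    header=$(mktemp)
    write_trailer_header "$header"
    if SOURCE_DATE_EPOCH=${SOURCE_DATE_EPOCH:-2461633620} pandoc \
        --pdf-engine xelatex \
        --include-in-header "$header" \
        "$input" -o "$output"; then
        status=0
    else
        status=1
    fi
    rm -f "$header"
    return "$status"
}
```

Calling `reproducible_pdf anatoli.cover.md anatoli.cover.pdf` would then perform all three steps in one go, and setting SOURCE_DATE_EPOCH in the environment beforehand overrides the default date.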

So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?

I suppose Pandoc could be extended with a flag that then translates all the things that need to happen and passes various engines their respective arguments.

> makes the file specific to the engine used

Of course. The internals of the PDF file format are such that you're never going to get different PDF engines to match how they actually put the file together. Any reproducibility will only hold on the same engine (and likely only with a lot of other factors held constant as well, such as the same versions of system libraries like the text shaper, the same versions of font files, the same versions of templates or classes, and so on).

alerque on 16 Jul 2020

> So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?

Only for xelatex as described in https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va/313605#313605

abitrolly on 16 Jul 2020

It's great to have these instructions and perhaps we should include them in the FAQ on the website. I'm not sure how pandoc could be modified to make this easier, though.
I suppose we could add a variable pdf_trailer_id which, if set, adds the relevant bits in the default latex template. Maybe also pdf_creation_date and pdf_modification_date? But this only works for xelatex?

jgm on 16 Jul 2020

I tested only with xelatex. The TeX Stack Exchange answer lists other engines as well. I would expect a flag like --reproducible, ideally with some way to get debug output describing what pandoc does.

abitrolly on 16 Jul 2020
