I see that running the pandoc gfm converter multiple times over the same input produces a new binary PDF every time. The PDF probably contains the date when the document was generated. However, I'd like to avoid generating a new PDF if the source doesn't change.
The use case is that I maintain my docs in a repository, and I don't want to commit PDFs just because they were regenerated as part of a full rebuild.
This isn't really Pandoc's problem. Pandoc itself doesn't write PDFs at all, it outsources that to other engines and virtually no PDF engines produce reproducible PDF builds. Fixing that would be up to the engines, not Pandoc.
I use Pandoc in workflows for a book publishing company and ran into this as well. I don't care about commits because I'm not keeping generated binary artifacts in source repositories (and you probably shouldn't either, there are other ways to release artifacts), but blindly regenerating them does take lots of CPU time. The solution is to use a build system that keeps track of what source files are used to generate what products and how to update them if something changes. My solution uses GNU Make for this, but there are lots of less esoteric build systems as well that accomplish the same thing.
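As a rough illustration of what such a build system does, here is a minimal shell sketch of the timestamp rule Make applies: rebuild only when the output is missing or older than its source. The function names `is_stale` and `rebuild_if_stale` are hypothetical, and the pandoc invocation is only an assumed example. Note that this relies on file modification times, which a fresh CI checkout does not preserve.

```shell
# is_stale SRC OUT: succeed when OUT is missing or older than SRC.
is_stale() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# rebuild_if_stale SRC OUT: run pandoc only when the output is stale.
# The pandoc arguments are an assumed example, not a required set.
rebuild_if_stale() {
    if is_stale "$1" "$2"; then
        pandoc "$1" --pdf-engine xelatex -o "$2"
    else
        echo "up to date: $2"
    fi
}
```

Usage would look like `rebuild_if_stale docs/guide.md docs/guide.pdf`, with the paths being placeholders.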
I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.
If I am using --pdf-engine xelatex and the engine already supports reproducible builds, is there a way to pass the required parameters to it?
I am looking at https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va right now. SOURCE_DATE_EPOCH may do the trick.
One last tip: while PDF files are usually not binary reproducible, they often _are_ visually deterministic. You can use a PDF diff tool such as diff-pdf to check whether anything visually changed between PDF rendering passes. Armed with that knowledge you can choose to discard the newer one as unchanged if you like. Not quite as good as knowing when you need to rebuild it in the first place, since it wastes the time building it, but still a useful trick.
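A minimal sketch of that trick, assuming diff-pdf is installed and relying on its documented behaviour of exiting with status 0 when the two PDFs show no differences. The function name `keep_if_changed` and the `PDF_DIFF` override variable are hypothetical, introduced only for this example.

```shell
# Comparison tool; diff-pdf exits 0 when it finds no differences.
# PDF_DIFF is a hypothetical override, handy for swapping in another tool.
PDF_DIFF=${PDF_DIFF:-diff-pdf}

# keep_if_changed OLD NEW: discard NEW when it matches OLD,
# otherwise replace OLD with NEW.
# If the comparison tool is missing or fails, NEW is kept as the safe default.
keep_if_changed() {
    if [ -e "$1" ] && "$PDF_DIFF" "$1" "$2" >/dev/null 2>&1; then
        rm -f -- "$2"
        echo "unchanged: $1"
    else
        mv -f -- "$2" "$1"
        echo "updated: $1"
    fi
}
```

For example, after rendering to a temporary file: `keep_if_changed anatoli.cv.pdf /tmp/anatoli.cv.new.pdf` (the temporary path is a placeholder).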
> I don't want to introduce a stateful build system, because I want to run a CI pipeline that checks that all version-controlled PDFs are up to date. The CI does everything from scratch.
You don't need "a stateful build system" for this (at least not in the sense you mean), but you also aren't working totally from scratch, because you have the _last generated_ artifact plus the current sources. Since you mentioned these being part of a VCS, you also have the history and can determine whether any of the sources have been updated more recently than the PDF build.
diffoscope (https://diffoscope.org/) shows that only the creation date is different. Going to try that.
$ podman run --rm -t -w "$(pwd)" -v "$(pwd):$(pwd):Z,ro" \
    registry.salsa.debian.org/reproducible-builds/diffoscope 001cv.pdf anatoli.cv.pdf --text-color=always
--- 001cv.pdf
+++ anatoli.cv.pdf
│   --- 001cv.pdf
├── +++ anatoli.cv.pdf
│┄ Document info
│ @@ -1,3 +1,3 @@
│ -CreationDate: "D:20200716112355+03'00'"
│ +CreationDate: "D:20200716120342+03'00'"
│ Creator: 'LaTeX via pandoc'
│ Producer: 'xdvipdfmx (20190225)'
@alerque repository timestamps don't prove that a PDF was generated from the sources that were committed. Only build state can determine that, such as the file hashes SCons keeps.
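To make the "file hashes" idea concrete, here is a minimal sketch that records the source hash in a sidecar file next to the PDF and lets CI compare it later. This is not how SCons works internally. The function names and the `.src.sha256` suffix are hypothetical, and the sketch assumes `sha256sum` from GNU coreutils. It only covers the one source file it is given, not templates, fonts, or engine versions.

```shell
# record_source_hash SRC PDF: store the SHA-256 of SRC next to PDF.
# The ".src.sha256" sidecar name is a hypothetical convention.
record_source_hash() {
    sha256sum < "$1" | cut -d' ' -f1 > "$2.src.sha256"
}

# pdf_matches_source SRC PDF: succeed only when PDF exists and the
# recorded hash equals the current hash of SRC.
pdf_matches_source() {
    [ -e "$2" ] && [ -e "$2.src.sha256" ] || return 1
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$(cat "$2.src.sha256")" ]
}
```

The build step would call `record_source_hash` after writing the PDF and commit both files. A from-scratch CI job can then run `pdf_matches_source` without rebuilding anything.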
> it outsources that to other engines and virtually no PDF engines produce reproducible PDF builds
FYI, I know that ReportLab, for one, provides this as an option. You can see the option in the code here:
Setting SOURCE_DATE_EPOCH helped for xelatex.
export SOURCE_DATE_EPOCH=2461633620
But it is not enough.
--- 001cv.pdf
+++ anatoli.cv.pdf
├── dumppdf -adt {}
│ @@ -357,16 +357,16 @@
│ <value><literal>XRef</literal></value>
│ <key>Root</key>
│ <value><ref id="1" /></value>
│ <key>Info</key>
│ <value><ref id="2" /></value>
│ <key>ID</key>
│ <value><list size="2">
│ -<string size="16">ž$ð}‡FïsV]	Ñ2·©</string>
│ -<string size="16">ž$ð}‡FïsV]	Ñ2·©</string>
│ +<string size="16">'M%óôÚçÃ1¡Ôõ¦</string>
│ +<string size="16">'M%óôÚçÃ1¡Ôõ¦</string>
│ </list></value>
...
> is there a way to pass the required parameters into it?
Yes, see https://pandoc.org/MANUAL.html#option--pdf-engine-opt
Hey @mb21 if I am reading that last diff posted by @abitrolly correctly then my initial comment was wrong and this is at least partially Pandoc's problem. That looks like cross reference IDs are changing between successive runs. If that's actually the case (and the testing isn't flawed) that's something that should be fixed. XRef content should be deterministic. I'd keep this open at least until it's determined whether Pandoc's output is deterministic. How to make PDF engines follow suit is another story of course.
I managed to do this. It is not very user friendly, because it relies on the shell, needs an external file, and makes the file specific to the engine used. There is no command line option to easily wrap these things.
The working recipe for the xelatex engine is the following.
1. Set SOURCE_DATE_EPOCH to some fixed value in build scripts, like SOURCE_DATE_EPOCH=2461633620.
2. Create a header file with the xelatex option for reproducible builds (I named mine anatoli.head.tex):
    \special{pdf:trailerid [
        <00112233445566778899aabbccddeeff>
        <00112233445566778899aabbccddeeff>
    ]}
3. Add the header file to the pandoc command line. Mine is a bash-specific command:
    SOURCE_DATE_EPOCH=2461633620 pandoc \
        --from markdown_github+yaml_metadata_block \
        --pdf-engine xelatex \
        --include-in-header anatoli.head.tex \
        anatoli.cover.md -o anatoli.cover.pdf
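The recipe can be checked mechanically by building twice and comparing the results byte for byte. The sketch below wraps the command from the recipe in a function and adds a generic checker. The names `build_cover_pdf` and `is_reproducible` are hypothetical, and the sketch assumes pandoc, xelatex, and the files named in the recipe are present when the build function is actually called.

```shell
# build_cover_pdf OUT: the recipe's pandoc command, writing to OUT.
build_cover_pdf() {
    SOURCE_DATE_EPOCH=2461633620 pandoc \
        --from markdown_github+yaml_metadata_block \
        --pdf-engine xelatex \
        --include-in-header anatoli.head.tex \
        anatoli.cover.md -o "$1"
}

# is_reproducible BUILD_FN: call BUILD_FN twice with different output
# paths and compare the two files byte for byte.
# Returns 0 if identical, 1 if they differ, 2 if a build failed.
is_reproducible() {
    repro_tmp=$(mktemp -d) || return 2
    if "$1" "$repro_tmp/first.pdf" && "$1" "$repro_tmp/second.pdf"; then
        cmp -s "$repro_tmp/first.pdf" "$repro_tmp/second.pdf"
        repro_status=$?
    else
        repro_status=2
    fi
    rm -rf -- "$repro_tmp"
    return "$repro_status"
}
```

In a from-scratch CI job, `is_reproducible build_cover_pdf` confirms the build is deterministic, and `build_cover_pdf /tmp/fresh.pdf && cmp -s /tmp/fresh.pdf anatoli.cover.pdf` checks that the committed PDF matches the current sources (the temporary path is a placeholder).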
So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?
I suppose Pandoc could be extended with a flag that then translates all the things that need to happen and passes various engines their respective arguments.
> makes the file specific to the used engine
Of course. The internals of the PDF file format are such that you're never going to get different PDF engines to match how they actually put the file together. Any reproducibility will only hold on the same engine (and likely only with a lot of other factors held constant as well, such as the same versions of system libraries like the text shaper, the same versions of font files, the same versions of templates or classes, and so on).
> So the \special{pdf:trailerid [ ... ]} thing solved the XRef related string changing?
Only for xelatex as described in https://tex.stackexchange.com/questions/229605/reproducible-latex-builds-compile-to-a-file-which-always-hashes-to-the-same-va/313605#313605
It's great to have these instructions and perhaps we should include them in the FAQ on the website. I'm not sure how pandoc could be modified to make this easier, though.
I suppose we could add a variable pdf_trailer_id which, if set, adds the relevant bits in the default latex template. Maybe also pdf_creation_date and pdf_modification_date? But this only works for xelatex?
I tested only with xelatex. The TeX Stack Exchange answer lists other engines as well. I would expect a flag like --reproducible, which could ideally come with some way to get debug output describing what pandoc does.