Pandoc: base-64-encoded images are converted wrongly from HTML to LaTeX

Created on 3 May 2017 · 9 Comments · Source: jgm/pandoc

Steps to reproduce:

  1. save this page with Firefox: < removed>
  2. call command pandoc -f html -o result.pdf --latex-engine=xelatex "< saved page>.html"
  3. open result.pdf with Adobe Reader DC and scroll down a little bit

Adobe Reader DC reports an error in the document and tells you to report it to the author of the file. This happens only with the --latex-engine=xelatex option and

pandoc 1.19.2.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, skylighting 0.1.1.4
LaTeX writer

All 9 comments

Please be more specific in your bug reports, or ask on the pandoc-discuss mailing list if you're unsure. What's the error Adobe Reader reports? Which minimal HTML snippet causes it? Probably it's something XeLaTeX needs to fix and nothing pandoc can do anything about.

The mailing list is problematic for me: I don't have a Google account, and in many places I'm not permitted to disclose information to Google or use specific Google services because of privacy standards over here and other reasons. So using it is hard for me and not really practicable.

I will try to find a minimal example. The Adobe Reader error is a general one and not of any help. It just says the document is broken and that I should get in touch with the document's author. I tried to reproduce this issue on a Linux setup, where I have an older version of LaTeX and a different PDF viewer. No problem was reported there. So it is either a pandoc-to-LaTeX problem, an Adobe Reader bug, or a bug in the LaTeX engine.

More details will come.

Since it is XeLaTeX that produces the document, it is either its problem or Acrobat Reader's problem.

I think I found the problem @mb21 and @wilx. The page mentioned above uses <img src="data:image/png;base64,[...]"> tags. They are converted to this LaTeX:

\includegraphics{data:image/png;base64,[...]}

The LaTeX compiler complains that the image file is missing.

I think this will help:
https://tex.stackexchange.com/questions/208819/embedding-images-in-tex-file-as-base64-strings

Furthermore, it looks like there is a problem with .png files, as they are reported as having an unknown file extension.
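
To narrow a saved page down to a minimal example, it can help to check first whether it contains data-URI images at all. The following is a minimal sketch using only the Python standard library; the class and function names (`DataUriImageFinder`, `find_data_uri_images`) are made up for illustration and are not part of pandoc.

```python
from html.parser import HTMLParser


class DataUriImageFinder(HTMLParser):
    """Collect the MIME type of every <img> whose src is a data URI."""

    def __init__(self):
        super().__init__()
        self.mime_types = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # Attribute values can be None (valueless attributes), hence the "or".
        src = dict(attrs).get("src") or ""
        if src.startswith("data:"):
            # The part before the first "," is the header, e.g. "image/png;base64".
            header = src[len("data:"):].split(",", 1)[0]
            self.mime_types.append(header.split(";", 1)[0])


def find_data_uri_images(html_text):
    """Return the MIME types of all data-URI images found in html_text."""
    finder = DataUriImageFinder()
    finder.feed(html_text)
    finder.close()
    return finder.mime_types
```

If this returns a non-empty list for the saved page, a single `<img>` tag with a data URI is a good candidate for the minimal HTML snippet.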

I think #2289 would help with this.

@jgm should I create a new issue for "Furthermore it looks like there is a problem with .png files as they are reported as unknown file extension."? This sounds unrelated to the base-64-encoded images problem.

@GiantCrocodile If you can reproduce the problem about .png images using a regular file (rather than a data uri), then yes, feel free to open an issue.
I just tried with a png image and had no problems.

Now that #2289 is fixed, you should be able to use --extract-media=dir when converting from HTML to LaTeX, and you'll get a file in dir with the contents of the data uri, plus a reference to that file in the LaTeX.
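
As a rough illustration of the idea described above (write the data URI's contents to a file, then reference that file), here is a minimal Python sketch. It is not pandoc's implementation: the function name `extract_data_uri`, the SHA-1 based file name, and guessing the extension from the MIME subtype are all assumptions made for this example.

```python
import base64
import hashlib
import os


def extract_data_uri(data_uri, out_dir):
    """Decode a base64 data URI, write it into out_dir, and return the file path."""
    if not data_uri.startswith("data:"):
        raise ValueError("not a data URI")
    # Split "image/png;base64,<payload>" into header and payload.
    header, _, payload = data_uri[len("data:"):].partition(",")
    parts = header.split(";")
    if "base64" not in parts[1:]:
        raise ValueError("only base64 data URIs are handled in this sketch")
    mime = parts[0] or "application/octet-stream"
    # Assumed naming scheme: content hash plus the MIME subtype as extension.
    extension = mime.split("/")[-1]
    raw = base64.b64decode(payload)
    name = hashlib.sha1(raw).hexdigest() + "." + extension
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "wb") as handle:
        handle.write(raw)
    return path
```

The returned path is what would then go inside `\includegraphics{...}` in place of the data URI.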

Of course, if you don't use --extract-media=dir you'll still have a problem. I think the one remaining improvement we could make would be to have the LaTeX writer suppress images with data URIs and issue a warning, with advice to use --extract-media.

I tried it. When I compile it to .pdf directly I get a working file, and if I use --extract-media="<some path>" it works too. Thanks for fixing this!
