Pandoc: Pandoc (Markdown to HTML) converts some character entities to UTF-8

Created on 3 May 2013  ·  2 Comments  ·  Source: jgm/pandoc

I have a Markdown document containing the HTML character entity &rarr;. When I convert it to HTML using pandoc -o myfile.html myfile.md, the entity is converted to a UTF-8 encoded right arrow character, which my browser displays as an ugly jumble (â†’). Other character entities such as &amp;, on the other hand, are preserved correctly as inline HTML.

A workaround is to include the tag
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
at the beginning of my Markdown document, but that seems a little inelegant, as I can't assume that every Markdown converter will produce UTF-8 encoded output. IMHO, pandoc should either consistently preserve HTML character entities or properly announce the UTF-8 encoding in the HTML output.
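
The jumble is consistent with UTF-8 bytes being decoded under a legacy single-byte encoding. Here is a minimal Python sketch of that effect. It assumes the browser falls back to Windows-1252 when no charset is declared; the actual fallback depends on the browser and locale.

```python
# Assumption: the browser falls back to Windows-1252 when the page
# declares no charset; the real fallback varies by browser and locale.
arrow = "\u2192"                        # RIGHTWARDS ARROW, the character &rarr; names
utf8_bytes = arrow.encode("utf-8")      # the three bytes written for that character
misread = utf8_bytes.decode("cp1252")   # the same bytes read as Windows-1252

print(utf8_bytes)                  # the raw UTF-8 byte sequence
print(misread)                     # three unrelated characters instead of one arrow
print(utf8_bytes.decode("utf-8"))  # the arrow again, once the charset is known
```

Declaring the charset, as the meta tag does, tells the browser to take the last path rather than the second.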

I'm using pandoc on Windows:

$ pandoc -v
pandoc 1.11.1
Compiled with citeproc-hs 0.3.8, texmath 0.6.1.3, highlighting-kate 0.5.3.8
...

All 2 comments

Pandoc converts all entities to Unicode characters because it needs to handle output formats other than HTML.

If you use the -s flag to create a standalone document, pandoc will apply its default template, which includes the meta tag specifying UTF-8.

Another option is to use the --ascii flag, which will cause &rarr; to be output as &#8594; (the equivalent numeric character reference).
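
The conversion described in this comment can be mimicked outside pandoc. The sketch below uses Python's standard library, not pandoc's own code, to show a named entity resolving to a Unicode character and then being rewritten as a decimal numeric character reference, the form that --ascii is said to produce. The sample string is made up for illustration.

```python
import html

# Illustration only: this mimics the described behaviour with Python's
# standard library; it is not how pandoc is implemented.
source = "A &rarr; B"                 # hypothetical sample input
decoded = html.unescape(source)       # named entity becomes the Unicode character

# Replace every non-ASCII character with a decimal numeric character reference.
ascii_safe = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(ascii_safe)  # A &#8594; B
```

With pandoc itself, the two options from the comment correspond to pandoc -s -o myfile.html myfile.md and pandoc --ascii -o myfile.html myfile.md.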

Thanks a lot!
