Pandoc: Unicode characters such as α are not properly converted from Word (.docx) to LaTeX (.tex)

Created on 24 Oct 2016 · 8 Comments · Source: jgm/pandoc

Let's say you have a Word .docx file in which you type a non-Latin character, such as α or β, _not_ as an equation but as a normal character. If this .docx is then converted to a LaTeX .tex file, compiling the .tex file gives an error: the α is written into the .tex file as a literal α, which the engine cannot read, rather than as \alpha, which would avoid the problem. Is this a bug?

cleverjackal

Most helpful comment

I am working on a project that has a lot of math in .docx source files, and I use pandoc to convert them to markdown with LaTeX math. I wrote a script to automate most of the conversion process, including the handling of, say, Greek letters: markdown-variants/unicode-to-math.sh at master · ickc/markdown-variants.

The script is not perfect, and probably never will be. But it gives a good starting point for the .docx sources I need to deal with.

If the only thing you need to deal with is Greek letters, you can extract the part of my script that handles the Greek-to-LaTeX (say, \alpha, \beta) conversion.
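For illustration only, here is a minimal Python sketch of that kind of Greek-to-LaTeX substitution. It is not taken from the unicode-to-math.sh script; the `GREEK_TO_LATEX` table and the `greek_to_latex_math` function are hypothetical names, the table covers only a handful of letters, and it assumes every matched letter sits in ordinary text (not already inside math mode) and really is meant as math.

```python
import re

# Hypothetical, deliberately incomplete mapping; extend as needed.
GREEK_TO_LATEX = {
    "α": r"\alpha",
    "β": r"\beta",
    "γ": r"\gamma",
    "δ": r"\delta",
    "π": r"\pi",
    "Δ": r"\Delta",
    "Ω": r"\Omega",
}

# One alternation pattern that matches any key of the mapping.
_GREEK_RE = re.compile("|".join(map(re.escape, GREEK_TO_LATEX)))


def greek_to_latex_math(text):
    """Replace each mapped Greek letter with its macro wrapped in inline math."""
    return _GREEK_RE.sub(lambda m: "$" + GREEK_TO_LATEX[m.group(0)] + "$", text)


if __name__ == "__main__":
    print(greek_to_latex_math("The α and β phases"))
    # prints: The $\alpha$ and $\beta$ phases
```

Because it wraps every match in `$...$`, this sketch would produce broken output for letters that are already inside an equation, and it would also rewrite genuine Greek prose, so it only suits documents where stray Greek letters are known to be math.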

Lastly, if you have control over which engine is used, --latex-engine=xelatex would solve the issue. There's a LaTeX package, unicode-math, that allows Unicode characters (such as Greek letters) in math mode, and I think pandoc uses it too. Visually, the output before and after applying my script is quite close when using xelatex.

ickc on 24 Oct 2016

All 8 comments

Does using --latex-engine=xelatex help?

wilx on 24 Oct 2016

Note that this isn't specific to conversion from .docx: you'd have the same issue converting from any format with a literal α in it.

jkr on 24 Oct 2016

@ickc That's really neat! Thanks for all the information :)

cleverjackal on 24 Oct 2016

Should I close this? I'm not too sure whether it's a real issue or just something which is purposely not implemented.

cleverjackal on 24 Oct 2016

Yes, you can go ahead and close it.

jgm on 25 Oct 2016

It's not a bug. Pandoc produces LaTeX with UTF-8 encoding; this can include characters like a Greek alpha.

As noted by others, standard pdflatex will typically not handle these characters; you need to use xelatex, and you need to make sure the font you're using contains the necessary glyphs.
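
To find out which characters in a generated .tex file might trip up pdflatex, a small scan for non-ASCII characters can help. The following is only a sketch: `non_ascii_report` and `describe` are hypothetical helper names, and the script merely lists candidates; it cannot tell whether a given engine or font will actually handle them.

```python
import unicodedata
from collections import Counter


def non_ascii_report(tex_source):
    """Count each non-ASCII character that occurs in the LaTeX source."""
    return dict(Counter(ch for ch in tex_source if ord(ch) > 127))


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"


if __name__ == "__main__":
    sample = "An α-helix and a β-sheet, then α again."
    for ch, count in non_ascii_report(sample).items():
        print(f"{ch}  {describe(ch)}  x{count}")
```

To check a real file, read it with `open(path, encoding="utf-8").read()` and pass the result to `non_ascii_report`.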

Pandoc will convert Word equations to LaTeX equations; I see that it's a problem if the source document has some math outside of equations, but I don't see much that pandoc can do about this. One wouldn't want to assume, for example, that any alpha was math. (Especially considering that some people write in Greek, and others write about Greek!)

jgm on 25 Oct 2016

Thanks a lot! I really appreciate the help and the nice explanation :)

cleverjackal on 25 Oct 2016
