Let's say that you have a Word .docx file and in it you type a non-Latin character, such as α or β, _not_ as an equation but as a normal character. If this .docx is then converted to a LaTeX .tex file, compiling the .tex file fails with an error, because the α is inserted into the .tex file as a literal α, which the engine cannot read, rather than as \alpha, which would avoid the problem. Is this a bug?
Does using --latex-engine=xelatex help?
I am working on a project that has a lot of math in .docx source files, and I use pandoc to convert them to markdown with LaTeX math. I wrote a script to automate most of the conversion, including the handling of, say, Greek letters: markdown-variants/unicode-to-math.sh at master · ickc/markdown-variants.
The script is not perfect, and will probably never be. But it gives a good starting point from the .docx source I need to deal with.
If the only thing you need to deal with is Greek letters, you can extract the part of my script that handles the Greek to LaTeX (say, \alpha, \beta) conversion.
Lastly, if you have control over which engine to use, --latex-engine=xelatex would solve the issue. There's a LaTeX package, unicode-math, that allows Unicode characters (such as Greek letters) in math mode, and I think pandoc uses it too. Visually, the output before and after applying my script is actually quite close when using xelatex.
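For the Greek-only case mentioned above, here is a minimal sketch of the idea in Python. It is not taken from unicode-to-math.sh; the table `GREEK_TO_LATEX` and the function `greek_to_latex` are hypothetical names, the table covers only a handful of letters, and it assumes every mapped character sits in plain text outside math mode, so each one is wrapped in `$...$`.

```python
# Hypothetical, deliberately incomplete mapping; extend it as needed.
GREEK_TO_LATEX = {
    "α": r"\alpha",
    "β": r"\beta",
    "γ": r"\gamma",
    "δ": r"\delta",
    "π": r"\pi",
}


def greek_to_latex(text):
    # Replace each mapped character with its macro wrapped in inline math.
    # Assumption: the character is NOT already inside $...$; if it is,
    # this would produce nested dollar signs and break the math.
    return "".join(
        f"${GREEK_TO_LATEX[ch]}$" if ch in GREEK_TO_LATEX else ch
        for ch in text
    )


if __name__ == "__main__":
    print(greek_to_latex("the angles α and β"))
```

Because it replaces every mapped character it sees, a sketch like this would also mangle genuine Greek prose, so it only makes sense for documents where Greek letters are always math symbols.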
Note that this isn't specific to conversion from .docx: you'd have the same issue converting from any format with a literal α in it.
@ickc That's really neat! Thanks for all the information :)
Should I close this? I'm not too sure whether it's a real issue or just something that is purposely not implemented.
Yes, you can go ahead and close it.
It's not a bug. Pandoc produces LaTeX with UTF-8 encoding;
this can include characters like a Greek alpha.
As noted by others, standard pdflatex will typically not
handle these characters; you need to use xelatex, and you
need to make sure the font you're using contains the
necessary glyphs.
Pandoc will convert Word equations to LaTeX equations; I
see that it's a problem if the source document has some math
outside of equations, but I don't see much that pandoc can
do about this. One wouldn't want to assume, for example,
that any alpha was math. (Especially considering that
some people write in Greek, and others write about Greek!)
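To illustrate why guessing is hard, here is a rough Python sketch of one heuristic a user-side script might try: flag only Greek letters that stand alone, with no Greek letter on either side. This is not something pandoc does; the names `ISOLATED_GREEK` and `isolated_greek_letters` are hypothetical, and the character ranges (the Greek and Coptic block plus Greek Extended) are an assumption about what counts as a Greek letter.

```python
import re

# Assumed ranges: Greek and Coptic (U+0370-U+03FF), Greek Extended (U+1F00-U+1FFF).
GREEK = "\u0370-\u03FF\u1F00-\u1FFF"

# One Greek character with no Greek character directly before or after it.
ISOLATED_GREEK = re.compile(rf"(?<![{GREEK}])[{GREEK}](?![{GREEK}])")


def isolated_greek_letters(text):
    # Return the Greek characters that are not part of a longer Greek run.
    return ISOLATED_GREEK.findall(text)


if __name__ == "__main__":
    # The lone α is flagged; the letters inside the word αλφα are not.
    print(isolated_greek_letters("angle α, spelled αλφα"))
```

Even this misfires: single-letter Greek words would be flagged as math, and a product such as αβ written without a space would be skipped as if it were a word, which is the ambiguity described above.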
Thanks a lot! I really appreciate the help and the nice explanation :)