Pandoc: Created ODT documents can contain invalid characters

Created on 2 Dec 2018  路  11Comments  路  Source: jgm/pandoc

This is the code that made pandoc create an invalid ODT document.

MySQL

root@kali:~# hydra -e nsr -l root -P /usr/share/wordlists/metasploit/unix_passwords.txt mysql://192.168.1.10/ -I
Hydra v8.6 (c) 2017 by van Hauser/THC - Please do not use in military or secret service organizations, or for illegal purposes.

Hydra (http://www.thc.org/thc-hydra) starting at 2018-11-21 12:44:18
[INFO] Reduced number of tasks to 4 (mysql does not like many parallel connections)
[DATA] max 4 tasks per 1 server, overall 4 tasks, 1012 login tries (l:1/p:1012), ~253 tries per task
[DATA] attacking mysql://192.168.1.10:3306/
[ERROR] Host '192.168.1.200' is not allowed to connect to this MySQL server

It was converted with

pshem@Computer /path $ pandoc -s -o pandoc_test.odt -f markdown_github -t odt pandoc_test.md

When trying to open it in LibreOffice, one is greeted with an error message saying Read-Error. Format error discovered in the file in sub document content.xml at 34,37. This is the line found at 34,37:

<text:p text:style-name="P8">[ERROR] Host '192.168.1.200' is not allowed to connect to this MySQL server</text:p>

Pandoc version from Ubuntu 16.04:

pandoc -v
pandoc 1.16.0.2
Compiled with texmath 0.8.4.1, highlighting-kate 0.6.1.
Syntax highlighting is supported for the following languages:
    abc, actionscript, ada, agda, apache, asn1, asp, awk, bash, bibtex, boo, c,
    changelog, clojure, cmake, coffee, coldfusion, commonlisp, cpp, cs, css,
    curry, d, diff, djangotemplate, dockerfile, dot, doxygen, doxygenlua, dtd,
    eiffel, email, erlang, fasm, fortran, fsharp, gcc, glsl, gnuassembler, go,
    haskell, haxe, html, idris, ini, isocpp, java, javadoc, javascript, json,
    jsp, julia, kotlin, latex, lex, lilypond, literatecurry, literatehaskell,
    llvm, lua, m4, makefile, mandoc, markdown, mathematica, matlab, maxima,
    mediawiki, metafont, mips, modelines, modula2, modula3, monobasic, nasm,
    noweb, objectivec, objectivecpp, ocaml, octave, opencl, pascal, perl, php,
    pike, postscript, prolog, pure, python, r, relaxng, relaxngcompact, rest,
    rhtml, roff, ruby, rust, scala, scheme, sci, sed, sgml, sql, sqlmysql,
    sqlpostgresql, tcl, tcsh, texinfo, verilog, vhdl, xml, xorg, xslt, xul,
    yacc, yaml, zsh
Default user data directory: /home/pshem/.pandoc
Copyright (C) 2006-2015 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

I'm including the created corrupted file:
pandoc_test.odt.zip

Most helpful comment

U+0004 isn't a legal XML character anyway, so pandoc's generating ill-formed XML. pandoc definitely ought to be able to detect this case, though I'm not sure whether it should be an error, a warning + passthrough because GIGO, or a warning + stripping.

Here's the spec on what characters are allowed: note that even presently-unassigned codepoints are legal, so random new emoji getting added isn't a problem.

Some formats might place additional restrictions on what characters are allowed and those might be more troublesome to check for, especially if they're not written with future versions of Unicode in mind. I think it would make sense to cross that bridge when we come to it.

All 11 comments

  1. Use the newest pandoc version from https://github.com/jgm/pandoc/releases
  2. Could you give us the actual input file? Best as a minimal example.

Sorry, I don't have the time to test if the bug happens in the newest pandoc release right now. Here is the minimal example file I used in the first post( embeding markdown into markdown was... not the best idea)
pandoc_test.md.zip

Your markdown file contains the control character EOT (0x04) right before the word "Host".
I don't know why. This is being passed through to the ODT, which apparently doesn't like it.
(verified with latest version)

Rendering the EOT as &#4; makes no difference, still invalid. This character (and presumably other control characters) should be stripped.

I suppose we could modify escapeStringForXML so that it strips this character (and which others)? It may be, though, that the character is allowed in some XML formats. I'm inclined to do nothing, actually, since there's no clear reason to have an EOT character in a Markdown file; this is garbage in, garbage out, no?

Agreed, if we start stripping unknown Unicode characters, we'll always be one step behind adding the newest emojis...

See related issue #5042

This character probably came to be because the snippet is from an interupted sqlmap run. It would be nice to have a warning about potentially invalid characters though - I think trying to convert to PDF directly errored out in pdflatex and that's why I tried converting to odt in the first place.

Ideally, it would be nice to have a list of characters invalid for each format, but that seems like a lot of effort for little gain. I don't think emoji are likely to break quite as many formats as control characters will. A generic warning for "Contains control characters, might break some formats" shown only when running verbose could work too. It's fair if you want to close this as won't-fix

One thing we could do is to emit warnings when we have non-printable / invisible characters in the input stream. We could even add the option to use a custom whitelist, so people can choose whether to get a warning on e.g. \r or not.

When using --pdf-engine=pdflatex, the whitelist could be restricted to ASCII. And broadened when using XeLaTeX.

U+0004 isn't a legal XML character anyway, so pandoc's generating ill-formed XML. pandoc definitely ought to be able to detect this case, though I'm not sure whether it should be an error, a warning + passthrough because GIGO, or a warning + stripping.

Here's the spec on what characters are allowed: note that even presently-unassigned codepoints are legal, so random new emoji getting added isn't a problem.

Some formats might place additional restrictions on what characters are allowed and those might be more troublesome to check for, especially if they're not written with future versions of Unicode in mind. I think it would make sense to cross that bridge when we come to it.

Copied here for convenience:

Char | ::= | #x9 \| #xA \| #xD \| [#x20-#xD7FF] \| [#xE000-#xFFFD] \| [#x10000-#x10FFFF]

I think the easiest thing to do would be to change escapeStringForXML in Text.Pandoc.XML so that it silently strips out nonconforming characters. If we wanted to add a warning, we could, but this would require completely changing the type of escapeStringForXML and revising a lot of other code.

Was this page helpful?
0 / 5 - 0 ratings