Pandoc: docx: corrupted file caused by metadata fields with different case

Created on 31 Mar 2019  Â·  16Comments  Â·  Source: jgm/pandoc

With #5252 (pandoc 2.6 ) it is possible to store custom metadata properties in docx. But in rare cases this can lead to corrupt docx files if two metadata fields just differing in case.

The cause of this problem was found by @lierdakil as part of lierdakil/pandoc-crossref#223. A MWE without involving pandoc-crossref is:

echo -e "---\ntest: 1\nTest: 1\n..." | pandoc -o corrupt.docx
Docx writer

Most helpful comment

I like the idea of having a warning in the docx writer when it ignores these case-duplicates.

All 16 comments

Interesting! There's probably no perfect solution, but @agusmba, @jgm, how does this sound?

  1. the first usage of a custom prop gets kept, regardless of case.

  2. future uses with different case get thrown out, and trigger a warning.

  3. future uses with same case overwrite.

  4. a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

Oh, and question: should this be specific to docx or a general rule? I'd tend to think the latter -- less confusion.

or a general rule

Please don't go redefining YAML semantics just because you feel it's "less confusion". By the end of the road, you'll most likely get _more_ confusion. YAML is case-sensitive. In general, that should be the only thing that matters. At the very least, pandoc's metadata has been case-sensitive since the beginning. Changing that now should at least be accompanied by a major version bump, since it's a huge change in behaviour, as far as I'm concerned.

future uses with same case overwrite

This is kinda irrelevant. YAML parser used in Pandoc will refuse to parse YAML objects with duplicate fields. As it should to be a compliant YAML parser (YAML spec 1.2 says that YAML object MUST have unique keys to be valid)

Edit: nope, apparently I'm wrong here. Pandoc's YAML parser will silently use the latest definition of the field (my previous test had other syntactical errors):

$ echo -e "---\ntest: 1\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 2
---

Lorem

a custom prop with different case of built-in metadata prop get thrown out, and triggers a warning.

Not sure what you mean here. Could you explain in more detail?

As I see it, the issue is that OOXML's custom props use case-insensitive names, while pandoc's metadata is case-sensitive (and it was since the beginning, so changing that now doesn't sound good; maybe in pandoc v3.0 that would be justified). Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

Due to pandoc blindly shoving its metadata into docx it can produce invalid docx. Hence, the logical way to go about fixing this is to fix docx writer so that it never produces invalid docx and shows warnings when it can't translate pandoc's richer metadata model into docx.

That's what I meant to say. I wasn't talking about YAML at all -- I was referring to what the docx writer does with the metadata it's handed, and how we get produce docx custom properties. (That's why I was referring to it as properties, and not fields.)

As far as a general rule goes, I just meant for writers that produce these xml formats with custom metadata properties. But yes, that's a silly way to deal with individual clunky formats.

That's what I meant to say.

Okay, cool.

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

or writers that produce these xml formats with custom metadata properties

About that... pptx is obviously the same as docx, that's a given. But I'm not so sure about ODT. LibreOffice doesn't mind metadata properties differing in case only (_both_ in docx and odt), but whether that's a quirk of LibreOffice, or something related to the ODF specification, I can't say.

@lierdakil

I was wrong about pandoc's YAML parser btw, it will silently accept duplicate fields and use the last value it's seen, apparently. That'll teach me not to run tests while still half-asleep.

Not really, please note that:

$ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown
---
test: 1
---

which is consistent with the documentation:

A document may contain multiple metadata blocks. The metadata fields will be combined through a left-biased union: if two metadata blocks attempt to set the same field, the value from the first block will be taken.

What wasn't stated apparently was that if a field is defined multiple times within the same yaml block, the last definition is kept (I don't think this is a bug since many libraries act this way, but it might be a good idea to document this also).

That said, I would apply the same rationale to custom properties in docx|odt (ignoring their case):

  • within the same yaml block, the latest definition is kept
  • duplicate definitions in subsequent yaml blocks are ignored

actually since this is a docx|odt quirk, we might go around solving it in a simpler way, that is:

At the writer level, where pandoc has already filtered real duplicates for us, use the first definition of a property and if a second definition with different case is found, ignore it throwing a warning.

пн, 1 апр. 2019 г., 10:14 Agustín Martín Barbero notifications@github.com:

@lierdakil https://github.com/lierdakil

I was wrong about pandoc's YAML parser btw, it will silently accept
duplicate fields and use the last value it's seen, apparently. That'll
teach me not to run tests while still half-asleep.

Not really, please note that:

$ echo -e "---\ntest: 1\n...\n---\ntest: 2\n...\nLorem" | pandoc -s -t markdown

test: 1

which is consistent with the documentation
https://pandoc.org/MANUAL.html#extension-yaml_metadata_block:

A document may contain multiple metadata blocks. The metadata fields will
be combined through a left-biased union: if two metadata blocks attempt to
set the same field, the value from the first block will be taken.

In my mind, this is a completely different case. I.e. here we're talking
about parsing each metadata block individually, then combining the results,
as opposed to parsing a single metadata block. According to YAML spec,
duplicate field names are an error, which I initially assumed pandoc's YAML
parser observes (aeson? can't recall from the top of my head). I guess it
makes sense tat it behaves the way it does considering basically the same
code is used to parse JSON (which allows duplicate object fields). It's not
a huge deal of course, just something to be aware of.

What wasn't stated apparently was that if a field is defined multiple
times within the same yaml block, the last definition is kept (I don't
think this is a bug since many libraries act this way, but it might be a
good idea to document this also).

That said, I would apply the same rationale to custom properties in
docx|odt (ignoring their case):

  • within the same yaml block, the latest definition is kept
  • duplicate definitions in subsequent yaml blocks are ignored

This would work for reading custom properties from those formats (which
isn't implemented yet AFAIK), but not as much for writing. IIRC pandoc uses
a Map to store metadata objects, so any information about original order is
lost by the time AST gets to writer code.

>
>

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/jgm/pandoc/issues/5413#issuecomment-478463018, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AG8EZi8QR58BTQ9nv0QGb0xgS9dGFXYeks5vcbG6gaJpZM4cUkZW
.

пн, 1 апр. 2019 г., 10:20 Agustín Martín Barbero notifications@github.com:

actually since this is a docx|odt quirk, we might go around solving it in
a simpler way, that is:

At the writer level, where pandoc has already filtered real duplicates for
us, use the first definition of a property and if a second definition with
different case is found, ignore it throwing a warning.

Except writer has no information about original definition order, and
changing that would require nontrivial changes to AST. So "first" is
ambiguous here.

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/jgm/pandoc/issues/5413#issuecomment-478464624, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AG8EZnp4W-JOUgxAZpyEQKGFC9-EQKO8ks5vcbMqgaJpZM4cUkZW
.

Except writer has no information about original definition order, and
changing that would require nontrivial changes to AST. So "first" is
ambiguous here.

Fair enough, I'd rather have a simple solution with a warning, so if the order is unknown let's not even worry about it (if it is known, like alphabetical, lower case first..., we could include it in the warning). Whenever the docx|odt writer detects a case-duplicate ignore it and warn the user.

Warning example:

Ambiguous definition of property XXXX, docx|odt does not support custom properties with the same name (ignoring case), only one definition will be included in the output.

Actually if we document the behavior in the MANUAL, there'd be no need for the warning (as what happens currently with real duplicates)

there'd be no need for the warning

I would argue having a warning is better than having no warning, regardless of documentation. Explicit over implicit.

I like the idea of having a warning in the docx writer when it ignores these case-duplicates.

Small update. We need to do this for odt too. All the spec has to say is this:

The name shall be unique within the domain of elements.

without defining precisely what "unique" means.

While LibreOffice and OpenOffice seemingly use case-sensitive comparison (or at least they don't complain about "corrupted file" and don't break rendering), Word does not, and hence also complains about odt with properties with names that differ only in case.

Ignoring case-insensitive doublicates and having a warning is IMHO a simple fix and restores the compatibility as before of #5252.

pptx writer will need to be fixed as well probably.

Is this easy to fix?

Was this page helpful?
0 / 5 - 0 ratings