Pandoc: Regression: Could not parse YAML metadata

Created on 9 Oct 2018 · 6 Comments · Source: jgm/pandoc

The following problem does not occur in Pandoc 2.2.1 and occurs in all recent versions starting with Pandoc 2.2.2.

Minimal Example

test.yml contains:

---
reason: 'Was geht?

'
---

This is how libyaml, as embedded in the Ruby programming language, serializes a string with a trailing newline ("\n"). The file can be produced with this Ruby command:

ruby -r yaml -e 'puts Hash({"reason" => "Was geht?\n"}).to_yaml + "---"'

In my actual setup, I generate metadata and use it in a document. For the minimal example, I just output the metadata in JSON AST format.

pandoc test.yml -t json

Expected Output

The output with pandoc 2.2.1 is:

pandoc-2.2.1/bin/pandoc test.yml -t json
{"blocks":[],"pandoc-api-version":[1,17,4,2],"meta":{"reason":{"t":"MetaBlocks","c":[{"t":"Plain","c":[{"t":"Str","c":"Was"},{"t":"Space"},{"t":"Str","c":"geht?"}]}]}}}

Erroneous Output

Pandoc 2.2.2 and higher gives a different output:

pandoc-2.2.2/bin/pandoc test.yml -t json
[WARNING] Could not parse YAML metadata at line 1 column 1: :2:18: Unexpected '
  '
{"blocks":[],"pandoc-api-version":[1,17,5,1],"meta":{}}

As you can see, the metadata is empty.

The cause is certainly linked to the dependency change to HsYAML by @hvr, whom I kindly ask to help determine whether the file test.yml uses supported syntax.

All 6 comments

HsYAML claims to comply strictly with YAML 1.2, so the
first thing you should do is check whether your sample
conforms to that spec. It's possible that it does
not. If it does, then you should report a bug to HsYAML.
If not, then I don't consider this a bug at all.

Testing directly with HsYAML:

Data.YAML> decodeNode' failsafeSchemaResolver False False (fromStringLazy "foo: 'hi\n'")
Left ":1:8: Unexpected '\n'"
Data.YAML> decodeNode' failsafeSchemaResolver False False (fromStringLazy "foo: 'hi\n '")
Right [Doc (Mapping Nothing (fromList [(Scalar (SUnknown Nothing "foo"),Scalar (SStr "hi "))]))]
Data.YAML> decodeNode' failsafeSchemaResolver False False (fromStringLazy "'hi\n'")
Right [Doc (Scalar (SStr "hi "))]

So the Ruby lib is based on the C library libyaml, which does not support YAML 1.2 yet.
Upstream Bug report: https://github.com/yaml/libyaml/issues/20

I could not find out whether my test file is YAML 1.2 compliant.

I don't think there's much more we can do about this on the pandoc side. If you find there's a bug in HsYAML, you should report it there.

I'm pretty confident that

- 'Was geht?

'

or

reason: 'Was geht?

'

are in fact not valid YAML 1.2.

If you look at section 7.3.2, Single-Quoted Style, you'll notice that the rules

[123]   nb-ns-single-in-line    ::= ( s-white* ns-single-char )*     
[124]   s-single-next-line(n)   ::= s-flow-folded(n) ( ns-single-char nb-ns-single-in-line   ( s-single-next-line(n) | s-white* ) )?
[125]   nb-single-multi-line(n) ::= nb-ns-single-in-line ( s-single-next-line(n) | s-white* )

take an n parameter ([123] excepted, as it never crosses a line break), which keeps track of the relative indentation level and encodes the general rule that nodes must be indented at least one level deeper than the block node they're contained in. In particular, the s-flow-folded(n) production requires n spaces of leading indentation before any non-space content on a continuation line.

As such, if, for example, the - (YAML sequence indicator) is at n = 0, then the single-quoted scalar inside that block collection is at least at level n = 1, which the closing quote in column 0 of both examples above does not satisfy.


PS: As it turns out, there's a negative test in the YAML testsuite at http://matrix.yaml.io/sheet/invalid.html#QB6E which expects a compliant YAML parser to fail on

---
quoted: "a
b
c"

Thanks for telling us @hvr.

I'm just reporting back here for those running into similar issues. I used the YAML 1.2 compliant library ruamel.yaml to find a YAML 1.2 compliant form of the example metadata file. One solution (maybe there are others) is:

---
reason: "Was geht?\n"
---

What is different?

  1. use of double quotation marks
  2. use of the escape sequence "\n" for the newline instead of two actual newlines
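
Both differences can also be applied when the file is generated, rather than patched afterwards. The following is only a minimal sketch, not something proposed in this thread: it assumes flat metadata whose keys are plain identifiers and whose values are strings, and it relies on JSON string literals using the same double-quote and backslash escaping as YAML 1.2 double-quoted scalars. The helper name yaml12_metadata is hypothetical.

```ruby
require "json"

# Hypothetical helper: render a flat Hash (plain-identifier keys, string
# values) as a metadata block in which every value is a double-quoted scalar.
# Nested structures and keys that need quoting are NOT handled here.
def yaml12_metadata(meta)
  lines = meta.map { |key, value| "#{key}: #{value.to_s.to_json}" }
  (["---"] + lines + ["---"]).join("\n") + "\n"
end

block = yaml12_metadata("reason" => "Was geht?\n")
puts block
```

For the example value this prints the same three lines as the fixed file above. Whether it is sufficient depends on which YAML features the real metadata uses.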

My solution is to produce my file with Ruby and then fix this one problem manually with regular expressions. Of course, with a different feature set used in the YAML file, other problems may occur that also need manual treatment. So I hope that in the long run, a YAML 1.2 compliant Ruby lib becomes available.

# fix YAML 1.2 compatibility for pandoc > 2.2.1, see https://stackoverflow.com/a/30049447/1407622
sed -r -z -i "s/: '([^']+)\n\n'/: \"\1\\\n\"/g" test.yml
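
The -z option used above is specific to GNU sed. Where that is unavailable, the same substitution could be done in Ruby itself. This is a sketch under the same assumptions as the sed expression (the scalar contains no single quotes, and nothing in it needs escaping inside double quotes); the function name fix_trailing_newline_scalars is hypothetical.

```ruby
# Hypothetical helper mirroring the sed expression: a single-quoted scalar
# that ends in an empty line followed by a closing quote in column 0 becomes
# a double-quoted scalar ending in the escape sequence \n.
# Like the sed version, it does not escape '"' or '\' inside the scalar.
def fix_trailing_newline_scalars(text)
  text.gsub(/: '([^']+)\n\n'/) { ": \"#{Regexp.last_match(1)}\\n\"" }
end

sample = "---\nreason: 'Was geht?\n\n'\n---\n"
puts fix_trailing_newline_scalars(sample)
```

To rewrite a file in place, the result could be written back with File.write("test.yml", fix_trailing_newline_scalars(File.read("test.yml"))).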