Packages: YAML parser compatibility

Created on 18 Aug 2019  ·  11 Comments  ·  Source: sublimehq/Packages

_This issue is a result of a discussion at https://github.com/divmain/GitSavvy/pull/1152_

Third-party projects that want to use sublime-syntax definitions may fail to do so because several YAML parsers out in the wild throw parse errors.

While ST's YAML parser and js-yaml load all files fine, ruamel.yaml and powershell-yaml fail on unquoted scalar values like:

- match: =
  scope: test
- match: :=
  scope: test
- match: <<
  scope: test

According to YAML 1.2, chapter 7.3.3 (Plain Style), these plain values (starting with =, <, or : followed by a printable character) are valid: = and < are not part of c-indicator, and : may start a plain scalar when it is followed by a non-space character.
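As a rough illustration of the rule cited above, the following Python sketch checks only whether a value may begin a plain scalar under YAML 1.2. The function name `can_start_plain` and its `in_flow` parameter are hypothetical, and the check ignores the rest of the scalar, so it is not a full validator.

```python
# Indicator characters listed in the YAML 1.2 spec (c-indicator).
C_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")
# Flow indicators, which are additionally unsafe inside [...] and {...}.
C_FLOW_INDICATORS = set(",[]{}")


def can_start_plain(value, in_flow=False):
    """Return True if `value` may begin a YAML 1.2 plain scalar.

    Hypothetical helper: it inspects only the first two characters.
    """
    if not value or value[0].isspace():
        return False
    first = value[0]
    if first not in C_INDICATORS:
        return True
    # "?", ":" and "-" are allowed when followed by a "safe" non-space character.
    if first in "?:-" and len(value) > 1:
        following = value[1]
        if following.isspace():
            return False
        return not (in_flow and following in C_FLOW_INDICATORS)
    return False


for sample in ["=", ":=", "<<", ":", "#x", "- x"]:
    print(repr(sample), can_start_plain(sample))
```

Under these assumptions, =, := and << are all accepted as the start of a plain scalar, while a lone : or a value starting with # is not.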

Is it worth modifying the syntax definitions to quote such values so that (Python-based) YAML parsers can load them properly?

Quotation style has changed over time and is now somewhat mixed: some values are double-quoted, some single-quoted, and others use plain style.

Might it be worth agreeing on a common quotation style for sublime-syntax files?

Most helpful comment

Ideally, I'd rather not have to modify perfectly valid syntax definitions so that broken third-party YAML processors can handle them. In order to maintain compatibility with processors that only support 1.1, we'd have to make sure that all syntax definitions were polyglot documents with identical interpretations in both versions. This sounds like a lot of hassle, and to ensure compatibility we'd need to run some kind of linter and get it hooked up to the repo CI.
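A minimal sketch of what such a linter might look like, assuming a line-based scan is good enough for a first pass. The function `find_yaml11_hazards` and the set of hazardous values are illustrative assumptions; the set covers only the YAML 1.1 value (=), merge (<<), and a few boolean-like plain scalars, and is not exhaustive.

```python
import re

# Plain scalars that YAML 1.1 resolves to non-string types but that the
# YAML 1.2 core schema treats as plain strings (illustrative, not exhaustive).
YAML11_SPECIAL = {"=", "<<", "y", "n", "yes", "no", "on", "off"}

# Matches "key: value" or "- key: value" where the value is unquoted.
PAIR_RE = re.compile(r"^\s*(?:-\s+)?[\w.-]+:\s+(?P<value>[^'\"\s].*?)\s*$")


def find_yaml11_hazards(text):
    """Return (line_number, value) pairs for values a 1.1 parser may mistype."""
    hazards = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = PAIR_RE.match(line)
        if match and match.group("value").lower() in YAML11_SPECIAL:
            hazards.append((number, match.group("value")))
    return hazards


sample = "- match: =\n  scope: test\n- match: '<<'\n  scope: test\n- match: <<\n"
print(find_yaml11_hazards(sample))
```

A real check would have to parse the document under both versions and compare the results; this sketch only flags the most obvious candidates for quoting.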

The problem in the linked issue is a bug in ruamel.yaml's non-specific tag resolution; it's using a YAML 1.1 type extension in a YAML 1.2 document. I've submitted an issue on the ruamel.yaml tracker. There is an immediate workaround that will avoid the issue.

The workaround for pyyaml is to use ruamel.yaml instead. pyyaml has very poor support for YAML 1.2 and should not be used to parse sublime-syntax files.

powershell-yaml seems to be a wrapper around YamlDotNet, which only supports YAML 1.1. An issue for YAML 1.2 support has languished for years with no progress. If anyone is interested in 1.2 support for powershell-yaml, they might want to bump that issue.

It's a shame that so few YAML processors support 1.2, especially given that it's more than ten years old.

All 11 comments

Still, our parsers seem to think that ":=" is not a valid scalar. As in the log attached in the related ticket (see result of trace in this comment):

GS [debug] error parsing 'Packages/Go/Go.sublime-syntax' syntax file: while parsing a flow node
expected the node content, but found ':'
  in "<unicode string>", line 288, column 13:
        {match: :=   , scope: keyword.operator.a ... 
                ^ (line: 288)

It seems that the latest version of the spec requires the character following ":" to be a "non-space safe character"; maybe our "=" is not safe? Then again, how does ST load the Go.sublime-syntax file without complaining?

Regardless of the example above, I'd love for Sublime to deal with this for me. Could you (or anyone on the ST team who handles the ST YAML parser) tell us which parser ST uses? Or, even better, expose an API to read the parsed sublime-syntax file; that would be optimal.

To be slightly pedantic: even if these characters happen to parse fine today, the day will soon come when a syntax needs to match e.g. "::" or ":#" and quoting becomes necessary. Would it not make more sense to have a common guideline for writing syntax files that suggests quoting values such as those of a match?

FWIW ruamel.yaml parses the example if you suppress the implicit value and merge tags (for = and <<, respectively), as per the linked GitSavvy issue:

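import ruamel.yaml

# Remove the YAML 1.1 'value' (=) and 'merge' (<<) implicit resolvers.
# Note: this mutates the module-global resolver list, so it affects every
# user of ruamel.yaml in the same process.
ir = ruamel.yaml.resolver.implicit_resolvers
# Iterate backwards so deletions do not shift the indices still to be visited.
for idx in range(len(ir) - 1, -1, -1):
    tag_type = ir[idx][1].rsplit(':', 1)[1]  # text after the tag's last ':'
    if tag_type in ('value', 'merge'):
        del ir[idx]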

yaml = ruamel.yaml.YAML(typ='safe')  # or leave out for 'rt'
data = yaml.load("""\
- match: =
  scope: test
- match: :=
  scope: test
- match: <<
  scope: test
""")

print(data)

output: [{'match': '=', 'scope': 'test'}, {'match': ':=', 'scope': 'test'}, {'match': '<<', 'scope': 'test'}]

Nothing special needs to be done for the :=, as per https://yaml.org/spec/1.2/spec.html#:%20mapping%20value// :

_Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space. This allows for unquoted URLs and timestamps. It is also a potential source for confusion as “a:1” is a plain scalar and not a key: value pair._
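
The rule quoted above can be illustrated with a small Python sketch. The helper `find_mapping_colon` is hypothetical; it assumes a single-line block-style entry with a plain (unquoted) key and does not handle quoted scalars, comments, or flow collections.

```python
def find_mapping_colon(line):
    """Return the index of the first ':' acting as a mapping value indicator.

    A ':' counts only when it is followed by white space or ends the line;
    None is returned when the whole line is a plain scalar.
    """
    for index, char in enumerate(line):
        if char != ":":
            continue
        is_last = index == len(line) - 1
        if is_last or line[index + 1] in " \t":
            return index
    return None


for text in ["a:1", "a: 1", "match: :=", "url: http://example.com"]:
    print(repr(text), find_mapping_colon(text))
```

In "match: :=", only the first colon is followed by a space, so the remaining := is read as the value.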

The quoting rules I have been following recently are: leave values unquoted unless the YAML 1.2 spec requires quoting; then prefer single quotes if the string doesn't contain any; otherwise decide whether it is more readable with double quotes. Simple as that. So in your example, you would wrap the value in single quotes once it becomes necessary.
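
These rules could be sketched in Python roughly as follows. Both `needs_quoting` and `quote_scalar` are hypothetical helpers. The quoting check is a simplified approximation for single-line values in block context and ignores type resolution (for example, a value such as true would stay unquoted). Where the rule leaves the choice open, this sketch always falls back to double quotes.

```python
C_INDICATORS = set("-?:,[]{}#&*!|>'\"%@`")


def needs_quoting(value):
    """Simplified check for a single-line block-context value (assumption)."""
    if not value or value != value.strip():
        return True
    first = value[0]
    if first in C_INDICATORS:
        # "?", ":" and "-" may start a plain scalar if a non-space follows.
        if first not in "?:-" or len(value) == 1 or value[1].isspace():
            return True
    # ": " and " #" would be read as a mapping indicator or a comment.
    return ": " in value or " #" in value or value.endswith(":")


def quote_scalar(value):
    """Plain if possible, else single quotes, else double quotes."""
    if not needs_quoting(value):
        return value
    if "'" not in value:
        return "'" + value + "'"
    # Double-quoted style needs backslashes and double quotes escaped.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


for sample in ["=", ":=", "::", "#x", "it's: x"]:
    print(quote_scalar(sample))
```

Preferring single quotes keeps regular expressions readable, since backslashes need no escaping there.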

For the record, we use https://github.com/jbeder/yaml-cpp, which isn't going to help any plugins.

I think I got the idea, we have valid YAML, but different parsers choose to implement different default schemas (and sometimes not precisely). Thanks again all for clarifying.

@wbond do you believe exposing a parser method in the python plugin API (that uses yaml-cpp internally) would be a good idea?

In the case of GitSavvy, we'd probably end up stripping these implicit resolvers, or just use yaml.round_trip_load as suggested by @Thom1729 , until a better alternative can be found.

Regarding ruamel, it'd probably be a better idea to disable the implicit resolvers for the merge and value types, which are only in the default? schema for 1.1[1, 2] and not in the recommended schema for 1.2[3].

Also, you should remove the implicit resolvers from your YAML instance, not the global default mapping.

Edit: Actually, it looks like modifying the implicit resolvers of a YAML instance isn't quite trivial: the VersionedResolver lazy-loads its implicit resolvers, based on the YAML version being parsed, from the module-global implicit_resolvers dict. While it's easy to add implicit resolvers, it's not easy to remove them at all. I'll move this to the ruamel.yaml issue tracker.
Edit2: https://bitbucket.org/ruamel/yaml/issues/309/implicit-merge-and-value-types-in-yaml-12

@asfaltboy I very much doubt we would expose yaml-cpp to Python. Partially because we don't have an interest in supporting the full YAML 1.2 spec moving forward, and exposing it via the API would lock us into keeping yaml-cpp around. Beyond that, it would probably be a bunch of work, when it seems that there are existing YAML libraries for Python. It would probably make more sense for someone to wrap ruamel with appropriate changes.

It looks like there may be a very specific bug in ruamel.yaml's C parser that's affecting the Go syntax. {match: :=} raises an error, but only when the C parser is used and only in a flow context (i.e. match: := works fine). I've opened an issue.

FYI, I wrote a dependency for using ruamel.yaml in Sublime. This is what I use for parsing syntax definitions, e.g. in JS Custom. The C parser is not supported, basically because I didn't want to deal with binaries, but parsing YAML generally isn't performance-critical anyway. I've always used the round-trip mode (which makes the JS Custom output much, much easier to debug), so I never noticed the odd resolution of = and << or the {match: :=} bug. Basically, the parts that are original to ruamel.yaml should work fine for sublime-syntax files, but the parts inherited from pyyaml are a bit creaky.

{match: :=}

Just for the record: I don't like this coding style anyway, as the extra braces are counterproductive for readability.

Are you sure? Here's the part of st... JSON like looking YAML code.

https://github.com/sublimehq/Packages/blob/d1494f4a5a2bf8c6ae2f3e7ad762a05273df97ef/Go/Go.sublime-syntax#L268-L305
