require "yaml"

yaml_data = %(
- itemID: 30003270
  itemName: 6E-578
)

class NameFile
  YAML.mapping(
    item_id: {
      type: Int32,
      nilable: false,
      key: "itemID",
    },
    item_name: {
      type: String,
      nilable: false,
      key: "itemName",
    },
  )
end

yaml_test = Array(NameFile).from_yaml(yaml_data)
p yaml_test
Or in simpler terms:
require "yaml"
p String.from_yaml "123"
This is either a bug in the YAML parser, or it is due to invalid YAML.
Please include the error message in the bug description:
Expected String, not 6E-578 at line 3, column 15 (YAML::ParseException)
It seems to me that the behaviour is legit. 6E-578 is a valid YAML number literal, therefore it won't parse as a string (the same holds for 123, obviously).
The YAML file is also completely valid, the problem is just that the YAML data format does not match the data format you expect in Crystal. In the YAML file, itemName has a value of type float but it is mapped to String in Crystal. I assume there are other entries in the array which have type string, so the matching Crystal type would be String | Float64.
To solve this, you should consider if the YAML data format is actually what you intended, or if you want 6E-578 to be interpreted as a string. If this is the case, you need to tag it as !!str or wrap it in quotes. The plain scalar value means a float.
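For illustration, a minimal sketch of those two options, assuming the behaviour reported above (a plain 6E-578 resolves to a float and is rejected when mapped to String):

require "yaml"

# Quoting the scalar makes it a YAML string, so it maps to String.
p String.from_yaml %("6E-578")    # => "6E-578"

# An explicit !!str tag has the same effect for an otherwise plain scalar.
p String.from_yaml "!!str 6E-578" # => "6E-578"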
If the YAML data format is correctly expressing this value as float, you need to think about the Crystal data format. Should it use a float as well? If yes, you need to change the type definition.
If you always want the item_name to be a string, even if the YAML value is a float, you need to use a custom YAML converter.
I said it many times already. What you just described is valid for a language like Ruby, where you can't declare the type of a mapping. In Crystal we can, so if I say I want something as a String we can always read it as a string. Nim does exactly that, for example.
Please, let's change this behavior. The above thing looks like a floating-point number but also looks like an item code. If Crystal is more expressive than Ruby or other dynamic languages, there's no need to get down to their level for compatibility. In Ruby you can always read that and then call to_s, or maybe use a custom mapper if possible (not sure, but we don't have to care).
The above thing looks like a floating point number
That's why, per the YAML spec, it is to be interpreted as such. There is no ambiguity here; it can't be interpreted as a string because it is a float. If the scalar is supposed to be a string, it just has the wrong data format in YAML.
The default behaviour should not just essentially ignore YAML data types and coerce them to whatever type is defined as Crystal mapping. This can lead to unexpected behaviour. If the intended data type does not match, the parser should raise an error.
Of course there can be situations where you need access to the verbatim value in the YAML source to convert it to a different data type than the YAML type would equal to. But that's no longer parsing YAML according to the specs. This type of customization can easily be implemented as a converter:
require "yaml"

module EverythingAsString
  def self.from_yaml(context, node)
    node.as(YAML::Nodes::Scalar).value
  end
end

class NameFile
  YAML.mapping(
    item_name: {
      type: String,
      key: "itemName",
      converter: EverythingAsString,
    })
end

pp NameFile.from_yaml("itemName: 6E-578") # => #<NameFile:0x564f4091ae40 @item_name="6E-578">
@straight-shoota What I'll talk about now refers to YAML 1.2, which is the latest YAML spec, and which is what YAML tends towards. I think the implementation in Crystal is for 1.1 though maybe it should be updated to 1.2.
In any case, in YAML there are recommended schemas.
That means that a YAML document can be interpreted in many ways; one decides which schema to choose.
Now, "schema" is similar to "mapping" in my mind.
When you do YAML.parse(...) we have to choose one schema. I think we use the Core schema from 1.1 with some extensions. These extensions are recommended (like timestamps) but they are not mandatory. Everything is pretty much optional in YAML.
But when we do Type.from_yaml, well, we are specifying how we want that value to be parsed. So I think String.from_yaml should work with a value of "123", unless that value is explicitly tagged as an integer with the "!!int" tag.
In the absence of a tag, the value is left to interpretation by a tool.
Anyway, this is just my opinion. We either change this in Crystal, or the user has to put quotes around values that might look like numbers, timestamps, booleans, etc.
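To make the distinction concrete, a small sketch of the two cases, assuming the behaviour described in this thread:

require "yaml"

# Untyped parsing: only the schema decides, so a plain 123 becomes an integer.
p YAML.parse("123").raw # => 123 (an Int64)

# Typed parsing: under the proposal above, String.from_yaml("123") would return "123".
# With the behaviour reported in this issue it raises instead, unless the value
# is quoted in the YAML source:
p String.from_yaml %("123") # => "123"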
The use case of this document is that it maps an item_id to an item_name; other YAML files reference only item_ids. The file is generated by a third party whenever an update comes out for the game.
I am with @asterite on this: when I define the mapping to be a String, I would expect the result to be cast to a String. The converter is a nice workaround but not ideal behavior.
Hmm, I'm starting to come around to @asterite's way of thinking on this issue. I'd like to see how existing libraries in other statically-typed languages handle this (Go, Java, C#, etc.)
Here's an example in Go:
package main

import (
	"fmt"
	"log"
	"gopkg.in/yaml.v2"
)

var data = `a: 123`

type T struct {
	A string
}

func main() {
	t := T{}
	err := yaml.Unmarshal([]byte(data), &t)
	if err != nil {
		log.Fatalf("error: %v", err)
	}
	fmt.Printf("--- t:\n%v\n\n", t)
}
Works well. I mapped a string, the value was 123, it works.
Here's for C#: https://dotnetfiddle.net/HD2JXM
In that snippet, try changing any value that looks like a string to a number. It works fine.
Then we have Nim that I also mentioned (I read it somewhere, I'm lazy to search it now).
So for now, dynamic languages always look at the shape of the value (they have no choice) while static types respect... types ;-)
I suggest to change it to how it works in other statically typed languages.
Well, I'm convinced.
Crystal's YAML parser already targets the YAML 1.2 Core Schema. The YAML 1.1 spec doesn't define any recommended schemas.
The relevant part is how Tag Resolution is handled in these Schemas:
Failsafe Schema Tag Resolution:
All nodes with the “?” non-specific tag [this includes plain scalars] are left unresolved. This constrains the application to deal with a partial representation.
This is essentially what this issue is about: with this schema, it is entirely the application's responsibility to decide how plain scalars are mapped, and this could even mean the same value might be interpreted as different types depending on the context (so you could have itemName: 6E-578 map to a string and number: 6E-578 map to a float).
JSON Schema Tag Resolution:
Scalars with the “?” non-specific tag (that is, plain scalars) are matched with a list of regular expressions (first match wins, e.g. 0 is resolved as !!int). In principle, JSON files [more precisely: YAML files with JSON Schema] should not contain any scalars that do not match at least one of these. Hence the YAML processor should consider them to be an error.
In the JSON Schema, a plain scalar could not even be a string, because strings need to be explicitly tagged or quoted.
Core Schema Tag Resolution:
Scalars with the “?” non-specific tag (that is, plain scalars) are matched with an extended list of regular expressions. However, in this case, if none of the regular expressions matches, the scalar is resolved to tag:yaml.org,2002:str (that is, considered to be a string).
In Core Schema, a plain scalar in a number format will always be mapped to a number type, not to a string.
According to that schema, the plain scalar 6E-578 would be interpreted as a floating point number. A more obvious example: 123 would be an integer. When the YAML parser says "this is a number" and the Crystal type says "I want a string", there is a discrepancy between expected and actual data type. By default, this should be considered a parser error, because obviously the producer and consumer of the YAML data seem to have a different understanding of the data format.
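As a quick illustration of that tag resolution in Crystal's untyped parser (YAML::Any wraps Core Schema integers and floats as Int64 and Float64):

require "yaml"

p YAML.parse("123").raw.class       # => Int64   (matches the integer pattern)
p YAML.parse("6E-578").raw.class    # => Float64 (matches the float pattern)
p YAML.parse(%("6E-578")).raw.class # => String  (quoted scalars are not plain, so no resolution)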
Apart from fixing the data representation, the consumer should be able to resolve this. The parsed value could be converted to the expected type in Crystal using the appropriate conversion method. This is totally valid and can be used to implicitly map to a Crystal datatype not available in the YAML schema; for example, Time.from_json uses this, as do the mappings for Crystal objects and collection types (although not scalars).
However, this does not result in the verbatim YAML scalar and therefore probably won't help for the use case in this issue (converting 6E-578 to a float results in 0.0, so to_s gives "0.0" and not "6E-578").
The only possibility to solve this use case in a generalized way (i.e. not using a custom converter for the specific property) is to use a different schema such as Failsafe Schema which allows the consumer to interpret plain scalar data types at will.
I assume that the other YAML libraries in the referenced examples use this schema (or a different one) and therefore use the data format defined in the language type as source of truth. This can be the intended way of dealing with this, or not.
Consider the case where I'm using a YAML file to model some data, one specific property is considered to be a string, and the YAML deserialization implemented in Crystal behaves as such. Now let's assume the requirements change and in the data format this specific property can now be either a string or an integer, each conveying a different meaning. The Crystal program was not updated to include this change but parses YAML data with the updated format. If all values are simply interpreted as strings, it will go unnoticed that 123 is incorrectly considered to be the string "123", and the entire data model actually becomes invalid because the meaning was not transferred. Having type safety when mapping YAML types to Crystal types helps avoid such hidden failures.
The Core Schema is intended to ensure this and we must not change its behaviour of parsing plain scalars.
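A minimal sketch of that failure mode, reusing the EverythingAsString converter from above (the Document class and value property are made up for illustration):

require "yaml"

module EverythingAsString
  def self.from_yaml(context, node)
    node.as(YAML::Nodes::Scalar).value
  end
end

# Originally, value is always a string in the data format.
class Document
  YAML.mapping(
    value: {type: String, converter: EverythingAsString},
  )
end

# The producer later starts emitting integers that carry a different meaning.
# Coercing every plain scalar to String hides that change instead of raising:
pp Document.from_yaml("value: 123") # @value is silently the string "123"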
The ideal solution to the OP's problem is probably to implement the Failsafe Schema which would map plain scalars to the type defined in the Crystal mapping.
Implementation should not be difficult; the only question is how to design the API for choosing between schemas and which one to use as default. The API could probably just be a default_schema argument for the from_yaml method which can receive YAML::Schema::Core or YAML::Schema::Failsafe (or YAML::Schema::JSON eventually). YAML serialization could be similar, but I guess that would require some more thinking.
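A rough sketch of how such a call could look, following the default_schema idea; note that this argument and the behaviour behind it are hypothetical, not existing standard library API:

# Hypothetical: default_schema selects how plain scalars are resolved.
NameFile.from_yaml(yaml_data)                                          # Core Schema (strict): 6E-578 mapped to String raises
NameFile.from_yaml(yaml_data, default_schema: YAML::Schema::Failsafe)  # Failsafe: plain scalars follow the Crystal type declarations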
I would strongly recommend keeping the Core Schema as default because it is the strictest and safest alternative. Diverging from that should be an explicit opt-out. The YAML 1.2 spec also recommends it as the "default schema that YAML processor should use unless instructed otherwise."
I think we all agree that the YAML parser should default to the Core Schema
Yes, in the absence of types.
If I say to parse something as a String, I want a String. All libraries I presented do this. Why do we have to do it in a different way? Everyone is wrong except us?
The YAML 1.2 specs also recommends it as "default schema that YAML processor should use unless instructed otherwise."
Yes, a type declaration means "instructed otherwise". In the absence of instructions (types), use the Core Schema. If I want something as a type, parse it as that type.
Come on, every other library does that...
In your last comment, you're mixing up two different layers:
A type declaration declares what type the YAML parser is expected to return. This does not instruct the parser to use a different schema than Core.
The schema determines how plain scalars are to be interpreted as YAML types and the data type determines how to map the YAML type to a Crystal type. The first can be changed by using a different schema, the second by using a different converter.
I couldn't find any documentation about why they implement it this way, or even whether it is an intentional, informed decision. If there is anything on that, I'd be happy to read it.
That everyone else does it does not necessarily mean it's best. I am convinced it is better to be type safe by default, and I have explained my reasoning.
What would I have to do to get the behaviour I described, where mismatching data types between the YAML parser and the Crystal type declaration cause a failure? I consider this the safest option and would want my applications to behave this way, even if I had to explicitly state it somewhere because it was not the default.
With my proposal it would be quite easy to choose: Use Core Schema for type safe mappings and Failsafe Schema to give precedence to Crystal type declarations.
Otherwise I see no simple way to do this besides adding another configuration option to from_yaml. But it doesn't really make sense to me. What speaks against this?
Which one is the default is a different question; I could live with the Failsafe Schema if I have the chance to choose the Core Schema to get type safety. Still, I would definitely prefer the default to be the stricter and recommended standard.
I think having the Core Schema be the default and the Failsafe Schema define mapping-based typings would be a good solution. It keeps current functionality while allowing more control over how the YAML file gets parsed when using the other schema.