Crystal: [RFC] Out of the box JSON/YAML serialization

Created on 1 Jul 2018  ·  38 Comments  ·  Source: crystal-lang/crystal

Even though JSON::Serializable is a big improvement over JSON.mapping, having to include JSON::Serializable in every type feels like a burden.

In many other languages, for example, C#, Go, and I guess Java, serialization works out of the box through runtime reflection. We can do the same in Crystal, with compile-time reflection.

The way to do this: we define Object.from_json(string_or_io) on every Object. It creates an instance with allocate (unsafe), creates a JSON::PullParser and passes it to an instance method named initialize_from_json. This method is very similar to the initialize(pull_parser) we have now. Because it raises if a required property (one without a default value, and not nilable) is missing from the JSON source, it's totally safe.

The benefit is not having to require JSON::Serializable in every object that needs to be serialized. You could do:

require "json"

record Point, x : Int32, y : Int32

# Works!
Point.from_json(%({"x": 1, "y": 2}))

and it will work right away. Serialization can still be controlled via attributes.
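A rough sketch of the mechanism described above (the name initialize_from_json and the finalizer handling are assumptions for illustration, not actual stdlib code):

```crystal
class Object
  # Sketch: allocate the instance unsafely, then let a macro-generated
  # method populate it from the pull parser, raising if a required
  # property is missing from the JSON source.
  def self.from_json(string_or_io)
    parser = JSON::PullParser.new(string_or_io)
    instance = allocate
    instance.initialize_from_json(parser) # generated per type via compile-time reflection
    GC.add_finalizer(instance) if instance.responds_to?(:finalize)
    instance
  end
end
```

Because initialize_from_json would raise on missing required properties, the instance never escapes half-initialized, which is what makes the allocate trick safe in practice.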

Take for example https://github.com/crystal-lang/crystal/pull/6308 . It's nice, but having to manually include the JSON::Serializable module feels redundant, especially for records and simple types.

Another benefit is that no module is included, which means #6194 won't trigger (for this case).

Yet another benefit: imagine you define some simple types in your library. If people want to serialize them, they can do it right away; no need to reopen them and include JSON::Serializable in case the author forgot.

The diff to achieve this is very simple, ~except there's a compiler bug that prevents this from working (I can work on that)~ I have it working locally.

I'm pretty sure many will say it might be "dangerous" to allow serialization on every type. I don't see it that way. And given that there's precedent in basically every other language out there, I think it's fine.

Thoughts?


All 38 comments

Another thing, right now the compiler complains in this case:

class Foo
  property x : Int32
end

The error is:

instance variable '@x' of Foo was not initialized directly in all of the 'initialize' methods, rendering it nilable

That's cool!

However, once we include JSON::Serializable that error is gone:

require "json"

class Foo
  include JSON::Serializable

  property x : Int32
end

# No error!

The reason is that the include defines an initialize method that makes sure to initialize all instance vars. So there's already an initialize doing that and the ability for the compiler to check that x wasn't initialized is lost. In fact... I just found a bug:

require "json"

class Foo
  include JSON::Serializable

  property x : Int32
end

p Foo.new # => #<Foo:0x104ff0ff0 @x=0>

This happens because of a compiler bug (what a surprise) related to these magic initializers that use @type inside them. There's a rule for making those initialize magically work, bypassing the usual checks. And it seems the checks are lost in other cases. I coded that specifically with JSON::Serializable in mind, and now I regret it because it's a complex thing to check, and it's magical.

By removing that "feature", the only way we can achieve this functionality is the way I propose above. It can be with a JSON::Serializable module or without one (I prefer without one: less typing, and it works out of the box).

Even if we somehow fix the above bug, when we do Foo.new we'll get an error saying no overload matches... there's new(JSON::PullParser). That including JSON::Serializable defines an initialize method for you is a bit annoying, because it now participates in overloads too! And because the lookup for initialize methods is a bit different from regular methods (initialize is not inherited unless the child type defines no new constructors), we have to do that macro included/inherited hack.

I think this change will simplify things a lot:

  • The methods are just defined once (from_json, initialize_from_json)
  • No need to include a module (less typing, and faster compile times and method lookup times, since there's no need to go up the hierarchy to find something)
  • Doesn't rely on compiler hacks (the @type thing inside macros inside initialize), which we'll be able to remove after this change (well, the way I propose is unsafe, because it uses allocate, but the generated method will raise if any of the required instance vars are not present, making it safe... kind of similar to how Array uses Pointer internally but it's still safe)

Thoughts again? :-)

Ah... and the downside is that if you are defining models for consuming an API, you'll actually have to write a bit more:

require "json"

class SomeModel
  include JSON::Serializable

  property x : Int32
  property y : Int32
end

# Compiles

With the new way:

require "json"

class SomeModel
  property x : Int32
  property y : Int32
end

# Doesn't compile: neither x nor y were initialized in an `initialize`

The way to fix it is to define a constructor:

require "json"

class SomeModel
  property x : Int32
  property y : Int32

  # For example:
  def initialize(*, @x, @y)
  end
end

# Now compiles

If the model has a lot of properties, you might have to write a big constructor... Alternatively, define default values for some properties (if the API is guaranteed to return some properties, giving default values for them is acceptable because you know you'll get correct responses). Anyway, defining a constructor isn't that bad, you can just do def initialize(*, put all required properties here).

So, yeah, in some cases you'll have to write more: you won't have to write include JSON::Serializable, but you will have to write a constructor. I think this is good, though, as it makes you think about how the class should be initialized outside of JSON. Imagine someone decides to use that class for some reason, maybe in a test, where parsing it from JSON all of the time might not be ideal; having a constructor is good.

Thinking on a broader scale: JSON/YAML are probably the most important serialization formats, but there are many similar use cases. Should this be the recommended approach for other mappings (XML, MessagePack, DB bindings, etc.) as well? I can't think of any real downsides, other than a bunch of methods available on every object that will often not be useful unless the type is actually supposed to support serialization.

I think this is too magic. I'd like a way to do generic serialization/deserialization, but I don't think this is the way.

The way Serde does it is by describing a "universal data model": everything maps to that data model, and then that data model is mapped to JSON or YAML or whatever. I think we should do the same. And Serde requires an explicit #[derive(Serialize, Deserialize)] to work, so why shouldn't we require explicit includes?

  1. Regarding a constructor hack, maybe a better way to go is to allow a different way of declaring constructors that are not checked by the compiler, where it is the programmer's responsibility to comply with some sound contract. That way there is no hack in the compiler, but a convention.

  2. I don't think serialization should be enabled for all types. I'm ok in adding serialization by default in records for example.

  3. Using the out-of-the-box serialization to avoid triggering #6194 is not worth it IMO.

@bcardiff

1. I don't like that - self.allocate is the unsafe way of declaring a constructor. We just need to automate adding finalizers so all constructors are always obj = Class.allocate; obj.initialize(*args); obj. Then that becomes API.

2 and 3. Absolutely agree

I'm thinking that allocate should add the GC finalizer, maybe. Then that problem will be solved.

I think Rust needs a Serializable attribute because they don't have reflection. With reflection you don't need an attribute.

Serde looks nice, but I can't understand how it works. It seems you have to define how a language type maps to a serde type, and then how to serialize that type into the output, and maybe also the other way around. And then you have generic serialization attributes.

If we could go that way, I think it would be awesome. Less repetition if a type is to be serialized into multiple formats.

I wonder about the efficiency, though. If you have an Array, do you have to map it to a Serde array and then serialize that array? Maybe Rust can do that efficiently; I don't think we can, so it would double the number of allocations...

However, I still think that being able to serialize any type without including a specific module would be nice. Again, the reason some languages don't support this is that they lack reflection and the power to do it.

Alternatively, if it can be implemented without a compiler hack, that's another solution.

@asterite if you think of it in the JSON::Emitter and JSON::PullParser level - there's no efficiency problem. The "data model" is just a standardized interface to the pull parsers and emitters. So we'd have a Serialization::PullParser module with abstract methods which JSON::PullParser would implement and same for emitters. Then you could just do JSON.build do |emitter|; my_obj.serialize(emitter); end and ditto for pull parsers. Then shortcut that up in from_json and to_json and you don't even really have a big breaking change.

And +1 to allocate adding the finalizer.
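What RX14 describes above might be sketched like this; the module and method names are made up for illustration, not an actual proposal for stdlib naming:

```crystal
# Hypothetical standard pull-parser interface; each format implements it.
module Serialization::PullParser
  abstract def read_int : Int64
  abstract def read_string : String
  abstract def read_mapping(&block : String ->)
end

# JSON::PullParser would then include this module and implement the
# abstract methods in terms of its existing parsing primitives, so a
# type's deserialization code only talks to the standard interface.
```

No intermediate representation is built here: each call goes straight from the type's deserialization code to the format's parser, which is why there would be no extra allocations.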

@RX14 I still don't understand how you would avoid that indirection and those extra allocations. I guess I'd still have to see a proof of concept implementation to understand it :-)

I was wrong about allocate adding the GC finalizer. new invokes allocate, then initialize, then adds the finalizer. But if initialize raises then the finalizer must not be added. So I guess we need a method that does all of that at once, as @bcardiff suggests.
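The three steps described above, combined into one method as @bcardiff suggests, could look roughly like this (a sketch of what the compiler-generated new already does, not actual compiler output):

```crystal
# Sketch: allocate, initialize, then register the finalizer, in order.
def self.new(*args)
  obj = allocate
  obj.initialize(*args) # if this raises, the finalizer was never registered
  GC.add_finalizer(obj) if obj.responds_to?(:finalize)
  obj
end
```

The ordering is the point: registering the finalizer only after initialize succeeds is what a bare allocate can't guarantee on its own.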

i think include is not that bad, and it's also the ruby style: you extend a class with some new behavior by including a module in it. Adding it to Object would be more magical.

@asterite because there's no indirection. It's just a set of methods pull parsers and emitters implement.

Then, how about we keep JSON::Serializable but the deserialization doesn't happen through a constructor? That way we don't have to use the {{@type}} hack, which also currently breaks other constructors (as I showed above). And that way we don't need to use the included and inherited hacks either.

sounds cool, but what would it look like? any example?

@kostya This commit from this branch has it working. As you can see:

  • the included, inherited and dummy parameter are all gone
  • I had to fix specs to specify initialize, because now JSON serialization doesn't count as an initializer (and it avoids the bug I mentioned above). I think this is good, but some can find it annoying
  • new is replaced with from_json. To implement from_json we use allocate and initialize_from_json. It's "unsafe", but so was the old new method.

i think initialize is not a big problem because it can also be generated: https://github.com/kostya/auto_initialize?

@kostya Looks good! So simple :-)

Well, everyone (cc @RX14 @straight-shoota @ysbaddaden) let me know if you think this is a good direction (using from_json instead of new), and I can send a PR for JSON and YAML

I'm not sure I like it. I'm not sure why we need to start adding all these fake constructors with this method, and I don't like that any other library using this method now has to generate custom constructors too. So at the very least it should be coupled with a generic multi-format serialization interface. And making serialization implicit is a no-no; we should have to include something every time. And then there's the cost of everything changing around all the time: people do write custom self.new(PullParser) methods, and now either those manually written methods have to be unsafe and write a self.from_json and an initialize_from_json, or they have to write a self.from_json alias for their existing self.new.

And what about default initialized variables such as @foo = "" which were ignored in the parsing? Wouldn't they break with this new method?

And what are we trying to avoid here? Fixing a compiler bug? I don't get why we have to change things and write unsafe custom-named constructors?

@RX14 I already said that JSON::Serializable would stay, it just wouldn't be a constructor.

This is the main problem:

  • JSON.mapping was a macro that defined a constructor along with some properties. The constructor could then be checked by the compiler to see if all instance vars were properly initialized
  • When we changed it to JSON::Serializable, using compile-time reflection based on the instance vars, the compiler couldn't check anymore that all instance vars are properly initialized, at least not in the first pass (before instantiating methods), because now the initialize is a macro method. So to work around that, I added a rule in the compiler to not check that initialize macro methods initialize everything. This is a hack I introduced and I regret. I tried to fix it several times and can't find a way. That's why the above Foo.new works when it shouldn't, and it's unsafe.
  • Then, because constructor lookup is different from regular method lookup (constructors aren't inherited in the same way), we need to use included and inherited macros to make it work with inheritance. This is a second hack.
  • What I'm proposing now is to implement all of this without the need for compiler hacks nor included/inherited hacks, and also without JSON::Serializable creating a constructor for you, which means you'll have to define a proper constructor, like it has always been the case.

Alternatively, someone can try to fix that compiler hack, but the included/inherited hacks will still be there... and my bet is that this won't be fixed anytime soon (I wrote the code and I don't understand it anymore, so I doubt someone will understand it quickly).

Also, serialization in other languages doesn't involve adding new constructors to types. It's a bit weird that a constructor is injected for you. What you want is just a way to create that object from JSON, and that doesn't have to necessarily be a constructor.

Anyway, this is just a minor thing. People are doing fine with some existing compiler bugs and segfaults, so this has no rush.

So the issue is:

  • To fix the compiler bug, we need to revert the hacks, and make it impossible for JSON::Serializable to work as it is now.
  • The only way to make it work is to generate a "fake" constructor manually using self.allocate.

Okay, that's fine, but why does the new constructor need to be called self.from_json? Why not self.new, and rename the initialize to initialize_from_json just the same? Is there anything special about methods called self.new? I thought there wasn't.

This way it's not a breaking change, just an implementation change.

Yes: if a type defines an initialize method, then the parent's initialize and self.new methods are not inherited. So this doesn't compile:

class Foo
  # Let's assume this is the constructor that gets injected in JSON::Serializable
  def self.new(x, y)
  end

  def initialize(x : Int32)
  end
end

class Bar < Foo
  def initialize(x, y, z)
  end
end

Bar.new(1, 2) # This doesn't work

I'm not sure it's the right logic; it's hard because there's both new and initialize, where in many languages there's just new and the constructor is a fixed construct (there's no self.new in other languages). But, well, it's how it works.

I think I'll close this issue... maybe someone will have to fix the compiler issues/hacks regarding this and we'll be fine.

@asterite No, I really can't imagine how a constructor which is also able to inspect all the instance variables using macros could work. I think this should be removed.

I think that manually-defined self.new should be inherited normally, only the autogenerated ones from def initialize should not be inherited.

Besides, I don't know any other language which inherits global ("static") methods anyway. What's the usecase?

And how can serialization make sense with inheritance anyway? I think it only makes sense for Java and C# to have "magic" serialization together with inheritance because they can always stick null in all of their variables. A subclass magically getting all its ivars serialized without asking is a bit weird to me.

I think I'm starting to dislike JSON::Serializable entirely; perhaps the only problem with JSON.mapping was people treating these like objects, adding ivars which aren't serialized, and using inheritance and weird things. Maybe we should have just a json_record macro which is literally JSON.mapping but generates the whole class, and force people to use that. Perhaps that would make people stop adding behaviour and using inheritance with JSON stuff. If they need all of that, they can always wrap it. But then there's all that extra typing.

Besides, I don't know any other language which inherits global ("static") methods anyway.

What do you mean? Ruby does it, PHP also.

We only do it because ruby does it. PHP is a joke.

@RX14 Java and Python do too... Please, check your facts before posting.

Oh, I thought that java didn't inherit static methods. Never mind then. My bad.

I'm closing this. The state of the art for JSON/YAML looks good so far, even though there are some corner cases. Sorry for the noise.

@asterite maybe wait to have other opinions from e.g @ysbaddaden and @bcardiff, before closing abruptly like you did?

Nah. I opened it, I can close it.

I mean, I'm no longer interested in this happening.

I'm definitely interested in fixing the bug. And I think reverting the compiler patch that made the current JSON::Serializable work and then doing it manually using def initialize_with_json is fine. It's doable now.

Then you should open a separate issue. The original issue was to have from_json/to_json available in all objects. And you are right that initialize_with_json is unsafe, so that shouldn't be done.

I like the idea of taking a step back from an immediate implementation. Let's think about this for some time. The custom annotations feature is quite young (and awesome!), let's experiment with that. We can explore alternate ways to implementing serialization - it couldn't hurt to look more deeply at other implementations. For now, the current way should be fine. I'm pretty sure we'll end up with some changes sooner or later, but time will tell.

@RX14

Maybe we should have just a json_record macro which is literally JSON.mapping but it generates the whole class and force people to use that.

This sounds a bit like Java Beans. Might not be a bad idea. But maybe it is. 🤷‍♂️

It's definitely beneficial to question the current (and proposed) approaches and think about new ideas.

Aparte (not exactly this topic) about Rust's Serde: its "data model" is an API interface to a fixed list of supported data types (29 in total) such as integers, floats, strings, arrays and hashes (but also structs, enums, tuples, ...).

There are 2 traits (interfaces), Serialize and Deserialize, to be implemented. They don't have to allocate more memory than needed; they're just an interface. It's similar to our X::PullParser and X::Emitter, but with a standard interface (e.g. emitter.map instead of json.object or yaml.mapping or ...).

Having a standard interface for (de)serialization would be as wonderful as having a standard interface for IO.

Going back to this topic: Rust's Serde can automatically generate the implementation using an explicit #[derive(Serialize, Deserialize)], similar to include JSON::Serializable, but it doesn't have to generate a constructor, because Rust's structs always have a literal initializer (e.g. Point { x: 1, y: 2 } for a struct with fields x: i32 and y: i32).

Following the forum discussion, I got this basic working draft in this gist, which may or may not be the right approach.

At least, it shows an idea of a possible API.

struct Point
  getter x : Int32
  @[Serialization(key: "YYY")]
  getter y : String

  def initialize(@x, @y)
  end
end

data = %({"x": 1, "YYY": "abc"})

point = Point.from(JSON, data)
puts point # => Point(@x=1, @y="abc")
{JSON, YAML}.each do |type|
  puts point.serialize type
  # => {"x":1,"YYY":"abc"}
  # ---
  # x: 1
  # YYY: abc
end

Or, do we want to go through bytes for serializing, like Go marshaling?

Additional note

Having multiple serialization mechanisms (2, actually) isn't ideal, even less so if we add another one.
Having a plan to keep only 1, at most, would be good to have before 1.0.
