Serde: Advanced Serde

Created on 5 Sep 2017 · 9 Comments · Source: serde-rs/serde

Let's put together some resources geared toward experienced Serde users to show off some examples of advanced usage. Hopefully this gives experienced users a way to level up even further and invent more amazing things.

docs

All 9 comments

Here is one to start off:

You have a large JSON object and you only care about a few of the keys. If you know at compile time what keys you care about, then serde_derive works great and you only put the keys you want in your struct. Other keys are efficiently ignored.

But if the keys you care about are only given at runtime, that won't work. Ordinarily you would deserialize the entire JSON document, whether into a struct or a Value, and then pull out what you need.

The link demonstrates an efficient way to deserialize only certain nested keys, not known at compile time.

Ordinarily when you serialize a struct containing Rc nodes, the content of a single Rc is duplicated in each place the Rc is referenced. This is redundant and may be expensive if the Rc contains a large data structure. The serialized representation may be exponentially large compared to the size of the input data, depending on the structure of the DAG. Also it is lossy because when you deserialize, each Rc now points to its own copy of the data rather than one shared copy. We end up with a tree rather than the original DAG.

The link demonstrates an efficient way to serialize a DAG of Rc nodes using backreferences for nodes that have already been serialized. Deserialization then builds the same DAG we started with, correctly sharing data.
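To make the backreference idea concrete, here is a minimal std-only sketch. It deliberately does not use Serde's traits and is not the code behind the link; the Node type, the Token encoding, and the encode, decode and diamond functions are all hypothetical names invented for illustration. The first visit to an Rc emits a full definition with a fresh id, and every later visit to the same allocation emits only a reference to that id.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical DAG node: a label plus children that may be shared.
#[derive(Debug)]
struct Node {
    label: String,
    children: Vec<Rc<Node>>,
}

/// Hypothetical flat encoding (an assumption, not a Serde format).
#[derive(Debug, Clone, PartialEq)]
enum Token {
    /// First visit: defines node `id`; its `child_count` children follow.
    Def { id: usize, label: String, child_count: usize },
    /// Later visit: backreference to an already defined node.
    Ref(usize),
}

fn encode(root: &Rc<Node>) -> Vec<Token> {
    fn walk(node: &Rc<Node>, seen: &mut HashMap<*const Node, usize>, out: &mut Vec<Token>) {
        // Identity of the allocation, not equality of the contents.
        let key = Rc::as_ptr(node);
        if let Some(&id) = seen.get(&key) {
            out.push(Token::Ref(id));
            return;
        }
        let id = seen.len();
        seen.insert(key, id);
        out.push(Token::Def {
            id,
            label: node.label.clone(),
            child_count: node.children.len(),
        });
        for child in &node.children {
            walk(child, seen, out);
        }
    }
    let mut out = Vec::new();
    walk(root, &mut HashMap::new(), &mut out);
    out
}

fn decode(tokens: &[Token]) -> Rc<Node> {
    fn build(tokens: &[Token], pos: &mut usize, table: &mut HashMap<usize, Rc<Node>>) -> Rc<Node> {
        let token = tokens[*pos].clone();
        *pos += 1;
        match token {
            // Share the node that was built earlier instead of copying it.
            Token::Ref(id) => Rc::clone(&table[&id]),
            Token::Def { id, label, child_count } => {
                let mut children = Vec::with_capacity(child_count);
                for _ in 0..child_count {
                    children.push(build(tokens, pos, table));
                }
                let node = Rc::new(Node { label, children });
                table.insert(id, Rc::clone(&node));
                node
            }
        }
    }
    let mut pos = 0;
    build(tokens, &mut pos, &mut HashMap::new())
}

/// Example input: root -> {a, b}, and both a and b point at one shared leaf.
fn diamond() -> Rc<Node> {
    let shared = Rc::new(Node { label: "shared".to_owned(), children: vec![] });
    let a = Rc::new(Node { label: "a".to_owned(), children: vec![Rc::clone(&shared)] });
    let b = Rc::new(Node { label: "b".to_owned(), children: vec![shared] });
    Rc::new(Node { label: "root".to_owned(), children: vec![a, b] })
}
```

Calling decode(&encode(&diamond())) yields a graph in which both paths reach the same allocation again, which Rc::ptr_eq can confirm. The sketch relies on the input being acyclic: a backreference can then only appear after the node it names has been fully built.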

The concept of serializer adapters and deserializer adapters in general, but a specific example:

Such adapters are verbose to implement, but the concept is powerful and flexible. This one gives the ability to efficiently detect unused keys during deserialization across any self-describing data format. For example Cargo uses this to generate warnings about extraneous keys in Cargo.toml.

The link shows how the toml crate exposes a way to interact with TOML's notion of a datetime. This is a clever example of mapping into Serde's data model. Essentially, the data format's Serializer and Deserializer get to define the bidirectional mapping between TOML datetimes and Serde's data model, and then take advantage of this mapping in the toml::Value and toml::Datetime implementations of Serialize and Deserialize.

https://github.com/serde-rs/serde/issues/1041#issuecomment-327203961

I'd like to mention https://github.com/Marwes/serde_state for the case of serializing DAGs. It lets you skip most of the boilerplate otherwise required for this sort of serialization by allowing state to be passed implicitly through the serialization or deserialization process.

Of course it can also be useful in any other case where state is required for proper serialization, but serializing DAGs is what I use it for.

@Marwes - would this let me pass a 'state' along with a serializer when serializing something? Currently I'm generating a JSON graph, and I need some out-of-band state to serialize URLs (i.e. some data that does not live in the object graph being serialized).

If so that would be cool.

For the benefit of anyone else reading this, I currently solve this in one of three very clunky ways:

  • Implement Serialize for the parent of the object graph being serialized and re-implement most of Serde (yuck)
  • Pass the state in static thread-local fields (yuck)
  • Use Rust's unstable specialization, by creating a trait that provides the state, implementing it generically for all Serializers and then providing a specialization for the serializer I'll actually use (yuck)
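The thread-local workaround from the list above can be sketched with std alone. This is an illustration under stated assumptions and not the poster's code: std::fmt::Display stands in for Serialize here, because both have a fixed method signature with no room for an extra state argument, and BASE_URL, with_base_url and Page are hypothetical names.

```rust
use std::cell::RefCell;
use std::fmt;

thread_local! {
    // Hypothetical out-of-band state: the base used to render URLs.
    static BASE_URL: RefCell<Option<String>> = RefCell::new(None);
}

/// Installs `base` for the duration of `f`, then restores the previous value,
/// even if `f` panics.
fn with_base_url<R>(base: &str, f: impl FnOnce() -> R) -> R {
    struct Reset(Option<String>);
    impl Drop for Reset {
        fn drop(&mut self) {
            BASE_URL.with(|slot| *slot.borrow_mut() = self.0.take());
        }
    }
    let previous = BASE_URL.with(|slot| slot.borrow_mut().replace(base.to_owned()));
    let _reset = Reset(previous);
    f()
}

/// Hypothetical value in the object graph; it does not store the base itself.
struct Page {
    path: String,
}

// Display plays the role of Serialize: the trait method cannot take extra
// arguments, so the implementation reads the state from the thread-local.
impl fmt::Display for Page {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let base = BASE_URL.with(|slot| slot.borrow().clone());
        match base {
            Some(base) => write!(f, "{}/{}", base, self.path),
            // No state installed: fall back to the relative path.
            None => write!(f, "{}", self.path),
        }
    }
}
```

Wrapping the top-level call, for example with_base_url("https://example.org", || page.to_string()), makes the state visible to every nested implementation on that thread. The "yuck" is visible too: the dependency is hidden from the type signatures, and nothing stops a caller from forgetting the wrapper.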

@raphaelcohn Yep, that should work: you just need to create the state and pass it in at the top level (see https://docs.rs/serde_state/0.4.0/serde_state/ since the docs for 0.4.1 seem to be broken).

@Marwes - ta!

Not sure if this is the right place to ask, but are there resources on how to do efficient multi-pass deserialization? In ruma-events, we recently started adding Deserialize implementations for a Result-like type (with a fixed Err type), where deserialization succeeds with the Ok variant if deserializing into the Ok variant's type worked, but still succeeds if that failed and deserialization into serde_json::Value is possible.

Our current implementation allocates more than we would like in the 'happy path' (deserialization succeeds with the Ok variant). That is because we haven't found a better way than deserializing into serde_json::Value first and then using serde_json::from_value to try to deserialize the Ok variant's inner type. If we try deserializing the inner type first, there is no way of trying again with serde_json::Value, because the deserializer has been consumed. AFAICT, the only improvement that can be made is allocating less in the first deserialization pass.

To that end, I've looked at two possible solutions:

  • Use serde_json::RawValue: This looked good at first, but it doesn't provide an equivalent of from_value, only a method that returns a JSON string slice. Re-parsing that feels like a pretty ugly solution (the information the first deserialization pass collects about the input structure is not reused).
  • Use the serde-value crate: Unfortunately, it currently only provides a Value type that copies everything from the deserializer (see also https://github.com/arcnmx/serde-value/issues/15). It might still be faster than serde_json::Value deserialization, but I really don't know.

Have I missed something? Is this a use case that hasn't come up yet? Or is it just something that some people are interested in, but nobody cares enough to put in the required work? We really don't need the extra performance at the moment and haven't benchmarked the existing solutions; I am mainly asking out of curiosity.
