Serde: Repeated, Interleaved Tags in XML

Created on 21 Jan 2020  路  7Comments  路  Source: serde-rs/serde

This is a follow-up from here. I would like to know if you consider this a bug. I suspect that this is particular to XML which allows repeated tags.

The following fails to de-serialize with Serde.

<Xml>
    <field1 name='a'/>
    <field2 name='b'/>
    <field1 name='a'/>
</Xml>

Note how field1 is repeated but with field2 interleaved. (See linked issue [1] for complete example including version numbering.)

Error:

Error: Custom("duplicate field `field1`")',

Is there something that can be done with this in _serde_? Or is there anything that can be done to fix this on the quick_xml side?

The workaround is extremely cumbersome.

Most helpful comment

Duplicate Interleaved tags is also a problem in JSON.
For example:

{
    "field1": "a",
    "field2": "b",
    "field1": "c",
}

Duplicate tags are allowed: https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object
So this should also be allowed.

This might seem as a issue that does not come up. But I had this issue with a Wireshark export in: https://github.com/serde-rs/json/issues/652 And later with interleaved tags in Wireshark. Did not report it as I solved it with this https://github.com/ralpha/serde_deserializer_best_effort. But that is more of a hack then a solution.

And just our of curiosity:

this is far from the only such missing piece

What are the pieces are you referring too?

it would be better for the ecosystem if we build strongly typed XML libraries on a separate derive macro dedicated to XML

What do you suggest, just a fork of Serde that is changed for this (any maybe other xml) behavior? Or would you suggest starting with something totally different?

All 7 comments

I have run into the same issue(also a quick-xml user), where duplicate field name is actually expected for collecting into a vec, but the values are different.

<material name="mtr_head" render_template="solid_mask" unique="true" version="2">
    <diffuse_texture file="units/pd2_dlc_ami/masks/ami_01/ami_01_df" />
    <bump_normal_texture file="units/pd2_dlc_ami/masks/ami_01/ami_01_nm" />
    <variable value="0" type="scalar" name="alpha_ref" />
    <reflection_texture file="units/payday2/matcaps/matcap_plastic_df" />
    <material_texture file="units/payday2/masks/shared_textures/patterns/pattern_no_color_no_material_df" />
    <variable value="0 0 0" type="vector3" name="tint_color_a" />
    <variable value="0 0 0" type="vector3" name="tint_color_b" />
    <variable value="1" type="scalar" name="material_amount" />
</material>

This has worked well, up until running into a file with contents above where a stray variable field appears earlier, causing the same duplicate field error. It seems to parse the first field(s), until that is interrupted, and a subsequent sequence will raise the error instead of collect into the same vec.

Probably not of much value, but related serde rust structs:

#[derive(Debug, Deserialize, PartialEq, Default)]
struct Material {
    name: String,
    unique: Option<bool>,
    render_template: String,
    version: u8,

     // filters via custom deserializer for `_texture` suffix entries
     #[serde(flatten, deserialize_with = "de_texture")]
     textures: HashMap<String, Texture>,

    // Collects `variable` entries and through some `From` impl, converts to HashMap, 
    // grouping related variables
    #[serde(rename = "variable", default)]
    variables: VarHash, // Could just be `Vec<Variable>` for this example
}

#[derive(Debug, Deserialize, PartialEq, Default)]
struct Variable {
    name: String,
    #[serde(rename = "type", default)]
    _type: String,
    value: String,
}

// Vec<Variable> => HashMap<String, <HashMap<String, String>> => HashMap<String, Vars>
#[derive(Debug, Deserialize, PartialEq, Default)]
#[serde(from = "Vec<Variable>")]
pub struct VarHash(HashMap::<String, Vars>);

I have shared on the linked quick-xml issue(https://github.com/tafia/quick-xml/issues/177) that I have worked around the issue for my data with two similar deserializers implemented for visit_map as the enum workaround isn't practical for my data source and gets in the way more.

If you don't have many fields(mine are variable fields otherwise fields with a specific suffix), then it's probably a better approach. In the case of variable, I would have expected it to be unnecessary, rename = "variable" should be able to collect all instances in the same manner as my deserializer did by filtering through all children?

I don't think there is any other easy way to work around this issue, it may be a limitation/feature of serde.

So to answer the first question, I don't consider it a bug and while I don't like to need to resort to workarounds I don't have anything better to propose for now.

I came across this issue multiple times now, I have tried to solve it.
I implemented my own deserializer. And got it working! :smiley:
I also tried implementing a derive macro. And after a lot of trail and error I got it working.
The result is not pretty, but you can find it here:
https://github.com/ralpha/serde_deserializer_best_effort
(it will probably not solve your problem, but might help you)

I included both the manual implementation and the derive trait.
The code could be so much nicer if it was included in the serde crate.
But I was not very familiar with writing macro's before this.
(A good thing that compilers course from a few year ago came in handy :wink: )
I don't know if this behavior could be implement in this or the serde crate/repo.

TL;DR: Solved it, but with its limitation: https://github.com/ralpha/serde_deserializer_best_effort

Also other linked issues:
https://github.com/RReverser/serde-xml-rs/issues/55
https://github.com/serde-rs/serde/issues/690
https://github.com/tafia/quick-xml/issues/177

I have been looking into this a bit more.
I looked at the serde-xml-rs code and the serde code. And the problem seems to be in serde not in serde-xml-rs or other packages that might inherit this problem.

(this paragraph is unrelated to the topic)
I tried to run and change it locally got some errors when compiling. Simplified the https://github.com/ralpha/serde_deserializer_best_effort code a bit for this local test and git clone the serde and serde-xml-rs repo. If I can run it locally I could try to implement a solution.

error[E0283]: type annotations needed
  --> src/main.rs:84:21
   |
84 | fn read_xml_file<C: serde::de::Deserialize<'_> +  DeserializeOwned>
   |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^ cannot infer type for type parameter `C`
   |
   = note: cannot resolve `C: _IMPL_DESERIALIZE_FOR_RootNotWorking::_serde::Deserialize<'static>`
   = note: required by `_IMPL_DESERIALIZE_FOR_RootNotWorking::_serde::Deserialize`

The error (when using #[derive(Deserialize)]) is generated by https://github.com/serde-rs/serde/blob/29be721f7929bc2a8f33c6e1d28225346558afd5/serde_derive/src/de.rs#L2428

So if we could do a test just before this line and check if the type we are trying to give a new value to is of a collection type, or just Vec<>, HashMap<>,... we can add the value to the list instead of giving an error.
We know the type of the field at compile-time so this could be resolved by the compiler (when generating this code)

let deser_name = field.attrs.name().deserialize_name();
let field_type = field.ty; // <--------- Adding this line, for example --------------

let visit = match field.attrs.deserialize_with() {..}

There are a few ways we could do this, here are 2 options:

Option 1

Do a test if the type implements the Extend trait (this include most of the wanted types). Although I'm not sure if or how this could be done (at compile-time or run-time). Maybe something like this: https://stackoverflow.com/a/30275713/2037998 We could generate the additional code for the check only if the type support the Extend trait. (if we can check this at compile time)

Option 2

This option is easier (but might be considered dirtier?). This is how I approached it in my proof of concept above. I added a new trait for all deserializible types ( DeserializeBestEffortTypes in my code ) We then know that all types have this trait (this could be the Deserialize<'de>) we can then add a new function (add_deserialized_data for example) to this trait.
This can be done in https://github.com/serde-rs/serde/blob/29be721f7929bc2a8f33c6e1d28225346558afd5/serde/src/de/mod.rs#L531 (Or somewhere that might fit better.) We can then implement this for every type (String, u32, ...) where it does exactly the same as it does now. And for Vec<>,... where we add the item to the list.
(code example here: https://github.com/ralpha/serde_deserializer_best_effort/blob/master/src/deserialize_best_effort.rs)

We know that every type has this trait and can thus call the function to add the value.
So we can do this:

__Field::#name => {
    if let Some(#value) = #name {
        if #value.deserialize_extendable(){ // check added to not change current behavior
              #value.add_deserialized_data(#visit);
        }
        return _serde::export::Err(<__A::Error as _serde::de::Error>::duplicate_field(#deser_name));
    }else{
        #name = _serde::export::Some(#visit);
    }
}

This could also allow behavior like what serde_with does now, but build in. Depending on how deserialize_extendable() and add_deserialized_data(#visit) are implemented.

This behavior could even be hidden under a new Container attributes to not break old code with manual implementations of Deserializer if that is a big concern.

This quickly gets complicated if you are not to familiar with the code, so sorry if it is not clear.

XML has always been a poor fit for Serde in my experience. I'll go ahead and close this because I think it would be better for the ecosystem if we build strongly typed XML libraries on a separate derive macro dedicated to XML, rather than pursuing XML-oriented features in Serde's traits and derives (this is far from the only such missing piece).

Duplicate Interleaved tags is also a problem in JSON.
For example:

{
    "field1": "a",
    "field2": "b",
    "field1": "c",
}

Duplicate tags are allowed: https://stackoverflow.com/questions/21832701/does-json-syntax-allow-duplicate-keys-in-an-object
So this should also be allowed.

This might seem as a issue that does not come up. But I had this issue with a Wireshark export in: https://github.com/serde-rs/json/issues/652 And later with interleaved tags in Wireshark. Did not report it as I solved it with this https://github.com/ralpha/serde_deserializer_best_effort. But that is more of a hack then a solution.

And just our of curiosity:

this is far from the only such missing piece

What are the pieces are you referring too?

it would be better for the ecosystem if we build strongly typed XML libraries on a separate derive macro dedicated to XML

What do you suggest, just a fork of Serde that is changed for this (any maybe other xml) behavior? Or would you suggest starting with something totally different?

Was this page helpful?
0 / 5 - 0 ratings