Marshmallow: Reduce confusion around fields.Pluck behavior

Created on 20 Jul 2019 · 6 Comments · Source: marshmallow-code/marshmallow

Follow-up to discussions in https://github.com/marshmallow-code/marshmallow/pull/1298#issuecomment-513354502 and https://github.com/marshmallow-code/marshmallow/issues/1311.

@deckar01 wrote:

My confusion with the behavior of Pluck concerns me a little. Currently Pluck = flat -> nested -> flat, but it is easily confused with this use case: nested -> flat -> nested. Pluck describes the operation during serialization, which is similar to Function and Method, but during deserialization it really performs the opposite operation. Would it make sense to provide a field that performs the opposite operation? Nest?

@sloria wrote:

I agree that Pluck is easily misunderstood. I improved the documentation a bit in #1314. Perhaps we could improve the naming to make it clearer, as well as provide a field for the reverse operation (Nest? Deep?).

One possible solution might be to rename (or alias, to not break compat) Pluck to Flat and provide a Deep field which does the opposite (what I called Reach in https://github.com/marshmallow-code/marshmallow/issues/1311#issuecomment-513478342).

Feedback welcome!

sloria

👍1

Most helpful comment

I've been revisiting the use of Reach, which happened to be required in the second use case of my project using Marshmallow, and have spent a bit of time trying to sketch a "back of the napkin" solution for this.

Before I begin: another issue I ran into when using Reach with identical data keys was that every validation error read {'data_key': ['Not a valid integer.']}. Since almost all my fields had the same data_key, and the error did not include the value that failed to validate, the validation errors may as well have been printing nothing.

Alright, back to the "dotted path issue".

Personally I feel that the use case of reading some value out of a nested location on source data should be supported at the base Field level. I don't think the user should have to write Reach fields, Pluck fields, or any other field variant that requires one of its arguments to define the _actual data type_ of the field.

With your point in mind that some JSON keys may themselves contain '.' characters, I have tried to lay out a test case with some worst-case fields and then figure out how I'd like to be able to use fields.Integer() (for example) to read the various pieces of data.

Here it is:

    {
        'nested': {
            'foo': 1,
            'bar.baz': {
                'foo': 2,
            },
            'bar': {
                'baz': 3,
            },
        },
        'nested.foo': 5,
    }

I liked the idea of the keyword argument path from the Reach recipe and thought it could be possible to simply incorporate it into a base field. You'd then be able to choose to use path (for deep json lookup) or data_key for the simple use case of pulling it from the top level.

I also realised with my above example that path could not simply be a dotted string, since we'd have no way to know what to return for path='nested.bar.baz': it could mean either nested -> bar -> baz or nested -> 'bar.baz'.
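The ambiguity can be shown with plain Python, without marshmallow. This is only a sketch: get_path and dotted_candidates are hypothetical helper names, not marshmallow functions, and the data is the example dict from above.

```python
def get_path(data, path):
    """Follow a list of keys into nested dicts (KeyError if a key is missing)."""
    current = data
    for key in path:
        current = current[key]
    return current


def dotted_candidates(data, dotted):
    """Return every value that a dotted string could refer to."""
    parts = dotted.split(".")
    results = []

    def walk(node, start):
        if start == len(parts):
            results.append(node)
            return
        if not isinstance(node, dict):
            return
        # Try joining 1, 2, ... consecutive parts into a single key.
        for end in range(start + 1, len(parts) + 1):
            key = ".".join(parts[start:end])
            if key in node:
                walk(node[key], end)

    walk(data, 0)
    return results


data = {
    "nested": {"foo": 1, "bar.baz": {"foo": 2}, "bar": {"baz": 3}},
    "nested.foo": 5,
}

# A dotted string matches more than one location.
print(dotted_candidates(data, "nested.bar.baz"))  # [3, {'foo': 2}]
print(dotted_candidates(data, "nested.foo"))      # [1, 5]

# A list of keys is unambiguous.
print(get_path(data, ["nested", "bar.baz", "foo"]))  # 2
print(get_path(data, ["nested", "bar", "baz"]))      # 3
print(get_path(data, ["nested.foo"]))                # 5
```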

I propose something along the lines of passing the following kwargs to fields.Integer(). Even in this fairly extreme situation there is no uncertainty about _where_ the data should be read from.

    fields.Integer(path=['nested', 'bar.baz', 'foo'], data_key=None)  # -> 2
    fields.Integer(path=['nested', 'bar', 'baz'])                     # -> 3
    fields.Integer(data_key='nested.foo')                             # -> 5

I'd also be happy with data_key being passed as _either_ a single string or a list of strings; however, this adds some annoying type checking all over the place when initialising fields.
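For a sense of what that type checking might look like, here is a rough sketch. normalize_data_key is a hypothetical helper, not something marshmallow provides, and the choice of exceptions is an assumption.

```python
def normalize_data_key(data_key):
    """Turn a str-or-list data_key into a list of keys (None stays None)."""
    if data_key is None:
        return None
    # Check for str first: a str is itself iterable, so list("a.b")
    # would silently split it into single characters.
    if isinstance(data_key, str):
        return [data_key]
    keys = list(data_key)
    if not keys:
        raise ValueError("data_key must contain at least one key")
    if not all(isinstance(key, str) for key in keys):
        raise TypeError("every element of data_key must be a string")
    return keys


print(normalize_data_key("nested.foo"))              # ['nested.foo']
print(normalize_data_key(["nested", "bar", "baz"]))  # ['nested', 'bar', 'baz']
```

With a helper like this, a plain string keeps its current meaning (one top-level key, dots included) and only a list triggers nested lookup.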

I'd be keen to belt out an MR for this (or something similar that lets me use base Field types), which would also solve the duplicate data_key issue and the validation-error data_key issue, and would mean I don't have to use the ReachField() workaround. Thoughts?

davidlouis on 5 Aug 2019

👍5

All 6 comments

Some feedback on using Reach from my travels so far:

I had some trouble deserializing data similar to:

    {
        'foo': 1,
        'data': {
            'bar': 1,
            'baz': 2,
        }
    }

Using the following schema:

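    from marshmallow import Schema, fields

    # Reach is the custom field from the recipe in issue #1311; it is not part of marshmallow.
    class ReachTestSchema(Schema):
        foo = fields.Integer()
        bar = Reach(fields.Integer(), data_key="data", path="bar")
        baz = Reach(fields.Integer(), data_key="data", path="baz")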

The initialisation of this schema failed because of the duplicated data key "data": I needed to load two values from the same nested JSON key. I have temporarily worked around this by creating my own ReachSchema that subclasses Schema and overrides _init_fields to remove the duplicate data key validation check.

I hope to revisit this soon, but it's something to consider for anyone else looking at this.

davidlouis on 23 Jul 2019

Ah, good point--I hadn't considered duplicate data_key. Might have to rethink the recipe a bit.

In the specific case you posted, it might be better to use a post_load method:

    from marshmallow import Schema, fields, post_load


    class DataSchema(Schema):
        bar = fields.Integer()
        baz = fields.Integer()


    class MySchema(Schema):
        foo = fields.Integer()
        data = fields.Nested(DataSchema)

        @post_load
        def flatten_data(self, in_data, **kwargs):
            data = in_data.pop("data")
            in_data.update(data)
            return in_data


    data = {"foo": 1, "data": {"bar": 1, "baz": 2}}
    print(MySchema().load(data))  # {'foo': 1, 'baz': 2, 'bar': 1}

sloria on 23 Jul 2019

👍2

Thanks for that analysis @davidlouis . I think allowing for data_key to be a list of strings could be a nice solution. Feel free to send a PR (even a work-in-progress one). We might not get to it immediately, since we're focused on 3.0 final, but I do think we'll want a good solution for handling nested keys.

sloria on 13 Aug 2019

Thanks @sloria. Yeah, I like having data_key support a list of strings too. I think it's better than adding a separate path param.
I'll update with a PR when I find some time!

davidlouis on 14 Aug 2019

I would prefer having both path and data_key, because the data_key value is sometimes derived from the field name, e.g. when it's converted from snake_case to camelCase.

If path were an independent parameter, it would be easy to keep the case conversion automatic while still being able to set up nesting.
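A rough sketch of that idea in plain Python, outside of marshmallow: camelcase and load_value are hypothetical helpers, and the payload is invented for illustration. The point is only that path selects the nesting while the final key is still derived automatically from the field name.

```python
def camelcase(name):
    """Convert snake_case to camelCase, e.g. 'first_name' -> 'firstName'."""
    first, *rest = name.split("_")
    return first + "".join(word.title() for word in rest)


def load_value(data, field_name, path=(), data_key=None):
    """Read one value: walk `path`, then look up the (derived) data key."""
    # An explicit data_key wins; otherwise derive it from the field name.
    key = data_key if data_key is not None else camelcase(field_name)
    current = data
    for step in path:
        current = current[step]
    return current[key]


payload = {"userInfo": {"firstName": "Ada", "lastName": "Lovelace"}}

# Only the nesting is spelled out; the camelCase key is derived.
print(load_value(payload, "first_name", path=["userInfo"]))  # Ada
print(load_value(payload, "last_name", path=["userInfo"]))   # Lovelace
```

If the nesting were folded into data_key instead, each field would have to spell out its full key list, including the camelCase name.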