Marshmallow: dump-only fields are passed when unknown = INCLUDE

Created on 6 Jul 2018  ·  15Comments  ·  Source: marshmallow-code/marshmallow

When unknown=INCLUDE, dump-only fields are passed as unknown (and not validated).

This is due to https://github.com/marshmallow-code/marshmallow/pull/865 and is a side-effect of being considered unknown, but is this really what we want?

Modified test:

    def test_dump_only_fields_considered_unknown(self):
        class MySchema(Schema):
            foo = fields.Field(dump_only=True)

        with pytest.raises(ValidationError) as excinfo:
            MySchema().load({'foo': 42})
        err = excinfo.value
        assert 'foo' in err.messages
        assert err.messages['foo'] == ['Unknown field.']

        # Test with unknown=INCLUDE
        data = MySchema(unknown=INCLUDE).load({'foo': 42})
        assert 'foo' not in data  # fail

Excluding them seems inconsistent (why exclude a dump-only field and not a real unknown field if dump-only are treated as unknown?) but less risky.

I'm not saying this is a bug, but I just got surprised by it and it is not explicitly tested, so I figured I'd ask.

feedback welcome

Most helpful comment

I also find the current behavior to be the most intuitive. Unless there are any strong objections, I'll plan on merging #981 within the next few days.

All 15 comments

Good catch. I agree that excluding dump-only fields seems inconsistent. So I think the current behavior is correct. It'd be a good idea to explicitly test this behavior. Go ahead and add that assertion if you have time; otherwise, I can do it after work today.

I'm going to put a vote in favor of excluding dump-only fields when they're passed in. The dump-only field is "known," it's just a one-way sort of deal.

Also, if dump-only fields aren't excluded, it makes unknown=INCLUDE essentially override (without any validation) dump_only=True. This is particularly troublesome since unknown is declared on the Schema, and dump_only is declared on a field; specific (a field parameter) should take precedence over general (a schema parameter).

Of course, it would be _really_ nice to have a way to configure this, so we can each have our own opinions, but maybe that's ugly.

I'm going to put a vote in favor of excluding dump-only fields when they're passed in. The dump-only field is "known," it's just a one-way sort of deal.

I think this describes it well.

I can try to make PR for this. I just need to understand it better first because:

Similar problem seems to exist with unknown=RAISE: dump_only fields are first skipped in deserialization, but then regarded as unknown, which will lead to error if you you have unknown=RAISE. Wouldn't it be better to ignore the dump-only fields during loading and not raise error on them (as we're explicitly saying we are not interested about these fields during loading)?

It seems that current logic was explicitly added in https://github.com/marshmallow-code/marshmallow/pull/865 but I'm wondering if that's the right call? While it's consistent with current state of INCLUDE, it's against the intended behavior that this ticket suggests, which would be to:

1) Exclude dump_only fields during load() with unknown=INCLUDE (they are now included), and I think we should also similarly:
2) Not consider dump_only fields as unknown during load() with unknown=RAISE (they are now handled as unknown)

The idea being that in both cases, like @rileyjohngibbs said, dump_only fields are not really _unknown_ but rather known and ignored.

Of course, it would be really nice to have a way to configure this, so we can each have our own opinions, but maybe that's ugly.

Adding more unknown options like INCLUDE_ONLY or EXCLUDE_ALL wouldn't necessarily turn ugly.

Good points, @tuukkamustonen and @rileyjohngibbs . https://github.com/marshmallow-code/marshmallow/pull/865 was implemented for the use case of a client trying to modify a read-only field (e.g. "created_date"). In that case, the client should receive an error.

@lafrech @deckar01 Opinions on this?

dump_only - If True skip this field during deserialization, otherwise its value will be present in the deserialized object. In the context of an HTTP API, this effectively marks the field as “read-only”.

https://marshmallow.readthedocs.io/en/3.0/api_reference.html#marshmallow.fields.Field

"skip" implies that the field is allowed, but it will be excluded. This is consistent with the unknown=INCLUDE behavior, but not unknown=ERROR. I think to make this consistent we would either need to soften the "unknown" validation for dump_only, or harden the definition of dump_only to include validation.

with unknown=RAISE: dump_only ... wouldn't it be better to ignore the dump-only fields during loading and not raise error on them (as we're explicitly saying we are not interested about these fields during loading)?

I don't think so. The goal of unknown=RAISE is to warn developers that they are passing in data that is going to be lost. I think we should harden the definition of dump_only so that it doesn't violate the expectations set by the unknown option.

If dump_only meant "don't allow loading", then unknown=ERROR | INCLUDE would error and only explicitly opting into unknown=EXCLUDE would allow the data to be discarded.

If dump_only meant "don't allow loading", then unknown=ERROR | INCLUDE would error and only explicitly opting into unknown=EXCLUDE would allow the data to be discarded.

Wouldn't it be be confusing if unknown=INCLUDE actually raised error...?

There's also this way to look at things:

1) Regard dump_only fields as _unknown_ (INCLUDE would include them, RAISE would raise them). This is the current behavior.
2) Regard dump_only fields as _known and acknowledged_ (INCLUDE would ignore them, RAISE would ignore them). This is what this ticket suggests.

In that sense the current behavior (1) is _consistent_ between INCLUDE/RAISE. New behavior (2) could be enabled by different unknown choice (e.g. INCLUDE_XXX, RAISE_XXX). But maybe it makes sense to consider the use cases first. Aligned with the above numbering:

1) As already mentioned, "read-only" fields, e.g. "created_at". But when would you want to use INCLUDE rather than RAISE here? Why allow client send fields that are not specified?
2) I don't actually know - in which cases would we want to ignore these fields during load-time?

So is there actual need for (2)? I'm actually starting to think current behavior is correct. Maybe @lafrech and @deckar01 could clarify their use cases?

we would either need to soften the "unknown" validation -- harden the definition of dump_only so that it doesn't violate the expectations set by the unknown option.

Care to elaborate what you're suggesting in practice here?

My interpretation of "include unknown fields, but only dump foo" when "foo" is loaded from serialized data is that it would pass the schema validation, then fail the field validation.

| | Result | Reason |
|---|---|---|
| RAISE | Schema error | The field key is not declared on the schema. |
| INCLUDE | Field error | The field does not allow a value during load. |
| EXCLUDE | No error, No field value | The value was discarded during the schema validation. |

Speaking to the use case of ignoring dump_only fields even when unknown=INCLUDE (@tuukkamustonen's second way of looking at things):

We are starting to use marshmallow for our data marshaling and validation for our web API. We have some schemas with fields that we do not want the user setting (via PUT and POST calls), and thus would like them to be dump_only. But we would still like the ability to accept arbitrary fields, such as feature flags, without necessarily having to add those fields to the schema.

Being able to ignore dump_only fields would add a layer of protection (albeit not the only layer) that these fields don't leak into the model and possibly get persisted on the underlying resource.

We have some schemas with fields that we do not want the user setting (via PUT and POST calls), and thus would like them to be dump_only. But we would still like the ability to accept arbitrary fields, such as feature flags, without necessarily having to add those fields to the schema.

True, in that context it would make sense to exclude dump_only fields from unmarshalling result with INCLUDE, but it would make even more sense to raise as @deckar01 suggests.

So, should we consider dump_only fields as _known_ and disallowed? That would mean that unknown option would be irrelevant (be it INCLUDE/RAISE/EXCLUDE) because we are not handling an unknown field, but rather known with contract (being read-only). So with any unknown value, marshmallow should raise error (_"Field is read-only."_)?

Or did @deckar01 mean that EXCLUDE would _not_ raise error if dump_only field is given? I think it would be better and more consistent to raise also then (never allow assinging value to a read-only field)?

Being able to ignore dump_only fields

If dump_only is intended for _read-only_ fields, you should never want to ignore them? Just fail fast - if client provides those fields, give error (or client will _think_ that the field it sent actually had impact).

What if dump_only accepted RAISE/EXCLUDE values and could be set to EXCLUDE to explicitly opt in to the silently ignore behavior?

Schema level dump_only takes a list of field names - how would it work with that?

And is there real need to make it configurable? When would you want to silence _read-only_ fields?

Regarding https://github.com/marshmallow-code/marshmallow/issues/875#issuecomment-412692506 and the INCLUDE case

The field does not allow a value during load

INCLUDE means accept all keys and if there is no corresponding field, copy the value verbatim (no validation). It applies if the key is unknown, in which case "the field" is undefined.

If we respect the dump_only property of the corresponding field here to reject the value at field level, then why not also apply its validators? I know the dump_only parameter is not really a validator, but still, it seems inconsistent.

dump_only should have been named something like do_not_load as it makes more sense when using both dump_only and load_only, and I interpret it as _let's pretend this field does not exist when loading_.

Maybe current behaviour is the right one after all.

It does not fit the use case described in https://github.com/marshmallow-code/marshmallow/issues/875#issuecomment-412723132 but neither does marshmallow 2.x. The way I make APIs (just my use case, I don't claim any expertise), I wouldn't use INCLUDE as I wouldn't want garbage to leak in my model and database and I rely on marshmallow to do all the cleanup. @rileyjohngibbs, isn't there a way you could declare all the feature flags as fields? It could be in a mixin Schema all Schemas would inherit from, and it could be created dynamically if there are too many flags.

Overall, I'm in favour of keeping current behaviour. Maybe add a red flag in the docs. And a test.

I'd rather not add another layer of bug-prone complexity to try to provide too much flexibility (YAGNI).

I sent a PR to document and test current behaviour: https://github.com/marshmallow-code/marshmallow/pull/981.

Any objection, anyone?

I also find the current behavior to be the most intuitive. Unless there are any strong objections, I'll plan on merging #981 within the next few days.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ambye85 picture ambye85  ·  4Comments

imhoffd picture imhoffd  ·  3Comments

DenisKuplyakov picture DenisKuplyakov  ·  4Comments

Ovyerus picture Ovyerus  ·  3Comments

nickretallack picture nickretallack  ·  4Comments