Marshmallow: Remove validation on serialization

Created on 12 Feb 2019 · 17Comments · Source: marshmallow-code/marshmallow

When using Dict(values=Nested()) I expect errors in dictionary values to be nested in such a way that I can locate exactly what element is causing the problem. This seems to work as expected for load(), but not for dump().

Minimal example:

Marshmallow version: 3.0.0rc4
Python version: 3.7.2

from marshmallow import Schema, ValidationError, fields


class InnerSchema(Schema):
    x = fields.Int()


class OuterSchema(Schema):
    inner = fields.Dict(keys=fields.Str(), values=fields.Nested(InnerSchema))


foo = {"inner": {"some-key": {"x": "abc"}}}

try:
    OuterSchema().load(foo)
except ValidationError as e:
    print(e)
    # >>> {'inner': defaultdict(<class 'dict'>, {'some-key': {'value': {'x': ['Not a valid integer.']}}})}
    # Very helpful - and what I expect

class Inner:
    def __init__(self, x):
        self.x = x


class Outer:
    def __init__(self, inner):
        self.inner = inner


bar = Outer(inner={"some-key": Inner(x="abc")})

try:
    OuterSchema().dump(bar)
except ValidationError as e:
    print(e)
    # >>> {'inner': {'x': ['Not a valid integer.']}}
    # Surprisingly - 'some-key' and 'value' are not present here,
    # which makes it harder to pinpoint the problem in large objects

backwards incompat bug

Source

NilsBarlaug-cognite

Most helpful comment

Probably not? handle_error is meant for handling ValidationError, which we're proposing should only get raised on deserialization.

sloria on 17 Jul 2019

👍2

All 17 comments

This seems to be reproducible irrespective of use of the Nested field type:

from marshmallow import Schema, ValidationError, fields


class SomeDictSchema(Schema):
    inner = fields.Dict(keys=fields.Str(), values=fields.Int())

baz = {'inner': {'some-key': 'five'}}

try:
    SomeDictSchema().load(baz)
except ValidationError as e:
    print(e)
    # {'inner': defaultdict(<class 'dict'>, {'some-key': {'value': ['Not a valid integer.']}})}

try:
    SomeDictSchema().dump(baz)
except ValidationError as e:
    print(e)
    # {'inner': ['Not a valid integer.']}

The second error message suggests that inner should be an integer, when the fault is in the dictionary at inner.

rileyjohngibbs on 12 Feb 2019

👍1

For what it's worth, the defaultdict(...) in the repr is also not very helpful, and undesirable in an REST API context when it would be nice to just return the whole error message to the client.

rileyjohngibbs on 12 Feb 2019

👍1

List has the same issue. _serialize doesn't expect any errors when calling _serialize, but _deserialize does.

from marshmallow import Schema, ValidationError, fields


class Test(Schema):
    foo = fields.List(fields.Int)

baz = {'foo': ['five']}

try:
    Test().load(baz)
except ValidationError as e:
    print(e)
    # {'foo': {0: ['Not a valid integer.']}}

try:
    Test().dump(baz)
except ValidationError as e:
    print(e)
    # {'foo': ['Not a valid integer.']}

deckar01 on 12 Feb 2019

Tuple is also affected.

deckar01 on 15 Feb 2019

Is this only happening on dev/3.x? Or is it also happening in 2.x? @deckar01

sloria on 15 Feb 2019

I was able to reproduce the issue for List in 2.x-line. Dict doesn't have type constraints in 2.x-line and Tuple does not exist.

Do you think the fix be considered a breaking change if people are expecting list errors to be a string rather than a dictionary?

deckar01 on 15 Feb 2019

@deckar01 While I do think that this is a bug because it breaks users' expectations, it's possible that users may be relying on the current behavior. Let's be conservative and consider it a breaking change.

sloria on 15 Feb 2019

👍2

I'm curious about existing use cases for validating on serialization. We've since decided that validation should only happen on deserialization and that the input to dump is assumed valid. For this reason, we removed validation-on-dump from Email And URL fields. Given that the fix for this issue incurs a non-zero performance regression, I wondering if we should continue supporting dump validation at all.

@NilsBarlaug-cognite @rileyjohngibbs Can you speak about your use cases here?

sloria on 16 Jul 2019

I'm curious about existing use cases for validating on serialization. We've since decided that validation should only happen on deserialization and that the input to dump is assumed valid. For this reason, we removed validation-on-dump from Email And URL fields. Given that the fix for this issue incurs a non-zero performance regression, I wondering if we should continue supporting dump validation at all.

@NilsBarlaug-cognite @rileyjohngibbs Can you speak about your use cases here?

I think if the (clear) expectation is that dumped data is valid, then I support removing dump validation. I don't think we currently (explicitly) rely on dump to validate anything.

rileyjohngibbs on 16 Jul 2019

Currently almost all of the fields try to handle exceptions with validation failure on dump. Since the dump() errors aren't meant to be caught and handled, there probably is no need to normalize them into ValidationError instances. Should we get rid of the code that rewrites the errors in _serialize()? If we aren't going to pack the errors into the same data structure as load(), should we let the first error bubble to the top uninterrupted?

deckar01 on 16 Jul 2019

Should we get rid of the code that rewrites the errors in _serialize()? If we aren't going to pack the errors into the same data structure as load(), should we let the first error bubble to the top uninterrupted?

Yes, that's what I'm proposing.

sloria on 16 Jul 2019

👍2

Should we still try to call handle_error in _serialize so that the error can be customized per schema on dump?

deckar01 on 17 Jul 2019

Probably not? handle_error is meant for handling ValidationError, which we're proposing should only get raised on deserialization.

sloria on 17 Jul 2019

👍2

@NilsBarlaug-cognite @rileyjohngibbs Can you speak about your use cases here?

We have a library which helps users work with some specific JSON documents (which they might use locally or against our HTTP API). Basically letting them easily load, dump and modify these JSON documents in a Python native way. For improved user experience we validate the JSON documents both on load and dump. The user can create/modify a document in Python - so we have no guarantees that the data is valid when we dump it.

This is probably not the intended use of Marshmallow since you remove dump validation. Do you have any suggestion on how to achieve the same when you remove dump validation?

NilsBarlaug-cognite on 5 Aug 2019

Under the hood, validate calls the same logic as load, it just doesn't return the deserialized data.

deckar01 on 5 Aug 2019

👍1

@deckar01 As far as I understand, load accepts dicts of primitive types whereas dump would accept native Python objects. Does this mean that the validation feature would be limited to objects composed of primitive types only? How would one go about validating a native object?

ckanesan on 7 Aug 2019

@ckanesan you're right about validate taking serialized (primitive types) data as input.

Maybe you could call it after serialization, but it might fail on edge cases (e.g. asymmetrical dump/load keys, pre-post processings,...).

We can't really extract validation from the loading process because it is done in each field's deserialization method. It is not a two-step process (deserialization, then validation), except for schema validators, that occur at the end of the process.