Marshmallow: Feature request: Validate should accept deserialized values as valid for all fields

Created on 19 Sep 2019  Â·  22Comments  Â·  Source: marshmallow-code/marshmallow

One of our big motivations for using Marshmallow is to validate that the data created by our programs actually matches the schema of the data we want to transmit on the wire. There appears to be no way to do this with Marshmallow 3, since validation was removed from 'dump'. (The the other motivation is to validate received data, of course, but that is handled just fine.)

This has been discussed in three closed issues that I found, but no solution has been suggested using existing Marshmallow 3 features.

In #656 the case I'm immediately facing is discussed: if I pass an actual datetime object to validate, I get an error that it is not a valid datetime. This is because validate is only handling string input. A poster in that issue suggested that it would be good if validate accepted the deserialized values for all fields as valid. lafrech responded that if someone was interested in this as a feature to open a new issue with a feature request. This is that issue.

As additional context, in issue #1372 we are told that "Validation is only meant to be applied to serialized data, during a load", and that "Marshmallow does not validate object data before dump". The porting to Marshmallow 3 docs repeat the second part of that, but don't mention the first part. Further, the porting docs say "You must validate your data before serializing it". However, you can't validate the data before dumping it if it isn't strings. So that's a bit of a conundrum. Further, you can't dump it first, because of the reason the porting docs say you must validate first: if you try to dump invalid data you will get various exceptions (for example, if you mistakenly set a fields.DateTime field to the string version of a datetime, on dump you will get an attribute error on 'isoformat').

The third closed issue on this topic that I found is #1393, which was closed with "Validating deserialized objects isn't supported out-of-box. Serializing your object before passing it to schema.validate or schema.load is your best option." But, as discussed above, this is not a viable solution since you will get exceptions on certain kinds of invalid data when serializing, which makes it difficult to figure out where the actual problem is (which field's value is causing the problem).

It seems to me that it is obviously desirable to validate data before putting it on the wire. The programmer is going to make mistakes, some of which won't be found until runtime.

So, in summary, I think the ability to validate deserialized data is a major missing feature of Marshmallow 3. It doesn't look like we'll be able to port our project to Marshmallow 3 until this feature is available.

(Aside: I don't think the validation we are using with Marshmallow 2 is actually doing all of the job we thought it was doing, but that is a different problem.)

enhancement feedback welcome

All 22 comments

Thanks for the detailed report. Seems like there's two main asks here:

  1. Fields should accept data that is in the deserialized type, e.g. datetime -load-> datetime
  2. load and validate should be able to accept objects (not just dicts)

1 is reasonable and is relatively straightforward to implement.

2 is trickier due to load semantics. In particular, what would unknown validation look like? I'm not opposed to supporting this, but TBH I haven't had the time to give it proper thought.

Thoughts and opinions here are welcome.

(1) is all we need. We build dicts of data (json, essentially) and want to pass them to Schema.validate. I don't actually know what it would mean for load and validate to accept objects?

(1) is all we need. We build dicts of data (json, essentially) and want to pass them to Schema.validate.

Oh ok. Seems reasonable.

For now, can you workaround this with custom fields?

h/t to @atmo for this in marshmallow-jam

from datetime import datetime, time, date, timedelta

from marshmallow.fields import DateTime, Time, Date, TimeDelta


class WithIdentityMixin:
    identity_type: type

    def _deserialize(self, value, attr, data, **kwargs):
        if isinstance(value, self.identity_type):
            return value
        return super()._deserialize(value, attr, data, **kwargs)


class DateTimeIdentity(WithIdentityMixin, DateTime):
    identity_type = datetime


class TimeIdentity(WithIdentityMixin, Time):
    identity_type = time


class DateIdentity(WithIdentityMixin, Date):
    identity_type = date


class TimeDeltaIdentity(WithIdentityMixin, TimeDelta):
    identity_type = timedelta

I don't actually know what it would mean for load and validate to accept objects?

I mean that load and validate would be able to take complex objects, e.g. ORM object instances, as input., rather than only accepting dictionaries.

umongo does that for some fields when needed (typically datetime) but not in this nice generic fashion.

https://github.com/Scille/umongo/blob/master/umongo/fields.py#L180

I guess it has to be different with Nested since the type depends on the nested object. But it should be achievable since the object type is known.

@lafrech Wdyt about changing this in marshmallow?

I haven't been giving it much thought but I don't see any objection right away. Would that be optional ? Opt-in ? Opt-out ?

I don't see any downside. I mean from a json API perspective, it wouldn't make the API more permissive, as inputs won't come as objects anyway.

(umongo hasen't been ported to MA3, yet...)

I don't think we should add a flag for it--unnecessary API surface.

I also don't see much downside.

load and validate should be able to accept objects (not just dicts)

Validate currently occurs before any @post_load hooks, which is why it only has to deal with dicts. There are lots of ways you could transform the keys/values in @post_load that would violate the expectations of validate(), but I think since this use case doesn't need to load, their schema probably doesn't have any load hooks either.

What about exposing a Schema._validate(obj) method and using it inside Schema.load(data) after Schema._do_load(data)? Allowing validation to recurse over the data separately should also fix #1190.

@deckar01 I'm good with that proposal.

What do you think of the proposal above to modify fields to accept data in their deserialized type?

Loading rich python objects seems like a significant departure from marshmallow's existing functionality. I think requiring custom fields to pass through rich values is fine.

I'm not convinced our current workaround of "dump so you can load to get validation" won't work for this use case though. @bitdancer What do you mean by "you can't dump it first"? You should always be able to dump first, but it might error. If this issue is about "programmer is going to make mistakes", letting an unexpected error bubble to the top seems like reasonable behavior.

That said, it would be nice to avoid the extra load call and knock out a bug in the process.

Loading rich python objects seems like a significant departure from marshmallow's existing functionality.

Perhaps this is only relevant for the time-related fields like DateTime. I imagine are cases where e.g. ISO8601-formatted strings get decoded to datetimes before reaching marshmallow, like in @lafrech's umongo case.

In umongo, marshmallow is used in the model to

  • serialize/deserialized to/from MongoDB
  • validate data passed to the object (overloading setattr and all)

The second case is the one where the feature is needed and all fields are concerned, not specifically DateTime. It is not needed for float, int, etc, because loading is trivial, but although this is true for most of the fields, I'd consider that the exception, and fields like DateTime the rule _a priori_.

So, umongo is doing a second deserialization pass after mongo finishes parsing it from BSON. It probably makes sense for that use case to override the field types that come in rich and treat the data as if it has not been loaded yet. That seems quite a bit different than the use case in the OP where the rich object is being instantiated for the first time and wants to use validation to check for implementation errors in the instantiation logic.

I may be misunderstanding the OP. I thought they were only dealing with dicts, not instantiated objects.

(1) is all we need. We build dicts of data (json, essentially) and want to pass them to Schema.validate.

@bitdancer Can you clarify your use case?

So, umongo is doing a second deserialization pass after mongo finishes parsing it from BSON.

Hmm, not exactly.

A umongo Schema can either load from BSON, or from "object world".

Loading from BSON takes place when reading in DB. Loading from object world takes place at object instantiation (schema load) or on setattr (field load).

Like marshmallow-mongoengine, umongo can export plain marshmallow schemas to use as API interface. When processing a request, the data is first validated in the API layer by the generated marshmallow schema, then validated once again when creating the object. This might not be optimal, but oh well.

Ok, I think I was missing the point of that umongo tangent. umongo uses the custom fields pattern and it works for them even for use cases very similar to the OP, right?

The only down side I see to baking this into the core fields is that it would add type checking overhead that rarely is needed.

The only down side I see to baking this into the core fields is that it would add type checking overhead that rarely is needed.

Yeah, but this seems a common enough use case to justify it. I'd be more worried if we were adding type checks to the serialization path. In typical deserialization use cases (i.e. validating JSON HTTP payloads), this level of overhead is less critical. wdyt?

@deckar01: the problem isn't that the programmer might have made an error in constructing an object. We're building an event driven system and using Marhsmallow for event construction and validation. Essentially, the Marshmallow/json are our objects at the event layer. We're not doing any sort of json-to-object conversion in many cases, just using the json directly. So consider the case where we receive an event, manipulate the (deserialized) data, then construct a new event to send out...during that construction there may be bugs, and those bugs may be data driven (not seen until some unusual piece of input data is received). So, we want to validate that the data we've constructed matches the output event schema before we send it on the wire. We could add validation code to our application to make sure the data was valid before dumping it with Marshmallow...but validating json objects is why we started to use Marshmallow in the first place :) Using it for event serialization actually came second in our development process, IIRC.

As for letting the errors from dump bubble up...here is an example traceback from one of our unit tests if we dump first before validating:

Traceback (most recent call last):
  File "/home/rdmurray/con/dev/conlib/events/test/test_events.py", line 202, in test_invalid_metadata
    event.validate()
  File "/home/rdmurray/con/dev/conlib/events/conlib/events.py", line 143, in validate
    data = schema.dump(self.metadata)
  File "/home/rdmurray/con/dev/conlib/events/venv/lib/python3.6/site-packages/marshmallow/schema.py", line 553, in dump
    result = self._serialize(processed_obj, many=many)
  File "/home/rdmurray/con/dev/conlib/events/venv/lib/python3.6/site-packages/marshmallow/schema.py", line 517, in _serialize
    value = field_obj.serialize(attr_name, obj, accessor=self.get_attribute)
  File "/home/rdmurray/con/dev/conlib/events/venv/lib/python3.6/site-packages/marshmallow/fields.py", line 325, in serialize
    return self._serialize(value, attr, obj, **kwargs)
  File "/home/rdmurray/con/dev/conlib/events/venv/lib/python3.6/site-packages/marshmallow/fields.py", line 1189, in _serialize
    return format_func(value)
  File "/home/rdmurray/con/dev/conlib/events/venv/lib/python3.6/site-packages/marshmallow/utils.py", line 186, in isoformat
    return datetime.isoformat()
AttributeError: 'str' object has no attribute 'isoformat'

Now, from the fact that it is looking for isoformat I can tell it is a datetime field that is the problem, but I have no clue which field. So, yes, we are trading some (small, I think, given that we really do need to validate before putting anything on the wire) runtime validation overhead for clarity of error messages here, but we judge that to be a good tradeoff given that when diagnosing a data driven runtime validation problem all we have is logs that we can look at after the fact.

We could, I suppose, have our validation function log the full json data if dump raises an error. That still doesn't identify which field had the problem, but would in most cases probably allow us to reproduce the problem after a bit of cutting and pasting and building a test. And yes, errors that trigger dump to raise are considerably less likely to be data-driven than logic driven. Frankly, though, accepting the above traceback as "good enough" feels sloppy to me, like not-quite-fully-baked code. And I say that even though as a CPython developer I've been known to reject adding data type checking to standard library functions in favor of just letting the errors bubble up. I think that the difference here for me is that in this case the purpose of the library from my POV is type checking :).

If Marshmallow weren't inclined to provide this feature I suppose that is a way we could go. WithIdentityMixin does work though (thank you for that!) so at least for now that's the way we are going.

I would lean towards providing the mixin rather than modifying core fields. This is a functionality that I would not expect to be enabled by default and might surprise me if I accidentally passed in an already deserialized object.

The problem with a mixin is that DateTime, etc, are core types, meaning uses of the library would need to subclass these into custom types and then make sure they used those and not the core types. Possible, yes, but prone to confusion, especially when working on a complex system involving many programmers. It also becomes an issue if you start out with multiple pieces of the system using the standard types, and then at some point in the project's development decide you need deserialization to pass through already deserialized objects. Now you either have to monkey patch Marshmallow or fix the imports in all those existing parts of the system. Again, that is doable. Just inconvenient :) Which is not necessarily enough of an argument.

It is clearly necessary to somehow indicate which objects are to be passed through as opposed to deserialized. It would be nice if this were part of the standard API. Would the overhead of a default-False Meta flag to determine if deserialized objects should be allowed through deserialization at all and a default-False hook that deserialization would call that reports whether the object is already deserialized be too much overhead?

For future people who stumble upon this issue, a workaround for the time being is to use marshmallow.fields.Raw():

An example (using the desert add-on for marshmallow-dataclass):

import desert
import marshmallow

from dataclasses import dataclass
from datetime import datetime
from bson import ObjectId

@dataclass
class A:
  d: datetime = desert.field(marshmallow.fields.Raw())
  b: ObjectId = desert.field(marshmallow.fields.Raw())
>>> s = desert.schema(A)

>>> a = s.load({"d": datetime.now(), "b": ObjectId.from_datetime(datetime.now())});

>>> a
A(d=datetime.datetime(2020, 8, 10, 14, 6, 30, 211926), b=ObjectId('5f3154660000000000000000'))

Hope this helps.

Related.

This is achieved in umongo by accepting deserialized value explicitly in each field's _deserialize method. Example for DateField:

 def _deserialize(self, value, attr, data): 
     if isinstance(value, dt.date): 
         return value 
     return super()._deserialize(value, attr, data) 

There is an issue here if we rely on this to validate the input type because we let subclasses pass in. For instance, in this example, we may expect the value to be a date because we think marshmallow validates that, but we also accept datetimes. A datetime is a date subclass so it may make sense, but the serialization will differ so it can lead to surprises.

https://github.com/Scille/umongo/issues/227

We could cast explicitly to the expected class but that sucks, if only for the perf impact. Or we assure the user knows what he's doing and we let it go but then we don't really validate the data since we don't ensure it is of the right type.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

DenisKuplyakov picture DenisKuplyakov  Â·  4Comments

lassandroan picture lassandroan  Â·  3Comments

zohuchneg picture zohuchneg  Â·  3Comments

agatheblues picture agatheblues  Â·  3Comments

ambye85 picture ambye85  Â·  4Comments