First, thanks for this awesome library. It's great!
TL;DR - Would you consider accepting a PR to optionally validate data when calling dict() or json()?
More context:
One of my main concerns about pydantic is that we might export invalid data if we are not careful when creating the instances.
Our workflow often follows this pattern:
Ideally we would love to validate before exporting, to make sure we export valid data. Let's say our code contains a bug that introduces an invalid element; I would like to detect it at some point, at least before we export it with dict().
I have been able to implement this in several ways, and I have a solution that is quite good in terms of performance. I would be willing to open a PR if you think this is an interesting feature.
BTW I'm not sure whether this is already implemented and I just missed the feature - but I doubt it because I don't see any logic for this within dict() / _iter() / _get_value methods.
I found that we could use validate_assignment, however this is not "perfect": if you change a mutable element it's not going to be validated (e.g. you could add a new element to an existing list that is not compliant with the schema). Also, the validate_assignment performance is far from good for our use cases.
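For example, a minimal sketch of that gap (the Order model and its conint constraint are invented for illustration, pydantic v1 API):

from typing import List
from pydantic import BaseModel, ValidationError, conint

class Order(BaseModel):
    quantities: List[conint(gt=0)]

    class Config:
        validate_assignment = True

order = Order(quantities=[1, 2])
try:
    order.quantities = [1, -1]   # re-assignment is validated and raises
except ValidationError:
    pass
order.quantities.append(-1)      # in-place mutation is not validated: no error here
print(order.dict())              # exports {'quantities': [1, 2, -1]} despite the constraint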
I realize that having mypy on a project might make this somewhat unnecessary (regarding type safety), but we would still miss full model validation anyway.
import pydantic; print(pydantic.VERSION): 1.2

I'm amenable to this idea as a concept, but I would much prefer to support it through new functions if possible, rather than additional arguments to dict/json. Those functions already have a large number of logic forks, and each additional check will only slow them down further. And they are potentially hot-loop functions if dealing with large lists or dicts as fields, for example.
One thing to consider: validators can perform modifications to input data that are not idempotent. What would you want to happen if the results were re-validated with a validator that modifies the value?
For what it's worth, I think if you can address this issue using mypy, I would strongly recommend that instead. Runtime checks cost a lot (as you've noticed with validate_assignment), and won't help address the underlying issue.
Also, as @samuelcolvin has stated on many occasions, pydantic was designed around the idea of eager validation, rather than lazy validation. Given performance is such a priority for pydantic I think it makes sense to prioritize the eager validation and have the user take responsibility for ensuring models remain valid. It seems to me that for most use cases, this could be checked by mypy, or at least the model could be refactored so it could be checked by mypy.
If you have a case you think differs on this point, it would help the discussion if you could share some code / detail.
Hi @javiertejero, interesting idea.
I'm inclined to say "no"; I think it's best that you create your own CustomBaseModel, with something like:
class CustomBaseModel(BaseModel):
    def dict_validate(self, *args, **kwargs):
        m2 = self.__class__(**dict(self))
        return m2.dict(*args, **kwargs)
(or something, there are a number of ways this could be altered depending on your exact context).
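For illustration, a hypothetical use of that workaround (the Item model and its conint constraint are invented; it assumes the CustomBaseModel snippet above and pydantic v1):

from pydantic import ValidationError, conint

class Item(CustomBaseModel):
    quantity: conint(gt=0)

item = Item(quantity=1)
item.__dict__['quantity'] = -2     # simulate an invalid state introduced by a bug
try:
    item.dict_validate()           # re-construction re-validates and raises here
except ValidationError as exc:
    print(exc)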
Reasons:
However I am interested in how pydantic might be able to help with serialisation.
I was looking at django-rest-framework serializers this morning and wondering what the equivalent would be in pydantic land.
See #935 #951 and #317
Thanks @samuelcolvin and @dmontagu
First, thanks for your answers. I wasn't looking for a proposal, I already implemented it, I just wanted to know if you were interested. Well, to avoid more misunderstandings, this is the solution I have in place: https://github.com/travelperk/pydantic/commit/e10f7bef7433120c461cf938cc386da06462cfce
It's the bare basics of the feature I need; the Config option could be replaced by a parameter (but I guess for now it's enough).
I think the biggest benefit is for projects where we generate a schema that is consumed by an external service. Pydantic is great for parsing and consuming data, but without this functionality there is some risk of generating data that does not comply with the schema.
As you can see, the solution is literally +4 lines of code, and if the validation is disabled there is no performance penalty (the cost is almost negligible for relatively big schemas, and it's definitely not a bottleneck).
@samuelcolvin The problem with the solution you propose above is that it does not validate nested models, due to the recursive nature of dict(). That's why the proper solution requires overriding .dict().
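A sketch of that nested-model gap (the Inner/Outer models are invented for illustration, pydantic v1): the shallow re-construction accepts the existing nested instance without re-running its validators, while going through dict() catches the problem.

from pydantic import BaseModel, ValidationError, conint

class Inner(BaseModel):
    n: conint(gt=0)

class Outer(BaseModel):
    inner: Inner

outer = Outer(inner=Inner(n=1))
outer.inner.__dict__['n'] = -1    # corrupt the nested model in place

Outer(**dict(outer))              # passes silently: the existing Inner instance is not re-validated

try:
    Outer(**outer.dict())         # dict() recurses, so the nested data is re-validated
except ValidationError as exc:
    print(exc)                    # reports that n must be greater than 0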
What I did would be semantically equivalent to:
def validated_dict(self):
    return self.__class__(**self.dict()).dict()
However, the performance is much better if I keep it inside dict(): validated_dict validates once but runs .dict() twice, while my solution overriding dict() only adds the validation.
Although I forked the project, this is what I actually ended up doing for now, to avoid compiling to cython myself (this method does not benefit that much from cython compilation apparently, so it is fine for me to keep it in the custom base model):
class MyCustomBaseModel(BaseModel):
    def dict(
        self,
        *,
        include: Union['AbstractSetIntStr', 'DictIntStrAny'] = None,
        exclude: Union['AbstractSetIntStr', 'DictIntStrAny'] = None,
        by_alias: bool = False,
        skip_defaults: bool = None,
        exclude_unset: bool = False,
        exclude_defaults: bool = False,
        exclude_none: bool = False,
    ) -> 'DictStrAny':
        """
        Generate a dictionary representation of the model, optionally specifying which fields to include or exclude.
        """
        if skip_defaults is not None:
            warnings.warn(
                f'{self.__class__.__name__}.dict(): "skip_defaults" is deprecated and replaced by "exclude_unset"',
                DeprecationWarning,
            )
            exclude_unset = skip_defaults
        get_key = self._get_key_factory(by_alias)
        get_key = partial(get_key, self.__fields__)
        if getattr(self.Config, 'validate_on_export', False):
            target = self.__class__(**self.__dict__)  # creates a copy, validates it and uses it
        else:
            target = self
        allowed_keys = target._calculate_keys(include=include, exclude=exclude, exclude_unset=exclude_unset)
        return {
            get_key(k): v
            for k, v in target._iter(
                to_dict=True,
                by_alias=by_alias,
                allowed_keys=allowed_keys,
                include=include,
                exclude=exclude,
                exclude_unset=exclude_unset,
                exclude_defaults=exclude_defaults,
                exclude_none=exclude_none,
            )
        }
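Hypothetical usage of the override above (the Payment model and its conint constraint are invented for illustration; it assumes the MyCustomBaseModel class just shown and pydantic v1):

from pydantic import ValidationError, conint

class Payment(MyCustomBaseModel):
    amount: conint(gt=0)

    class Config:
        validate_on_export = True

p = Payment(amount=10)
p.__dict__['amount'] = -1          # simulate a bug that corrupts the instance
try:
    p.dict()                       # the copy is re-validated on export, so this raises
except ValidationError as exc:
    print(exc)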
If you happen to change your minds @samuelcolvin I would be happy to contribute and try to port the tests I have prepared for my own codebase to cover this functionality. Otherwise people can just use the snippet above, so I feel I already contributed a bit to this great project :)
@dmontagu: Regarding mypy: checking the property types is not enough; the type might be valid but the value itself might be invalid (e.g. it exceeds a maximum, or a password does not match).
Also, a side benefit of validate-on-export is that values get serialized with the proper types, which can be very useful to ensure serialization to JSON is done in the right format (e.g. we have some custom types to serialize Decimals with a given number of decimal places).
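A hedged sketch of what such a custom Decimal type could look like in pydantic v1 (the two-decimal-places rule and all names here are invented, not the author's actual implementation):

from decimal import Decimal
from pydantic import BaseModel

class TwoPlacesDecimal(Decimal):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v):
        # coerce and round to exactly two decimal places
        return cls(Decimal(v).quantize(Decimal('0.01')))

class Price(BaseModel):
    amount: TwoPlacesDecimal

    class Config:
        json_encoders = {TwoPlacesDecimal: str}  # serialize as a fixed-precision string

print(Price(amount='19.9').json())  # {"amount": "19.90"}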
Cheers.
@javiertejero Your point about mypy makes sense.
With regard to performance, I'm not sure the line indicated below is as cheap as you think it is:
if getattr(self.Config, 'validate_on_export', False):  # <--- this line
    target = self.__class__(**self.__dict__)  # creates a copy, validates it and uses it
else:
    target = self
While this won't matter in many cases, when serializing a long list of BaseModel instances, I think the effect would be macroscopic (I would guess to the tune of ~1%, though if an appropriate benchmark showed it was much smaller I would change my mind). We've been consistently aggressive about minor optimizations like this everywhere possible (especially in "hot" functions like this one) and the net effect is substantial.
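A rough way to estimate that per-call cost (a sketch, not a benchmark from this thread; absolute numbers depend on the machine):

import timeit

class Config:
    pass  # attribute absent, so getattr falls back to the default, as when the feature is off

n = 1_000_000
total = timeit.timeit("getattr(Config, 'validate_on_export', False)",
                      globals={'Config': Config}, number=n)
print(f"~{total / n * 1e9:.0f} ns per getattr() check")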
I think I have three main concerns with regard to this feature:
Moreover, I personally think it is outside the scope of pydantic to ensure your model's invariants are maintained in your own code.
That said, I think your implementation is good, and I understand the value this feature provides, so I think I could be convinced to accept it. But I would prefer to see broader interest in the feature first, as I think it adds a non-trivial maintenance burden (an additional fork to test, dependence on re-parsing-behaviors, dependence on the contents of __dict__, etc., making it that much harder to refactor things in a major version update).
re-opened to discuss, but I too am inclined to refuse it.
Python is not a strictly typed language; static analysis mitigates some of the risk associated with that, and pydantic helps with other things, but if you really need strict typing throughout your application, Python probably isn't the right language for you.
Also, explicit is better than implicit; given how little work it is to type out another step of model validation, I would be inclined to recommend you call MyModel(**dict(old_model)) manually.
Python is not a strictly typed language; static analysis mitigates some of the risk associated with that, and pydantic helps with other things, but if you really need strict typing throughout your application, Python probably isn't the right language for you.
Well, in this case, it sounds like the concern goes beyond types to invariants that most/all languages wouldn't provide at the type-system level. Not to disagree with anything though.
Also, explicit is better than implicit; given how little work it is to type out another step of model validation, I would be inclined to recommend you call MyModel(**dict(old_model)) manually.
@javiertejero did point out (correctly, as far as I can tell) that this approach has a substantial performance penalty relative to the proposed approach (when making use of the feature).
@dmontagu regarding your comment:
One thing to consider: validators can perform modifications to input data that are not idempotent. What would you want to happen if the results were re-validated with a validator that modifies the value?
I didn't consider this initially. I did a test and it's really bad for the data if the modifications are not idempotent, e.g.:
def test_validator_changing_immutable_values():
    class ValidatorModifyValues(MyCustomBaseModel):
        foo: str
        bar: list

        @validator('*')
        def should_be_shorter_than_3(cls, v):
            return v[:3]

        @validator('bar')
        def pop_a_value(cls, v):
            if len(v):
                v.pop()  # modifies the mutable input list
            return v

    data = {'foo': 'abcdedf', 'bar': [1, 2, 3, 4, 5]}
    vmv = ValidatorModifyValues(**data)
    assert data == {'foo': 'abcdedf', 'bar': [1, 2, 3, 4]}
    assert vmv.dict() == {'foo': 'abc', 'bar': [1, 2]}
    assert vmv.dict() == {'foo': 'abc', 'bar': [1]}
    assert vmv.dict() == {'foo': 'abc', 'bar': []}
    assert vmv.dict() == {'foo': 'abc', 'bar': []}
Having said that, I guess it is bad design/practice for a validator to modify the original value instead of working on a copy, or at least the transformations should be idempotent. So for me this wouldn't be a blocker for the feature.
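For contrast, a validator in the same spirit that copies its input and is idempotent under re-validation (illustrative only, not part of the author's tests):

from pydantic import BaseModel, validator

class SafeModel(BaseModel):
    bar: list

    @validator('bar')
    def keep_first_three(cls, v):
        # returns a new list; applying it twice gives the same result as applying it once
        return list(v)[:3]

m = SafeModel(bar=[1, 2, 3, 4, 5])
assert SafeModel(**m.dict()).bar == m.bar == [1, 2, 3]   # stable under re-validation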
Regarding the fact that Config options are a burden to maintain: I agree, and that's why I used getattr, to reduce that burden.
I also agree the performance penalty is something that might be of concern if your priority is top performance. I benchmarked with this code and 1M objects:
def test_benchmark_base_model():
    BaseModel.Config.validate_on_export = False

    class Foo(BaseModel):
        i: int

    class Bar(BaseModel):
        foos: List[Foo]

    size = 1000000
    bar = Bar(foos=[Foo(i=x) for x in range(size)])
    start = time.perf_counter()
    bar.dict()
    dict_time = int((time.perf_counter() - start) * 1000)
    print(f"dict() time: {dict_time} ms")
The results:
For my current use case I don't worry that much about the performance of dict(), because our bottleneck is elsewhere and we handle a lower order of magnitude of objects (so ~30 ms is negligible for us). However, I understand it's not a good fit for pydantic at the moment, unless the community shows interest in this feature more broadly.
Well, in this case, it sounds like the concern goes beyond types to invariants that most/all languages wouldn't provide at the type-system level.
Right, it's not only about types. Also, Python is our language of choice (for many reasons), and my requirement is to ensure that an HTTP response honors the agreed schema. So it's not about the language at all, but about API contracts.
I also think this feature is especially interesting during tests; it might be disabled in production in some services if we are sure the consumers are able to verify the responses later. In our case I would enable it in production for services that send responses directly to the browser consumer (where we use a different technology that currently assumes the schema is always honored).