Pydantic: Serialize as a specified model / field type

Created on 15 Sep 2019 · 11 Comments · Source: samuelcolvin/pydantic

Feature Request

I would like to be able to serialize/dictify models in a way that ensures the result complies with a specified model schema, even if some of the attributes, or even the model itself, are subclasses of the desired result.

I wouldn't necessarily want to get rid of the current behavior of including all fields, but it would be nice if there was either a boolean flag or a separate function that could be called to ensure the serialized/unstructured output matched the field's schema.

More detail:

Right now, both dict- and json-generation happen in a "recursive" way that causes models to potentially include extra fields. Consider the following example:

from pydantic import BaseModel

class ModelA(BaseModel):
    integer: int

class ModelB(ModelA):
    string: str

class ModelC(BaseModel):
    a: ModelA

model_b = ModelB(integer=1, string="b")
model_c = ModelC(a=model_b)
print(model_c.dict())  # desired output: {'a': {'integer': 1}}
# {'a': {'integer': 1, 'string': 'b'}}

print(model_c.json())  # desired output: {"a": {"integer": 1}}
# {"a": {"integer": 1, "string": "b"}}

It is definitely useful (for both performance and non-performance related reasons) to allow subclass instances to pass validation, but in a context where the goal is to return json that matches the model schema, this can be problematic.

FastAPI has dealt with this by creating a "clone" of the desired response field for which none of the fields / subfields are part of the same class hierarchy. Then, when validating an endpoint response, it is essentially unstructured and reparsed into the "cloned" field, which ensures it only has the fields as described in the model schema. This works, but:

1. It adds a great deal of overhead to the model serialization process, as it means known-to-be-valid models/field values go through three unnecessary steps during serialization: conversion to unstructured format, instantiation of the cloned type(s), and re-parsing of the unstructured data into the cloned type(s).
2. It makes use of a lot of infrastructure, so that one-off serialization/unstructuring-with-a-desired-schema is not really feasible to perform on demand.
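
To make the overhead concrete, here is a minimal sketch of the dump-and-reparse round trip using only standard-library dataclasses in place of pydantic models. The names `parse` and `round_trip_dump` are hypothetical and are not FastAPI's or pydantic's actual code; the sketch only illustrates the three steps listed above.

```python
from dataclasses import asdict, dataclass, fields, is_dataclass
from typing import Any, List, get_args, get_origin, get_type_hints


@dataclass
class UserAPI:
    username: str
    email: str


@dataclass
class UserInDatabase(UserAPI):
    hashed_password: str


@dataclass
class UserCollection:
    users: List[UserAPI]


def parse(type_: Any, data: Any) -> Any:
    """Rebuild `data` as `type_`, dropping keys that `type_` does not declare."""
    if is_dataclass(type_):
        hints = get_type_hints(type_)
        return type_(**{f.name: parse(hints[f.name], data[f.name]) for f in fields(type_)})
    if get_origin(type_) is list:
        (item_type,) = get_args(type_)
        return [parse(item_type, item) for item in data]
    return data


def round_trip_dump(obj: Any, type_: Any) -> Any:
    unstructured = asdict(obj)             # step 1: convert to plain dicts and lists
    reparsed = parse(type_, unstructured)  # steps 2 and 3: instantiate and re-parse as the declared type
    return asdict(reparsed)                # second dump, now limited to declared fields


collection = UserCollection(users=[UserInDatabase("a", "b", "c")])
print(asdict(collection))
# {'users': [{'username': 'a', 'email': 'b', 'hashed_password': 'c'}]}
print(round_trip_dump(collection, UserCollection))
# {'users': [{'username': 'a', 'email': 'b'}]}
```

Every value is copied twice and every model is constructed a second time, even when the input was already an exact instance of the declared type.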

Proposed implementation:

I propose the addition of a function dump_as_type with the following signature:

def dump_as_type(obj: T, type_: Type[T]) -> Any: ...

This function will take a model, list of models, etc., and a desired type to use while dumping the model. The result will be as though a strict instance of type_ was dumped, removing any fields present in subclasses, and ignoring any field-type overrides.

For more concrete examples, see (the tests in) #812, which also includes a related function parse_as_type.
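
As a rough illustration of the idea (not the implementation in #812), the sketch below uses standard-library dataclasses as stand-ins for pydantic models. It is an assumption-laden toy: it walks the fields declared on `type_` rather than the fields of the runtime object, handles nested models only, and does not coerce overridden field types.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, get_type_hints


@dataclass
class ModelA:
    integer: int


@dataclass
class ModelB(ModelA):
    string: str


@dataclass
class ModelC:
    a: ModelA


def dump_as_type(obj: Any, type_: Any) -> Any:
    """Dump `obj` using only the fields that `type_` declares (toy sketch)."""
    if is_dataclass(type_):
        hints = get_type_hints(type_)
        # Iterate over the declared fields of `type_`, not over type(obj).
        return {
            f.name: dump_as_type(getattr(obj, f.name), hints[f.name])
            for f in fields(type_)
        }
    return obj  # plain values pass through unchanged


model_c = ModelC(a=ModelB(integer=1, string="b"))
print(dump_as_type(model_c, ModelC))
# {'a': {'integer': 1}}
print(dump_as_type(ModelB(integer=1, string="b"), ModelA))
# {'integer': 1}
```

Because the walk is driven by the declared type, no intermediate dict or second model instance is needed.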

feature request

dmontagu

👍2

Most helpful comment

@samuelcolvin Here's a more tangible example that might better convey the purpose:

from typing import Sequence
from pydantic import BaseModel, dump_as_type

class UserAPI(BaseModel):
    username: str
    email: str

class UserCollection(BaseModel):
    users: Sequence[UserAPI]

class UserInDatabase(UserAPI):
    hashed_password: str

user_in_db = UserInDatabase(username="a", email="b", hashed_password="c")

# Imagine you are returning the following object from an endpoint, and want to serialize it:
collection = UserCollection(users=[user_in_db])

# The idea is that the process producing the collection of users
# generated a `Sequence[UserInDatabase]`,
# but even with strict type checking and plugins this would still type-check
# as a `Sequence[UserAPI]` due to the subclass relationship
# (And it would pass parsing for a `List` at runtime anyway)

# bad: leaks hashed password
print(collection.dict())
# {'users': [{'username': 'a', 'email': 'b', 'hashed_password': 'c'}]}

# good: doesn't include extra fields present only on the subclass
print(dump_as_type(collection, UserCollection))
# {'users': [{'username': 'a', 'email': 'b'}]}

# This only gets harder to accomplish when you want to work with something that isn't a BaseModel
# e.g., if the goal was to just return a List[UserAPI]
# (this is now handled by dump_as_type)

The problem is that right now, if I have a complex process generating a model to return from a web api endpoint, I have to be careful that I'm not returning a subclass with sensitive data anywhere in the process, or else I could be leaking information when I serialize the response.

Also, even if the information is not sensitive, a client SDK may raise errors if it receives extra fields (e.g., those generated by openapi-generator based on an OpenAPI spec typically will, in my experience), and as the above example shows, this can happen pretty easily when working with nested models.

FastAPI allows you to specify a response_model for an endpoint, and it does currently guarantee that the response that is returned from the endpoint will match the schema for that model.

Unfortunately, due to the way pydantic currently handles model parsing (where subclasses are allowed, as shown in the example above), a rather large amount of infrastructure has been created in fastapi to make sure no extra data is leaked. FastAPI currently takes whatever you return from your endpoint function, dumps it to a dict, and reparses it into a new field copied from the old one in such a way that there is no shared class hierarchy (so that pydantic won't shortcut any of the parsing steps due to a subclass relationship).

In particular, even if you return an instance of exactly the documented return type, fastapi has to do this dump and reparse to ensure no unwanted fields are returned. There are other ways that this could be handled, but all that I've considered either increase the chance that the response (accidentally) violates the documented schema, or come with substantial performance sacrifices. I also think if there was a good way to do this that didn't involve the heavy serialization overhead, @tiangolo would have thought of it 😄. #812 feels to me like the "right" way to accomplish this feature without making sacrifices.

For reference, this is the function used to create the cloned field:
https://github.com/tiangolo/fastapi/blob/580cf8f4e2aac3d4f298fbb3ca1426f9ea6265de/fastapi/utils.py#L54

I think it would be nice if this extra source of complexity could be removed (not to mention the 2x performance improvement on serialization).

dmontagu on 16 Sep 2019

🚀1 😄1 👍1

All 11 comments

PR generally looks good, but I'm still a little confused about the usage (maybe I'm still not awake properly).

Let's say we have (from your above example),

from pydantic import BaseModel

class ModelA(BaseModel):
    integer: int

class ModelB(ModelA):
    string: str

class ModelC(BaseModel):
    a: ModelA

model_b = ModelB(integer=1, string="b")

is the point here to do:

model_c = ModelC(a=model_b.dict(as_type=ModelA))

Or are there other applications/problems to solve?

samuelcolvin on 16 Sep 2019

> is the point here to do:
> model_c = ModelC(a=model_b.dict(as_type=ModelA))
> ?

The point here is that (likely somewhere else in your code), a ModelC was generated:

model_c = ModelC(a=model_b)

and then you need to "safely" serialize it:

# return model_c.dict()  # bad: includes subclass-only fields (e.g. a hashed_password), but you might not realize it
return dump_as_type(model_c, ModelC)  # safe

It may seem like include or exclude can save the day here, and in a simple example like this it could without too much effort. But I'm not sure how well this generalizes to more complex models, e.g. something like:

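from typing import Dict, List, Tuple

# UserID and RegistrationTime are assumed to be type aliases defined elsewhere
class AppData(BaseModel):
    items: List[str]
    users: Dict[UserID, Tuple[UserAPI, RegistrationTime]]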

If there is currently a way to serialize an instance of AppData (as above) with UserInDatabase instances in the first element of the tuple that 1) doesn't include hashed_password in the serialized value, and 2) could be extended to work for arbitrary model types without special casing (so it could be done by a framework, *cough fastapi cough*), then I think this PR may be redundant.

I'm just not aware of such a pattern.
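
For what it's worth, a type-driven walk is one way such nested containers could be handled without special casing. The sketch below is an assumption, not the code in the PR: it uses standard-library dataclasses instead of pydantic models, `int` and `float` stand in for `UserID` and `RegistrationTime`, and only `List`, `Dict`, and fixed-length `Tuple` annotations are supported.

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Dict, List, Tuple, get_args, get_origin, get_type_hints


@dataclass
class UserAPI:
    username: str
    email: str


@dataclass
class UserInDatabase(UserAPI):
    hashed_password: str


@dataclass
class AppData:
    items: List[str]
    users: Dict[int, Tuple[UserAPI, float]]  # int/float are stand-ins for UserID/RegistrationTime


def dump_as_type(obj: Any, type_: Any) -> Any:
    """Recursively dump `obj`, guided by the annotation `type_` (toy sketch)."""
    origin, args = get_origin(type_), get_args(type_)
    if is_dataclass(type_):
        hints = get_type_hints(type_)
        return {f.name: dump_as_type(getattr(obj, f.name), hints[f.name]) for f in fields(type_)}
    if origin is list:
        return [dump_as_type(item, args[0]) for item in obj]
    if origin is dict:
        key_type, value_type = args
        return {dump_as_type(k, key_type): dump_as_type(v, value_type) for k, v in obj.items()}
    if origin is tuple:
        # Fixed-length tuples only; Tuple[X, ...] is not handled in this sketch.
        return tuple(dump_as_type(item, item_type) for item, item_type in zip(obj, args))
    return obj


app = AppData(items=["x"], users={1: (UserInDatabase("a", "b", "c"), 12.5)})
print(dump_as_type(app, AppData))
# {'items': ['x'], 'users': {1: ({'username': 'a', 'email': 'b'}, 12.5)}}
```

The same function works for a bare `List[UserAPI]` or any other supported annotation, which is what makes it usable by a framework that only knows the declared response type.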

dmontagu on 16 Sep 2019

Makes more sense, I'm now getting my head around the problem.

Here's an alternative solution (might or might not use some of the same code):

class UserCollection(BaseModel):
    users: Sequence[UserAPI]

    class Config:
        strict_models = True

using strict_models (or a better name) would change

https://github.com/samuelcolvin/pydantic/blob/ef894d20b3fd82e63ed033c69db6b4735c1ec6a1/pydantic/main.py#L450-L451

so that it does something like:

elif type(value) == cls:
    return value.copy()
elif isinstance(value, cls):
    return cls(**dict(value))

We could also extend validate for the case of a different pydantic model (including subclasses) to iterate over fields and check they look the same, and thereby avoid unnecessary repeat validation; but maybe that's not necessary.

Would that work for you?

Either way I think we can keep the utility functions parse_as_type and dump_as_type.
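
To illustrate the branching suggested above, here is a small sketch with standard-library dataclasses standing in for pydantic models. `validate_strict` is a hypothetical name, and unlike `cls(**dict(value))` in pydantic it performs no validation: it simply copies the declared fields.

```python
import copy
from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

T = TypeVar("T")


@dataclass
class UserAPI:
    username: str
    email: str


@dataclass
class UserInDatabase(UserAPI):
    hashed_password: str


def validate_strict(cls: Type[T], value: Any) -> T:
    """Return an instance whose type is exactly `cls` (toy sketch)."""
    if type(value) is cls:
        # Exact type: a shallow copy is enough.
        return copy.copy(value)
    if isinstance(value, cls):
        # Subclass instance: rebuild it from the fields declared on `cls` only.
        return cls(**{f.name: getattr(value, f.name) for f in fields(cls)})
    raise TypeError(f"expected {cls.__name__}, got {type(value).__name__}")


user = validate_strict(UserAPI, UserInDatabase("a", "b", "c"))
print(user)
# UserAPI(username='a', email='b')
```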

samuelcolvin on 16 Sep 2019

@samuelcolvin Yes, that's the idea (and I like the strict_models config idea). However, I have two issues:

1. Wouldn't that approach essentially repeat the parsing? That's what I was trying to avoid with this design; if it's going to repeat the validation then I think we'd lose the performance benefits. Eliminating the complexity around the cloned field would still be a benefit though. (Also, it would render strict-mode serialization impossible for models with non-idempotent types, if I understand correctly.)
2. I would prefer if it were possible to get the result of dump_as_type even for models without the config setting -- I think it could add some complexity if that needs to be added somehow by the framework.

dmontagu on 16 Sep 2019

Oh, another issue with changing dict:

What if the subclass is the top level item? So, if I wanted to return a UserAPI but received a UserInDatabase? If I call the .dict method, even in strict mode, I won't get the result I want. dump_as_type(user, UserAPI) should still generate the right result though.

That said, I'd definitely be amenable to an alternate approach that changes .dict in a simpler way, or that creates a different method (or set of methods), or that does something else entirely; my goal is just to get a fully-general dump_as_type function, ideally that is as performant as possible. (I'm also open to renaming.)

I'll hold off on writing any docs in #812 until we've hashed out the plan here, but to be clear I will be happy to handle any necessary work to get this through.

dmontagu on 16 Sep 2019

👍1

> Wouldn't that approach essentially repeat the parsing?

Well, only in the case of a subclass, not for the exact same class.

I'm concerned about this situation:

class UserAPI(BaseModel):
    username: str
    email: str

class UserCollection(BaseModel):
    users: Sequence[UserAPI]

class UserInDatabase(UserAPI):
    username: int
    hashed_password: str

(I know it doesn't make sense for username to be an integer, but you get the idea).

What does dump_as_type(collection, UserCollection) do? I know your main concern is excluding fields that aren't included in UserAPI, but if we've said we'll return something that looks like a UserAPI people would expect username to be a str. Does as_type implement this?

As I said, to fix this we either need to compare __fields__ (perhaps what fast-api is already doing?) or use full-on validation. Maybe there's another way I'm missing.
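
The field-comparison option could look roughly like the following sketch, which compares resolved annotations rather than pydantic's `__fields__` (an assumption made so the example runs with the standard library alone). `overridden_fields` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, get_type_hints


@dataclass
class UserAPI:
    username: str
    email: str


@dataclass
class UserInDatabase(UserAPI):
    username: int  # overrides the base annotation
    hashed_password: str


def overridden_fields(sub: type, base: type) -> Dict[str, Tuple[Any, Any]]:
    """Map each field of `base` that `sub` re-annotates to (base type, sub type)."""
    base_hints = get_type_hints(base)
    sub_hints = get_type_hints(sub)
    return {
        name: (base_hints[name], sub_hints[name])
        for name in base_hints
        if sub_hints.get(name) != base_hints[name]
    }


print(overridden_fields(UserInDatabase, UserAPI))
# {'username': (<class 'str'>, <class 'int'>)}
```

An empty result would mean the subclass only adds fields, so dropping the extras is enough; a non-empty result flags the fields that would still need conversion or validation.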

> I would prefer if it were possible to get the result of dump_as_type even for models without the config setting

> What if the subclass is the top level item?

I agree, but same as above, we need an approach that guarantees the type of fields as well as which fields. Does your current implementation achieve that?

samuelcolvin on 16 Sep 2019

@samuelcolvin Yes, the current implementation achieves this:

from pydantic import BaseModel, dump_as_type

class A(BaseModel):
    x: int

class B(BaseModel):
    x: str

print(dump_as_type(B(x="1"), A))
# {'x': 1}

print(dump_as_type(B(x="1"), B))
# {'x': '1'}

Though right now it has the following unfortunate error message if parsing fails:

print(dump_as_type(B(x="a"), A))
"""
pydantic.error_wrappers.ValidationError: 1 validation error for ParsingModel
value
  value is not a valid dict (type=type_error.dict)
"""

dmontagu on 16 Sep 2019

Well, if we can fix the above error, I'm happy with this change. Let's leave strict_models to another issue/day.

@tiangolo do you agree with this?

samuelcolvin on 16 Sep 2019

@samuelcolvin You actually did find a problem -- while the dump_as_type function does work as intended right now, the implementation of .dict with as_type won't handle conversions of the type you've described properly on its own:

print(B(x="a").dict(as_type=A))
# {'x': 'a'}

At best this could be considered highly confusing, though the term "broken" is probably more appropriate. I'm trying to fix this now.

Edit: Fixed now, modulo some debugging and performance considerations.

dmontagu on 16 Sep 2019

@dmontagu you're awesome. 😎 🚀 👏

Thanks for the thorough explanation, exploration, and PR.

@samuelcolvin yes, this would be a huge improvement (from the FastAPI point of view) to avoid current "hacks" in FastAPI and to keep it closer to Pydantic. I'm sure this would also reduce a lot of future (or even current?) related bugs. I'm pretty sure this would help other current or future Pydantic-based tools as well.