Output of `python -c "import pydantic.utils; print(pydantic.utils.version_info())"`:

```
pydantic version: 1.3
pydantic compiled: False
install path: /usr/local/lib/python3.8/site-packages/pydantic
python version: 3.8.0 (default, Nov 15 2019, 02:22:06) [GCC 8.3.0]
platform: Linux-5.3.8-133.current-x86_64-with
optional deps. installed: ['typing-extensions', 'email-validator']
```
```python
from uuid import UUID

import asyncpg
from pydantic import BaseModel

struct_dict = {
    "id": asyncpg.pgproto.pgproto.UUID('c9ecc20f-acd6-429a-b781-c59f803af240')
}

class StructSchema(BaseModel):
    id: UUID

struct = StructSchema(**struct_dict)

assert type(struct.id) == asyncpg.pgproto.pgproto.UUID
assert isinstance(struct.id, UUID)

try:
    struct.json()  # raises an error
except Exception as err:
    print(err)

struct.id = UUID(str(struct.id))
struct.json()  # ok, because now it really is a uuid.UUID instead of asyncpg's subclass
```
Funny one, but a little annoying =)
~~I believe this is related to how python's built-in `json.dumps` function works -- the `default` keyword argument takes a dictionary of (exact) type to encoder. Pydantic adds an encoder for `uuid.UUID` by default, but I believe the lookup is done by actual type rather than subclass (presumably for performance).~~
You can set a custom encoder for the `asyncpg.pgproto.pgproto.UUID` type in the model `Config` via `json_encoders`, which should fix this issue: https://pydantic-docs.helpmanual.io/usage/exporting_models/#modeljson
~~It would be nice if subclasses were handled automatically, but I believe there are not-insignificant performance implications since you'd have to loop over all possible encoders looking for one that could encode, rather than being able to do a dict-based lookup. There is actually an open PR in FastAPI trying to address this (and looking at switching to subclass-based checks), but I'm not sure the performance implications are worth it.~~
Edit: Some of this comment was wrong, as @elprans noted below
> the default keyword argument takes a dictionary of (exact) type to encoder

Not true. The `default` argument takes a callback.
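To illustrate that point, here is a minimal stdlib-only sketch (the `MyUUID` subclass is a stand-in for asyncpg's UUID type, which is not imported here): because `default` is an ordinary callback, it can use `isinstance()` and therefore handle subclasses just fine:

```python
import json
import uuid

class MyUUID(uuid.UUID):
    """Stand-in for a third-party UUID subclass like asyncpg's."""

def default(obj):
    # isinstance() matches subclasses as well as the exact type,
    # unlike an exact-type dict lookup keyed on type(obj).
    if isinstance(obj, uuid.UUID):
        return str(obj)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

value = MyUUID("c9ecc20f-acd6-429a-b781-c59f803af240")
print(json.dumps({"id": value}, default=default))
# → {"id": "c9ecc20f-acd6-429a-b781-c59f803af240"}
```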
> It would be nice if subclasses were handled automatically, but I believe there are not-insignificant performance implications since you'd have to loop over all possible encoders looking for one that could encode, rather than being able to do a dict-based lookup.

`singledispatch` is a thing, supports subclasses properly, and is fairly close to a naive dict lookup because the dispatch is cached.
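As a minimal sketch of what subclass-aware dispatch looks like with `functools.singledispatch` (the types and handlers below are illustrative, not pydantic's actual encoder table):

```python
from functools import singledispatch
import datetime
import uuid

@singledispatch
def encode(obj):
    raise TypeError(f"cannot encode {type(obj).__name__}")

@encode.register(uuid.UUID)
def _(obj):
    return str(obj)

@encode.register(datetime.datetime)
def _(obj):
    return obj.isoformat()

class ThirdPartyUUID(uuid.UUID):
    """Never registered explicitly."""

# Dispatch follows the MRO, so a subclass hits the uuid.UUID handler,
# and the resolved implementation is cached for subsequent calls.
print(encode(ThirdPartyUUID("c9ecc20f-acd6-429a-b781-c59f803af240")))
# → c9ecc20f-acd6-429a-b781-c59f803af240
```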
@elprans You're totally right -- I got confused while digging through the pydantic source and thought a dict was passed directly to the `default` keyword argument, but that was wrong (it was `pydantic_encoder`).
I actually really like the idea of using `singledispatch` for this. Though it might make the implementation a little longer, I think it would give the best of both worlds in terms of performance and usability, which is worth the more expanded (and perhaps subjectively uglier) syntax.
@samuelcolvin thoughts? I could actually see `singledispatch` substantially speeding up `pydantic_encoder` (and it may be a good idea to use with `jsonable_encoder` in fastapi too) if it's implemented the way I think it is.
Hmm, I played with `singledispatch` and it is a lot slower than I expected. It adds about 50% overhead to fastapi's `jsonable_encoder` when called on a relatively large list of dicts of int to int if you don't specialize the actual methods, so I think it is adding a lot of overhead. Fully specializing helps, but even after specializing as much as possible it was still about 5-10% slower than the existing non-`singledispatch` approach. So I'm no longer very optimistic about that approach from a performance perspective.
Direct answer to this question:

- `Config.json_encoders`, which could be used here to add extra types to convert.
- `Config.json_dumps`, which allows you to provide your own function for serializing JSON.
- `json()` can already take an `encoder` argument which allows you to implement your own function to pass to `default` in `json.dumps`.

I think we already provide a lot of escape hatches for customising how JSON is encoded while providing a decent and performant default.
More generally, this also relates to #951 and how we extend pydantic to support "partial serialisation", e.g. serializing a model to a JSON-serialisable dict (originally suggested and partially implemented by @tiangolo in #317, but that was closed). In fact it relates to lots of issues; I tried to gather these by labeling them all "serialization".

I promised to work on #951 but haven't got around to it yet.
Surely internally `singledispatch` is doing some variant of:

```python
function_lookup = [(TypeA, convert_a), (TypeB, convert_b), ...]
...
for type_, func in function_lookup:
    if isinstance(v, type_):
        return func(v)
raise CantDoIt(...)
```
Which is always going to be slower (though more user friendly) than what we're doing now, which is essentially:

```python
function_lookup = {TypeA: convert_a, TypeB: convert_b, ...}
...
try:
    func = function_lookup[type(v)]
except KeyError:
    raise CantDoIt(...)
else:
    return func(v)
```
I would propose we do the following:

Allow another callback to be called if `pydantic_encoder` fails. That callback could implement something like the `for type_, func in function_lookup:` approach above; performance isn't as important there, since it's up to the developer how to implement it. For this case they could simply do `if isinstance(v, asyncpg.pgproto.pgproto.UUID): ...`. Maybe this is unnecessary: people could already accomplish this by overriding the `encoder` argument in `json()`, then calling `pydantic_encoder`, catching `TypeError` and doing their own thing at that point.
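A rough stdlib-only sketch of that fallback pattern, with a hypothetical `ENCODERS_BY_TYPE` table and `pydantic_style_encoder` standing in for pydantic's exact-type lookup (not pydantic's real internals):

```python
import json
import uuid

# Hypothetical stand-in for pydantic's exact-type encoder table.
ENCODERS_BY_TYPE = {uuid.UUID: str}

def pydantic_style_encoder(obj):
    # Fast path: exact-type dict lookup, as pydantic_encoder does today.
    try:
        return ENCODERS_BY_TYPE[type(obj)](obj)
    except KeyError:
        raise TypeError(f"cannot encode {type(obj).__name__}")

class AsyncpgLikeUUID(uuid.UUID):
    """Stand-in for asyncpg.pgproto.pgproto.UUID."""

def my_encoder(obj):
    # Try the fast exact-type lookup first; fall back to isinstance checks.
    try:
        return pydantic_style_encoder(obj)
    except TypeError:
        if isinstance(obj, uuid.UUID):
            return str(obj)
        raise

value = AsyncpgLikeUUID("c9ecc20f-acd6-429a-b781-c59f803af240")
print(json.dumps({"id": value}, default=my_encoder))
# → {"id": "c9ecc20f-acd6-429a-b781-c59f803af240"}
```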
Also (but not strictly related to this issue):

- `__get_validators__` allows custom types to customise how they're parsed/validated, and `__modify_schema__` allows them to customise the schema generated for them. Perhaps we could add something similar for serialization via a method `__serialize__(self, protocol)`, where `protocol` could be JSON or some other destination protocol.
- A `protocol` argument to `dict()` which, if not `None`, causes `pydantic_encoder` to be called on each value; that would accomplish what was originally suggested in #317 and other issues.

> it was still about 5-10% slower than the existing non-singledispatch approach
IMO, 5-10% is a reasonable tradeoff for actually doing the Pythonic thing. Being able to pass a subclass to a function taking a superclass is a fundamental norm of Python. json.dumps() itself has no problems with subclasses of types it supports, and every stdlib function behaves likewise.
@samuelcolvin I assumed it attempted a cache lookup first, which presumably would be as fast as what we have now (similar to how our `GenericModel.__class_getitem__` works), then only if that fails does it do the subsequent isinstance checks, and then after those checks registers the actual type so it can be looked up via cache on subsequent calls.

Admittedly, that is a big assumption (and presumably wouldn't be too hard to check by just reading the source, which I haven't done 😅). If it is not implemented that way, though, I wonder why not.
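For what it's worth, CPython's `functools.singledispatch` does cache resolution per class after the first lookup; `dispatch()` exposes the implementation that would be used, which makes the subclass resolution easy to check (the types below are illustrative):

```python
from functools import singledispatch

@singledispatch
def encode(obj):
    raise TypeError(f"cannot encode {type(obj).__name__}")

@encode.register(int)
def _(obj):
    return obj

class MyInt(int):
    """A subclass that was never registered explicitly."""

# The subclass resolves (via the MRO) to the handler registered for its
# base class, and CPython caches that resolution per class.
assert encode.dispatch(MyInt) is encode.dispatch(int)
print(encode(MyInt(3)))
# → 3
```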
@elprans During the development of this library we (well, mostly @samuelcolvin 😅) have done our best to consistently make a priority of small performance gains (to the tune of 1-2%; smaller than this for sure), and the end result is a library that is substantially faster than alternatives. Performance has been and is likely to remain a critical priority, and I think a good solution here really shouldn't come at the expense of performance, especially for users that don't take advantage of the capability.

I am still inclined to keep looking for ways to achieve the desired usability without sacrificing performance (and I suspect such an approach may well exist; @samuelcolvin's suggestion is a step in the right direction, but maybe it can still be simplified further).