I'm considering adopting pydantic in a project that serializes numpy arrays as base64-encoded gzipped strings. The current serialization solution allows registering type-specific hooks for encoding and decoding, so that after registering the hooks any field annotated as numpy.ndarray will automatically be handled correctly:
class Example(object):
    big_array: numpy.ndarray

ex = Example(big_array=numpy.arange(50))
serialized = serialize_to_dict(ex)
print(serialized)
# { 'big_array': 'H4sIAAEAAAAC/xXOyRHCUAwE0VQUgA7MArZjocg/Dfqf+5VG39eOdryTne68dz47186985BOpgsg\nhCDCCCSUYMIZ53MHZ5xxxhlnnHHGBRdcziAuuOCCCy644Iorrriez3DFFVdccX1+f1+8uIe+AAAA\n' }
deserialized = deserialize_dict(serialized, Example)
print(deserialized.big_array)
# array([0, ..., 49])
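For context, a rough sketch of the kind of hooks the existing (non-pydantic) solution might register; serialize_to_dict, deserialize_dict and the hook-registration API belong to that in-house solution, and encode_array/decode_array plus the dtype handling here are illustrative assumptions:

import base64
import gzip

import numpy


def encode_array(arr: numpy.ndarray) -> str:
    # gzip the raw bytes, then base64-encode so the result is a plain string
    return base64.b64encode(gzip.compress(arr.tobytes())).decode('ascii')


def decode_array(data: str, dtype=numpy.int64) -> numpy.ndarray:
    # shape and dtype are not stored here, so this only round-trips 1-D arrays
    # whose dtype is known in advance
    return numpy.frombuffer(gzip.decompress(base64.b64decode(data)), dtype=dtype)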
I noticed that Config has a json_encoders field that seems to allow this for encoding, but I haven't seen a way to customize decoding (maybe I'm just too sleep-deprived). Is there a way to achieve the above behavior using pydantic?
See #380 for discussion of the same thing. That issue goes off topic towards the end, so start from the beginning.
@samuelcolvin Great, thanks for the quick reply! I somehow overlooked (I blame sleep deprivation) that validators can be used for parsing as well as validating.
In a sense that feels obvious in retrospect (if customizing parsing is possible at all, validators are a logical place to look, since it's hard to separate parsing from validation), but maybe it would be helpful to make this more explicit in the documentation.
I'm not sure what the best way would be. The example in the "Custom Data Types" section could be adjusted to have custom parsing and serialization, but it might still be hard to discover that if you're just scanning the documentation to figure out whether pydantic can do this or not.
Perhaps the "Exporting models" subheading under "Usage" could be renamed to something like "Importing and exporting models" and it could cover both customizing the conversion between class instances and dicts (what I was trying to figure out) and the conversion between dicts and json (what is currently discussed under "Custom JSON (de)serialisation"). It could include something like your second example from #380, but with a json_decoders Config added to include everything in one example.
If you think something like this would be a good idea, I could try to make a PR with a suggestion at some point (maybe I should try to use the library first). Thanks for making pydantic!
I would just extend /usage/validators/ to be more explicit about parsing as well as validation.
It would also be useful to add comments with links to other sections saying "you can also perform parsing on a per-field basis using validators", e.g. links from:
/usage/models/#helper-functions
/usage/types/#custom-data-types
/usage/exporting_models/#custom-json-deserialisation

class TypedArray(numpy.ndarray):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate_type

    @classmethod
    def validate_type(cls, val):
        return numpy.array(val, dtype=cls.inner_type)


class ArrayMeta(type):
    def __getitem__(self, t):
        return type('Array', (TypedArray,), {'inner_type': t})


class Array(numpy.ndarray, metaclass=ArrayMeta):
    pass


class Model(BaseModel):
    class Config:
        json_encoders = {
            numpy.ndarray: lambda x: "ingenious encoding",
        }

    values: Array[float]
If I want to add customized serialization to the example in #308, the above works, but is there a good way to get the same encoding behavior everywhere? Write my own class that inherits from BaseModel and put the Config there, and then have all the models inherit from that?
It would appeal to my sense of symmetry if I could somehow define the encoding behavior in TypedArray where the parsing validator is defined, but I couldn't think of a way to do that.
Not currently, but it's a very interesting idea.
Like __get_schema__ for customising the schema associated with types, I think it sounds like a good idea to accept a __serialise__ method which guarantees to return a "simple" (e.g. JSON-valid) type. This would be called (optionally, I guess) by .dict() and always by .json().
There was a discussion on more or less this subject recently on python-ideas called "JSON encoder protocol", but like almost all python-ideas conversations I think it descended into a competition for people with beards and sandals to demonstrate how long they've been writing python and how small-minded and pedantic they can be; it also doesn't seem to be available anywhere indexable by Google. So it achieved sweet FA.
@dmontagu what do you think of this? I might also help with #692?
this might also be a good solution for #317
Right, I was looking for something like __serialise__. Would it be difficult to add? (I imagine it might not be, but I'm not that familiar with pydantic's internals.)
@samuelcolvin So would it be as simple as returning
{
    get_key(k): (v.__serialise__() if hasattr(v, '__serialise__') else v)
    for k, v in self._iter(
        # ...
    )
}
from .dict(), or are there other considerations? I could try to make a PR for this (with documentation updates) if you'd like.
I need something like this myself, although it seems like I can get by for now with something like
import numpy
from pydantic import BaseModel

# encode_numpy_array is a user-defined encoder (e.g. gzip + base64 of the raw bytes)

class MyBaseModel(BaseModel):
    class Config:
        json_encoders = {
            numpy.ndarray: encode_numpy_array
        }
and then inheriting from MyBaseModel instead of BaseModel.
The main implementation would be more or less that, but it would need to work recursively somehow, so that a field foobar: List[MyComplexThing] would call __serialise__ on every member of the list.
I'm also concerned about what to do with standard types that might need simplifying for output.
Perhaps we should do something like #317, e.g.:
- add a simplify kwarg to dict()
- simplify=True causes pydantic_encoder to be called recursively on the dict (pydantic_encoder would need modifying to look more like jsonable_encoder)
- pydantic_encoder looks for __serialise__ and calls it if it exists, thus model.json() would work with __serialise__ without the slow-down of simplify=True.

What do you think?
Looking at that, I'm not sure implementing this will be as simple as initially thought, especially given that performance is important, so we'll probably need a micro-benchmark. Feel free to start a PR, otherwise I'll work on it in a couple of weeks.
Hmm, I thought the recursion was already taken care of by ._get_value and ._iter? That is, .dict() calls ._iter, which calls ._get_value for each field; ._get_value sees that the foobar field is a list and calls itself for each item in the list; then each item is an instance of BaseModel and ._get_value will call .dict() for it, which (in my naive implementation) would call __serialise__. I'm probably missing something since I'm reading the code for the first time.
When you talk about needing to simplify standard types, do you mean something like datetime objects that currently aren't touched when you call .dict()? So if you want to serialize to something other than JSON, the output from .dict() may not be serializable as is.
I kind of like the idea that I could write MyClass so that MyClass(**my_obj.dict()) would work, but I don't think it matters in practice if I have to use MyClass(**my_obj.dict(simplify=True)) instead, so in that sense your proposed solution seems to work just as well as my idea.
If this turns out to be complicated, and you think you have the time and inclination to work on it in the near future, perhaps it's better if you do it given my unfamiliarity with the code base and the finer points of python (I don't have a good intuition about performance, for instance).
@samuelcolvin Hmm, one thing that just occurred to me is that if __serialise__ is a method of Array, then I guess that means that one needs to make sure that any numpy array that gets assigned to an Array-annotated field really is an Array, or serialization will fail. You could argue that this isn't a problem - it's an Array field, of course you need to make sure it's an Array - but in my mind the Array class exists only to configure serialization and deserialization of numpy arrays, so ideally most of my code wouldn't need to know about it.
If __serialise__ is a class method like validators are, then I guess it should be enough to use Array in field type annotations, and actual code can use standard numpy arrays.
Yes, I guess either:
- the Array type needs to return an instance of itself, not a raw numpy array, or
- __serialise__ has to be looked up from the field type rather than from the value, which would be difficult.

I guess there's a third option where the Array type returns an actual numpy array but adds a __serialise__ method to it.
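A minimal sketch of the first option, reusing the TypedArray example from above (assumptions: inner_type is still set via the Array metaclass, and ndarray.view is used to re-type the array without copying):

import numpy


class TypedArray(numpy.ndarray):
    @classmethod
    def __get_validators__(cls):
        yield cls.validate_type

    @classmethod
    def validate_type(cls, val):
        # return an instance of the subclass itself, not a plain ndarray,
        # so a (proposed) __serialise__ defined here would be found on the stored value
        return numpy.asarray(val, dtype=cls.inner_type).view(cls)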
I was thinking specifically of a scenario like
class Foo(BaseModel):
    arr: Array

foo = Foo(arr=numpy.array([1, 2, 3]))
# ...
foo.arr = returns_numpy_arrays()
so the Array constructor wouldn't get called.
I didn't realize that getting the class from the field type would be difficult (I thought that was already being done under the hood somewhere in BaseModel, but I haven't actually delved that deep into the implementation).
Anyway, I guess having to call Array explicitly in some places isn't the end of the world; I just thought I'd mention this in case there was a good solution for avoiding that.
foo.arr = returns_numpy_arrays() will be fine if you have validate_assignment = True.
What's returned by the validators yielded by Array.__get_validators__() is completely your choice.
Ah, good point, so I can just add that to the Config class of MyBaseModel to make that the default behavior in all classes (intended to be serialized).
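For example, a sketch of that shared base model with both settings (encode_numpy_array is the user-defined encoder from the earlier snippet, not a pydantic helper):

import numpy
from pydantic import BaseModel


class MyBaseModel(BaseModel):
    class Config:
        validate_assignment = True  # re-run field validators on attribute assignment
        json_encoders = {
            numpy.ndarray: encode_numpy_array,  # user-defined encoder from earlier
        }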
When a custom json_encoder is specified in a model's Config class and the .json() method gets called, does it look at the type of the object actually assigned to the attribute, or at the type specified as a type hint for that attribute, to identify the encoding function?
I think I am trying to do the same as @kryft, but with pandas DataFrames. While pandas has many functions for saving DataFrames, its JSON functions do not produce round-trippable JSON for DataFrames with multilevel indexes or column headers. Therefore it's necessary to store extra metadata about the DataFrame alongside the JSON output so the DataFrame can be properly reconstructed.
import typing as tp

import pandas as pd
from pydantic import BaseModel, validator


class IntermediateSplitDataFrame(BaseModel):
    data: tp.Sequence[tp.List]
    columns: tp.Sequence[tp.Union[tuple, str, int, float]]
    columns_type: str
    columns_names: tp.Sequence[tp.Union[str, None]]
    index: tp.Sequence[tp.Union[tuple, str, int, float]]
    index_type: str
    index_names: tp.Sequence[tp.Union[str, None]]

    @classmethod
    def from_dataframe(cls, dataframe):
        as_dict = dataframe.to_dict(orient='split')
        imd = cls(
            data=as_dict['data'],
            columns=as_dict['columns'],
            columns_type=dataframe.columns.__class__.__name__,
            columns_names=list(dataframe.columns.names),
            index=as_dict['index'],
            index_type=dataframe.index.__class__.__name__,
            index_names=list(dataframe.index.names),
        )
        return imd

    def make_index(self, axis=0):
        """Reconstruct a row or column index..."""
        # do_stuff
        # return a pandas.Index or pandas.MultiIndex
        return out_index

    def to_dataframe(self):
        """Encode de-serialized raw data back into a pandas DataFrame...

        Hopefully with the same structure, Index and column format it was exported as.
        """
        index = self.make_index(axis=0)
        columns = self.make_index(axis=1)
        return pd.DataFrame(self.data, index=index, columns=columns)


def frame_to_dict(dataframe):
    imd = IntermediateSplitDataFrame.from_dataframe(dataframe)
    return imd.dict()
So below I am using the above model more like a validator and a convenient place to store the encoding function (hence the name IntermediateSplitDataFrame) than as a type.
class AppModel(BaseModel):
    dframe: pd.DataFrame = None

    class Config:
        arbitrary_types_allowed = True
        json_encoders = {
            pd.DataFrame: frame_to_dict
        }

    @validator('dframe', pre=True)
    def validate_dataframe(cls, v):
        if isinstance(v, dict):
            imd = IntermediateSplitDataFrame(**v)
            return imd.to_dataframe()
        elif isinstance(v, pd.DataFrame):
            return v
        else:
            # raise ValueError rather than ValidationError: validators should raise
            # ValueError/TypeError and pydantic wraps them into a ValidationError itself
            raise ValueError("must be a DataFrame or raw representation of IntermediateSplitDataFrame.")
So my problem is that I want a DataFrame to live on that attribute on the instance of the AppModel model, but when it gets serialized, to be serialized as the IntermediateSplitDataFrame. I was under the assumption that the type specified in json_encoders had to be the type of the object assigned to the instance attribute, but now I'm not sure.
When you specify a custom json_encoder for a type in the Config class and the .json() method gets called, does it look at the type of the object actually assigned to the attribute or the type specified as a type hint for that attribute to identify the encoding function?
It'll use the type of that object, not the type hint. I think it should stay that way; I was just thinking aloud.
Before I go any further, thank you so much for pydantic, it has already saved me so much time!
So what should the validator really be doing? Validating the live python object I am assigning to the attribute, or validating the dictified input coming from the parse_raw() method? Right now I have it doing both and that doesn't seem like what it's really meant for.
Essentially I think I am looking for a proxy type that allows me to control how a python object I don't have control over is encoded (dictified) -> serialized, and then de-serialized -> decoded (dict to pyobject).
I'm afraid I'm not clear, there's no parse_raw function in the code above. Please create a separate issue to discuss this or ask on stack overflow, it sounds pretty specific to your case and not related to the discussion here.
Sorry, I meant the BaseModel.parse_raw() class method; I didn't specifically exercise it in the example above.
but I should be able to do...
app = AppModel(dframe=somedataframe)
app_json = app.json()
app2 = AppModel.parse_raw(app_json)
isinstance(app2.dframe, pd.DataFrame)  # True
My example above does work but I feel like I am hacking it and not using pydantic properly.
Please create a separate issue.
@dmontagu what do you think of this? I might also help with #692?
Sorry, for some reason I didn't notice this sooner
…string if serializing to base64)

Correct me if I'm wrong, but I think it is much more performant to specify a json_encoder rather than recursively checking for and calling a __serialize__ method.
It would be great if there was a way to "upstream" the json_encoders to parent models to ensure high-speed serialization without heavy code repetition
- …a json_encoders attribute in each field's config (if the field's type has in one way or another specified a custom encoder)
- …Config.json_encoders to {} if you 1) want to make use of distinct types with conflicting json encoders in the same model, and 2) don't intend to json serialize (otherwise you'd need to specify anyway)
- …json_encoders attribute

I think you could support a custom __serialize__ method alongside the approach described above by automatically registering a json encoder making use of the __serialize__ method for any type that implements it (and, again, automatically propagating it upward). (This would be overridden in the presence of a manually specified config.)
Okay, I'm going to take a crack at this and see how it goes. I'll let you know how I get on with a draft PR.
@dmontagu I can't comment on performance, but if this is implemented using json_encoders, doesn't that mean that you still couldn't customize what happens when you call .dict()? This doesn't matter in practice for me personally right now (I'm serializing everything to JSON), but it feels odd that validators customize how you turn a dict into an object, yet you can't customize how an object gets turned into a dict. It would be a practical issue if you want to serialize a dict into something other than JSON.
You did mention that it would be nice if there were hooks for serialization on the type itself. Did you mean the json_encoders bubbling upwards that you detailed later in the comment, or something else (like maybe the optional custom __serialize__ method)?
I'm currently using undesirable alternative 1 as a temporary workaround: I have a class MyBaseModel(pydantic.BaseModel) with a json_encoders in the Config, and I have to make sure all my serializable classes inherit that instead of pydantic.BaseModel.
One note here is that JSON is fairly suboptimal for NumPy array encodings as you often need to serialize the raw bytes to ASCII and back which takes quite a bit of time. We had switched to a msgpack solution which is much more optimal for shipping arrays. The __serialize__ is a great idea and will help a large variety of use cases, but do consider JSON alternatives if possible.
@dgasmith I'm assuming you mean serializing to a JSON array of numbers?
If serialization speed is an issue, I would think base64-encoding the raw bytes of the array (along with json-encoding the dtype if desired/necessary) would be even faster than msgpack (and would allow you to keep using plain JSON). By this, I mean calling base64.b64encode on the numpy array's raw bytes buffer.
But I haven't tested this assumption. Do you have evidence that this is wrong?
(This is what I do in my own projects where I need to serialize float/int arrays.)
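A minimal sketch of that approach; the helper names here are illustrative, not from any of the projects mentioned:

import base64

import numpy as np


def to_jsonable(arr: np.ndarray) -> dict:
    # keep dtype and shape alongside the base64 payload so decoding can rebuild the array
    return {
        'data': base64.b64encode(arr.tobytes()).decode('ascii'),
        'dtype': str(arr.dtype),
        'shape': list(arr.shape),
    }


def from_jsonable(obj: dict) -> np.ndarray:
    raw = base64.b64decode(obj['data'])
    return np.frombuffer(raw, dtype=np.dtype(obj['dtype'])).reshape(obj['shape'])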
With msgpack you can skip the base64encode as you can store raw bytes (up to 2GBish) in msgpack directly. See an implementation here.
We found avoiding the base64 encoding to be maybe 2x faster. In addition, with numpy or similar, the array.tobytes() format gives you float32/float64 like msgpack expects, and special classes can be skipped for even more performance. Another nice item is that the msgpack sizes will be smaller as well, since you skip the encoding:
>>> a = np.random.rand(4)
>>> len(a.tobytes())
32
>>> len(a.tobytes().hex())
64
>>> len(base64.b64encode(a.tobytes()))
44
I see, thanks for sharing, yeah that definitely seems worthwhile for certain applications if you are willing to ditch JSON.
However, in this case, I still think you'd be better off dropping __serialize__, and instead having the numpy-array-to-bytes conversion happen inside the msgpack.dump call (or similar), which you'd call on the result of .dict() (similar to using a json_encoder above). This appears to be supported, at least by the msgpack package.
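For example, a sketch using msgpack's default/object_hook parameters; the Payload model, the helper names, and the '__ndarray__' tag are illustrative assumptions, not part of pydantic or msgpack:

import msgpack
import numpy as np
from pydantic import BaseModel


class Payload(BaseModel):
    label: str
    values: np.ndarray = None

    class Config:
        arbitrary_types_allowed = True  # allow the plain ndarray annotation


def pack_default(obj):
    # called by msgpack for types it doesn't know how to pack
    if isinstance(obj, np.ndarray):
        return {'__ndarray__': True, 'data': obj.tobytes(),
                'dtype': str(obj.dtype), 'shape': list(obj.shape)}
    raise TypeError(f'cannot serialize {type(obj)}')


def unpack_hook(obj):
    # called for every unpacked map; rebuild tagged arrays, pass everything else through
    if obj.get('__ndarray__'):
        return np.frombuffer(obj['data'], dtype=np.dtype(obj['dtype'])).reshape(obj['shape'])
    return obj


model = Payload(label='example', values=np.arange(4.0))
packed = msgpack.packb(model.dict(), default=pack_default, use_bin_type=True)
restored = Payload(**msgpack.unpackb(packed, object_hook=unpack_hook, raw=False))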
Right, we just hack a derived class on top of the BaseModel (see here). This works fine, but we are always looking to have things a bit more canonical so we don't have to re-engineer between pydantic releases.
Most of the JSON isn't float arrays, and keeping that data human-readable (and diffable) is more valuable than improving the performance of the array serialization.
Anyway, serializing and deserializing numpy arrays was mainly intended as an example of a use case for __serialize__. I still think there should ideally be a way to ensure that if my_obj == MyClass(**some_dict), then my_obj.dict() == some_dict. If it's for some reason impossible to implement __serialize__ without adversely affecting performance for people who don't want to use it, then I guess that's a tough call.
Just wanted to chime in and also voice my support for an easy way to specify custom serialization/deserialization on the serialized class itself (not the containing class).
Maybe Pydantic could check each unrecognized/nonstandard field for serialize and deserialize methods?
Happy to help out if this is still of interest.
This came up in FastAPI as well: https://github.com/tiangolo/fastapi/issues/1285
Custom validation is already supported, via the __get_validators__ method, see Custom Types. Custom schema is also possible via __modify_schema__.
Customising serialization while maintaining performance is not at all trivial. But I'm going to try and improve it in v2.
Understood; it just pains me to spread my serialization logic between __get_validators__()/validate() and json_encoders in the containing class's Config class, but it is certainly workable.
Sounds great for v2. If you'd like any help when you get there please just give a shout. Overall awesome library! Very much appreciate it.
I'm facing a similar scenario to the one being discussed here where I need to serialise and deserialise a large class/object as efficiently as possible.
The class is made up of several custom component classes (which I will also be converting to use pydantic) and the bulk of the data is stored in numpy arrays.
I've been using pickle, but the resulting file is almost 3 times larger than the size of the object in memory.
Based on what I've read in this thread I'm going to convert all my custom models to use pydantic, get the data out probably using
my_object_data = my_object_instance.dict()
and then attempt to serialise that to a message pack file using the handy serialiser utilities provided in this thread
msgpackext_dumps(my_object_data)
Am I missing anything?
(Also I appreciate this is a feature discussion and not a help thread but this is such a niche issue and the information here has been so helpful)
I have a problem where I would like to have a custom type for a specific date format.
At the moment I don't see how I can implement this by just defining the custom type.
But if I understand @samuelcolvin, then with the proposed change that would look something like this:
from datetime import datetime


class MyDate:
    def __init__(self, date: datetime) -> None:
        self.date = date

    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v: str) -> 'MyDate':
        return cls(datetime.strptime(v, '%Y%m%d'))

    @classmethod
    def __serialize__(cls, v: 'MyDate'):
        # proposed hook (not yet part of pydantic)
        return v.date.strftime('%Y%m%d')
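For illustration, under the proposed change it might be used roughly like this (hypothetical behaviour; the Event model is just an example, and the .json() output assumes __serialize__ gets picked up):

from pydantic import BaseModel


class Event(BaseModel):
    when: MyDate


e = Event(when='20200131')
# under the proposal, e.json() would call MyDate.__serialize__ and
# produce '{"when": "20200131"}'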
Or is there another way? I'm only looking for solutions where the serialization is defined in the custom type not in the model.
Also is there any update on this?