Pydantic: Support for value-based polymorphism

Created on 3 May 2019 · 6 comments · Source: samuelcolvin/pydantic

Hi guys! I'd love to use pydantic but I'm finding it hard to understand how I could use polymorphic types. Say that I have these classes:

class BaseItem:
    pass

class ConcreteItemA(BaseItem):
    a: str

class ConcreteItemB(BaseItem):
    b: int

and their corresponding JSON representation, where the type becomes JSON field:

{
    "type": "item-a",
    "a": "some-string"
},
{
    "type": "item-b",
    "b": 10
}

I'd like to have a model (possibly BaseItem) that is capable of doing this kind of multiplexing, in both serialization and deserialization (i.e. I want to load a ConcreteItem, but I don't know which one until I read the JSON). Just to add more complexity, the hierarchy could be deeper and some items might need to be self-referencing (i.e. an item that has a List[BaseItem]).

Is there anything built-in in pydantic? Any hint on how this could be achieved?

Thanks!

question

All 6 comments

I'm currently unavailable.

Sorry, that was a bad joke about the issue id.

This is possible, but without knowing all of what you're doing it's hard to give a full solution, still here's a broad outline:

from typing import Union, List

from pydantic import BaseModel, validator

class ConcreteItemA(BaseModel):
    type: str
    a: str

    @validator('type')
    def check_type(cls, v):
        if v != 'item-a':
            raise ValueError('not item-a')
        return v

class ConcreteItemB(BaseModel):
    type: str
    b: int

    @validator('type')
    def check_type(cls, v):
        if v != 'item-b':
            raise ValueError('not item-b')
        return v

class BaseItem(BaseModel):
    root: List[Union[ConcreteItemA, ConcreteItemB]]

m = BaseItem(root=[
    {
        'type': 'item-a',
        'a': 'some-string'
    },
    {
        'type': 'item-b',
        'b': 10,
    }
])
print(m.root)
print(m.dict())

Gives:

[<ConcreteItemA type='item-a' a='some-string'>, <ConcreteItemB type='item-b' b=10>]
{'root': [{'type': 'item-a', 'a': 'some-string'}, {'type': 'item-b', 'b': 10}]}

There are a couple of warts on this approach I'm afraid:

  • you have to use a validator to force type, that should be fixed in #469, or you could use a single element enum, but that's just as ugly
  • unfortunately the polymorphism only works here on the root field, not on the base model itself, eg. you can't do Union[ConcreteItemA, ConcreteItemB].parse_obj(...) or something. I'm afraid I don't know a good way around this except to do something like
error = None
for model_cls in [ConcreteItemA, ConcreteItemB]:
    try:
        return model_cls(**data)
    except ValidationError as e:
        error = e
raise error

Which is effectively what pydantic is doing when it sees root: Union[ConcreteItemA, ConcreteItemB] anyway.

Hope that helps, let me know if you need more info.

@samuelcolvin
Right now I'm doing the same loop to identify the most suitable model type

error = None
for model_cls in [ConcreteItemA, ConcreteItemB]:
    try:
        return model_cls(**data)
    except ValidationError as e:
        error = e
raise error

But since my validation is done in a backend service and I have to check quite large and nested JSON payloads, this seems quite inefficient to me.

What about introducing a new class, say Resolver, with which you can register all available models:

  • Resolver.add_model(model)

And later you can call Resolver.validate(json) (like Model.parse_obj()), which will return:

  • The best suitable (and most specific) model
  • None if it couldn't find any appropriate model (because errors are swallowed in the for loop), or raise an exception describing why none of the models were applicable (but the latter might be difficult to implement)

So, Resolver.validate(json) will never raise an exception.
But the more important property of Resolver is that it can detect the correct model. And because Resolver knows all the models, it can do that in the most efficient way.

For example, Resolver.validate(json) can start checking with unique required fields.

from typing import Optional

from pydantic import BaseModel

class ModelA(BaseModel):
    type: str
    message: str

class ModelB(BaseModel):
    type: Optional[str]
    message: str = None  # default None (so not required)

If the json doesn't contain the field type, we can discard ModelA without checking any other fields. The same technique could be applied for const fields.
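Sketching that idea (Resolver, add_model, and validate are hypothetical names, not pydantic API; required field names are passed explicitly at registration rather than introspected, to keep the sketch independent of pydantic's internal field API):

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Resolver:
    """Hypothetical registry: tries registered models, cheapest checks first."""

    def __init__(self):
        self._models = []  # list of (model, required_field_names)

    def add_model(self, model, required=()):
        self._models.append((model, set(required)))

    def validate(self, data):
        if not isinstance(data, dict):
            return None
        for model, required in self._models:
            # cheap pre-check: discard models whose required fields are absent
            if not required <= data.keys():
                continue
            try:
                return model(**data)
            except ValidationError:
                continue
        return None  # never raises: no model matched


class ModelA(BaseModel):
    type: str
    message: str


class ModelB(BaseModel):
    type: Optional[str] = None
    message: str


resolver = Resolver()
resolver.add_model(ModelA, required={'type', 'message'})
resolver.add_model(ModelB, required={'message'})

# no 'type' key, so ModelA is discarded without running full validation
print(type(resolver.validate({'message': 'hi'})).__name__)  # ModelB
```

The pre-check only ever skips models that full validation would also reject, so it changes performance, not the result.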

I have to check quite large and nested json payloads for me it seems quite inefficient

You can do the deserializing once, regardless of how you later try to validate the data.

If you really care about performance, you could do something like

model_lookup = {'item-a': ConcreteItemA, 'item-b': ConcreteItemB, ...}
data = ujson.loads(raw_data)
if not isinstance(data, dict):
    raise ValueError('not a dict')
try:
    model_cls = model_lookup[data['type']]
except KeyError:
    raise ...

m = model_cls(**data)

The point is that by this stage you're into the specifics of your application, which don't belong in pydantic.
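A runnable version of that lookup-table sketch, using the stdlib json module in place of ujson and the two concrete models from earlier in the thread (the ValueError messages are illustrative choices):

```python
import json

from pydantic import BaseModel


class ConcreteItemA(BaseModel):
    type: str
    a: str


class ConcreteItemB(BaseModel):
    type: str
    b: int


MODEL_LOOKUP = {'item-a': ConcreteItemA, 'item-b': ConcreteItemB}


def parse_item(raw_data: str) -> BaseModel:
    data = json.loads(raw_data)
    if not isinstance(data, dict):
        raise ValueError('not a dict')
    try:
        # KeyError here covers both a missing 'type' key and an unknown value
        model_cls = MODEL_LOOKUP[data['type']]
    except KeyError:
        raise ValueError('unknown or missing type')
    return model_cls(**data)


item = parse_item('{"type": "item-b", "b": 10}')
print(type(item).__name__, item.b)  # ConcreteItemB 10
```

Since the model is chosen by a dict lookup on the type field, each payload is validated against exactly one model, instead of being tried against every member of a Union.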

But the more important property of Resolver - it will be able detect correct model. And, because Resolver will have all models, I can do that in the most efficient way.

I don't see how Resolver can be significantly more efficient than the loop approach above without significantly rewriting pydantic. The loop approach is what we currently do for Union and it works well.

Best suitable (and most specific) model

I think this sounds a bit magic: either data is valid for a model or it's not - some floating measure of "specificity" sounds like unnecessary complexity. If it is needed, again it's probably application specific.

I don't personally think Resolver is much use, but if you wanted a utility function for trying to validate data against multiple models, that could be done as part of #481; it could work on both dataclasses and models, which would be useful.

I think this sounds a bit magic, either data is valid for a model or it's not - some floating measure of "specificity" sounds like an unnecessary complexity

I think it's already working that way. If we have two models that both successfully validate the json, the model which is returned depends on the order in the for loop (whichever model comes first).

What should we do if the data is valid for two models?
For my case (where models are used for routing):

from enum import Enum
from typing import Final

from pydantic import BaseModel

class IssueAction(str, Enum):
    opened = 'opened'

class IssueEvent(BaseModel):
    action: IssueAction
    issue: IssuePayload  # IssuePayload defined elsewhere

class IssueOpened(IssueEvent):
    action: Final[IssueAction] = IssueAction.opened

@webhook.handler(IssueEvent)
async def issue_event(issue: IssueEvent):
    print("[EVENT] Some general event")

@webhook.handler(IssueOpened)
async def issue_opened(event: IssueOpened):
    print("[EVENT] Issue was opened")

For incoming json payload:

{
 "action": "opened",
 "issue": { "id": 5 }
}

it will be valid for both models, and which handler is called depends on the order in which the handlers were registered. That's not what I would want.

The point is that by this stage you're into the specifics of your application, which don't belong in pydantic.

That's the point of my idea: don't tie it to application specifics, and don't write code like this (these checks should already be done by pydantic validation sooner or later):

if data['type'] == "issue":
    if data["status"] == "opened":
        process_opened_issue(data)
    elif data["status"] == "closed":
        process_closed_issue(data)
elif data['type'] == "pull request":
    ...

Because:

  1. this is already validation, so we end up with double validation (partial manual validation, and later complete automatic validation with pydantic)
  2. that validation / routing might be incorrect / incomplete
  3. it's not always so simple to detect the type based on one special field; depending on the protocol it might be a combination of several fields and the presence (or even absence) of specific fields (even nested ones)

If you really care about performance, you could do something like

I'm thinking about a general solution, where you describe the protocol (models) and subscribe to interesting events (a model applicable to the json), and this works without any other partial data parsing or custom routing logic.

I don't personally think Resolver is much use

If implemented for the general case it might be of use for any webhook implementation (and I think this is a common case: any service may want to integrate with a third-party service, or get real-time notifications instead of constant polling). Because almost always you will receive all events (different json types) at a single webhook endpoint.

That's why I propose a Resolver somewhere in pydantic. It's different from the general BaseModel interface (where you know which Model you want to validate the data against): with Resolver you have to determine the model and validate the data against it simultaneously (because determining the model is itself validation). And the for loop approach (as shown above) does not work in that case, when we are dealing with general and more specific inherited models (IssueEvent and IssueOpened).
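One way to make the for loop approach cope with general/specific pairs like IssueEvent and IssueOpened is to try subclasses before their base classes. A sketch under simplifying assumptions (resolve is a hypothetical helper, not a pydantic feature; the issue payload is omitted, and Literal stands in for the Final/const field):

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class IssueEvent(BaseModel):
    action: str


class IssueOpened(IssueEvent):
    # only the exact value 'opened' validates for the specific model
    action: Literal['opened']


def resolve(data, candidates):
    # try the most specific models first: a subclass has a longer MRO
    # than its base class, so sort by MRO length, descending
    error = None
    for model_cls in sorted(candidates, key=lambda m: len(m.__mro__), reverse=True):
        try:
            return model_cls(**data)
        except ValidationError as exc:
            error = exc
    raise error


print(type(resolve({'action': 'opened'}, [IssueEvent, IssueOpened])).__name__)
print(type(resolve({'action': 'closed'}, [IssueEvent, IssueOpened])).__name__)
```

With this ordering, {'action': 'opened'} resolves to IssueOpened even though it is also valid for IssueEvent, while {'action': 'closed'} falls through to the general IssueEvent.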

How about using a custom data type that grabs the type name from values in the validator?

from enum import Enum
from pydantic import BaseModel

class ItemA(BaseModel):
    x: int
    y: int

class ItemB(BaseModel):
    i: str
    j: str

class ItemType(Enum):
    A = 'item-a'
    B = 'item-b'

class PolyItem:
    type_map = {
        ItemType.A: ItemA,
        ItemType.B: ItemB,
    }

    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v, values):
        item_type = values['type']
        ItemModel = cls.type_map[item_type]
        return ItemModel(**v)

class Record(BaseModel):
    type: ItemType
    item: PolyItem

>>> Record(type='item-a', item=dict(x=1, y=1))
Record(type=<ItemType.A: 'item-a'>, item=ItemA(x=1, y=1))

>>> Record(type='item-b', item=dict(x=1, y=1))
ValidationError: 2 validation errors for Record
item -> i
  field required (type=value_error.missing)
item -> j
  field required (type=value_error.missing)

The PolyItem custom data type implicitly requires a type field in values and assumes that each ItemType has a model in the type_map, so the validate method will raise KeyError in those scenarios. It also currently does not validate that v is a dict/mapping.
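Those gaps could be closed with explicit checks. A standalone sketch of the same lookup as a plain function (resolve_item is a hypothetical name), kept outside the custom-type validator protocol so it runs on its own:

```python
from enum import Enum

from pydantic import BaseModel


class ItemA(BaseModel):
    x: int
    y: int


class ItemB(BaseModel):
    i: str
    j: str


class ItemType(Enum):
    A = 'item-a'
    B = 'item-b'


TYPE_MAP = {ItemType.A: ItemA, ItemType.B: ItemB}


def resolve_item(values: dict, v) -> BaseModel:
    """The lookup from PolyItem.validate, with the implicit checks made explicit."""
    if not isinstance(v, dict):
        raise TypeError('item must be a dict/mapping')
    if 'type' not in values:
        raise ValueError("cannot resolve item: 'type' is missing or failed validation")
    try:
        item_model = TYPE_MAP[values['type']]
    except KeyError:
        raise ValueError(f"no model registered for {values['type']!r}")
    return item_model(**v)


print(resolve_item({'type': ItemType.A}, {'x': 1, 'y': 1}))
```

The same checks could be pasted into PolyItem.validate unchanged; as a free function they are simply easier to exercise in isolation.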
