Hi guys! I'd love to use pydantic but I'm finding it hard to understand how I could use polymorphic types. Say that I have these classes:
class BaseItem:
    pass

class ConcreteItemA(BaseItem):
    a: str

class ConcreteItemB(BaseItem):
    b: int
and their corresponding JSON representations, where the type becomes a JSON field:
{
    "type": "item-a",
    "a": "some-string"
},
{
    "type": "item-b",
    "b": 10
}
I'd like to have a model (possibly BaseItem) that is capable of doing this kind of multiplexing, both in serialization and deserialization (i.e. I want to load a ConcreteItem, but I don't know which item until I read the JSON). Just to add more complexity, the hierarchy could be deeper and some items might need self-referencing (i.e. an item that has a List[BaseItem]).
Is there anything built-in in pydantic? Any hint on how this could be achieved?
Thanks!
I'm currently unavailable.
Sorry, that was a bad joke about the issue id.
This is possible, but without knowing all of what you're doing it's hard to give a full solution; still, here's a broad outline:
from typing import Union, List
from pydantic import BaseModel, validator

class ConcreteItemA(BaseModel):
    type: str
    a: str

    @validator('type')
    def check_type(cls, v):
        if v != 'item-a':
            raise ValueError('not item-a')
        return v

class ConcreteItemB(BaseModel):
    type: str
    b: int

    @validator('type')
    def check_type(cls, v):
        if v != 'item-b':
            raise ValueError('not item-b')
        return v

class BaseItem(BaseModel):
    root: List[Union[ConcreteItemA, ConcreteItemB]]
m = BaseItem(root=[
    {
        'type': 'item-a',
        'a': 'some-string'
    },
    {
        'type': 'item-b',
        'b': 10,
    },
])
print(m.root)
print(m.dict())
Gives:
[<ConcreteItemA type='item-a' a='some-string'>, <ConcreteItemB type='item-b' b=10>]
{'root': [{'type': 'item-a', 'a': 'some-string'}, {'type': 'item-b', 'b': 10}]}
There are a couple of warts on this approach I'm afraid:
- the repetitive validators needed to check type; that should be fixed in #469, or you could use a single element enum, but that's just as ugly
- the multiplexing has to happen on the root field, not on the base model itself, eg. you can't do Union[ConcreteItemA, ConcreteItemB].parse_obj(...) or something. I'm afraid I don't know a good way around this except to do something like

error = None
for model_cls in [ConcreteItemA, ConcreteItemB]:
    try:
        return model_cls(**data)
    except ValidationError as e:
        error = e
raise error
Which is effectively what pydantic is doing when it sees root: Union[ConcreteItemA, ConcreteItemB] anyway.
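In case it's useful: for the deeper, self-referencing hierarchies mentioned in the question, here is a minimal sketch of how the same Union approach can recurse (an illustration assuming pydantic v1-style forward references; ItemGroup and Item are hypothetical names, not part of pydantic):

from typing import List, Union
from pydantic import BaseModel

class ConcreteItemA(BaseModel):
    type: str
    a: str

class ItemGroup(BaseModel):
    # a container item holding other items, including nested groups
    type: str
    children: List['Item']  # forward reference, resolved below

Item = Union[ConcreteItemA, ItemGroup]
ItemGroup.update_forward_refs(Item=Item)

g = ItemGroup(type='group', children=[
    {'type': 'item-a', 'a': 'some-string'},
    {'type': 'group', 'children': []},
])

Validating each child simply tries the members of the union in order, the same loop as above.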
Hope that helps, let me know if you need more info.
@samuelcolvin
Right now I'm doing the same loop to identify the best suitable model:

error = None
for model_cls in [ConcreteItemA, ConcreteItemB]:
    try:
        return model_cls(**data)
    except ValidationError as e:
        error = e
raise error
But, as my validation is done in a backend service and I have to check quite large, nested JSON payloads, this seems quite inefficient to me.
What about introducing a new class, say Resolver, with which you can register all available models:

Resolver.add_model(model)

And later you can call Resolver.validate(json) (like Model.parse_obj()), which will:

- return None if it couldn't find any appropriate model (because errors are swallowed in the for loop), or
- raise an exception describing why none of the models were applicable (but that might be difficult to implement).

So, Resolver.validate(json) will never raise an exception.
But the more important property of the Resolver is that it will be able to detect the correct model. And, because the Resolver will have all the models, it can do that in the most efficient way.
For example, Resolver.validate(json) can start by checking unique required fields.
from typing import Optional
from pydantic import BaseModel

class ModelA(BaseModel):
    type: str
    message: str

class ModelB(BaseModel):
    type: Optional[str]
    message: str = None  # required, default None
If the JSON doesn't contain the field type, we can discard ModelA without checking any other fields. The same technique could be applied for const fields.
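A rough sketch of what such a Resolver could look like (hypothetical code, not an existing pydantic API; it relies on the pydantic v1 __fields__ attribute to find required fields):

from pydantic import BaseModel, ValidationError

class Resolver:
    def __init__(self):
        self.models = []

    def add_model(self, model):
        self.models.append(model)

    def validate(self, data):
        for model in self.models:
            required = {name for name, field in model.__fields__.items() if field.required}
            if not required <= data.keys():
                continue  # cheap pre-check: a required field is missing
            try:
                return model(**data)
            except ValidationError:
                continue  # errors are swallowed, as described above
        return None

With the ModelA/ModelB example above, validate({'message': 'hi'}) would skip ModelA without running full validation, because the required type field is missing.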
I have to check quite large, nested JSON payloads, this seems quite inefficient to me
You can do the deserializing once, regardless of how you later try to validate the data.
If you really care about performance, you could do something like
import ujson

model_lookup = {'item-a': ConcreteItemA, 'item-b': ConcreteItemB, ...}

data = ujson.loads(raw_data)
if not isinstance(data, dict):
    raise ValueError('not a dict')
try:
    model_cls = model_lookup[data['type']]
except KeyError:
    raise ...
m = model_cls(**data)
The point is that by this stage you're into the specifics of your application, which don't belong in pydantic.
But the more important property of the Resolver is that it will be able to detect the correct model. And, because the Resolver will have all the models, it can do that in the most efficient way.
I don't see how a Resolver can be significantly more efficient than the loop approach above without significantly rewriting pydantic. The loop approach is what we currently do for Union and it works well.
The best suitable (and most specific) model
I think this sounds a bit magic: either data is valid for a model or it's not; some floating measure of "specificity" sounds like unnecessary complexity. If it is needed, again, it's probably application specific.
I don't personally think Resolver is much use, but if you wanted a utility function for trying to validate data against multiple models, that could be done as part of #481; it could work on both dataclasses and models, which would be useful.
I think this sounds a bit magic: either data is valid for a model or it's not; some floating measure of "specificity" sounds like unnecessary complexity
I think it's already working that way. If we have two models that both successfully validate the JSON, the model that gets returned depends on the order in the for loop (whichever model comes first).
What should happen if the data is valid for two models?
For my case (where models are used for routing):
from enum import Enum
from typing import Final
from pydantic import BaseModel

class IssueAction(str, Enum):
    opened = 'opened'

class IssueEvent(BaseModel):
    action: IssueAction
    issue: IssuePayload  # IssuePayload defined elsewhere in the application

class IssueOpened(IssueEvent):
    action: Final[IssueAction] = IssueAction.opened

@webhook.handler(IssueEvent)
async def issue_event(issue: IssueEvent):
    print("[EVENT] Some general event")

@webhook.handler(IssueOpened)
async def issue_opened(event: IssueOpened):
    print("[EVENT] Issue was opened")
For an incoming JSON payload:

{
    "action": "opened",
    "issue": { "id": 5 }
}
it will be valid for both models, and which handler gets called depends on the order in which the handlers were registered. That's not what I would want.
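One hedged workaround for that ordering problem would be to try the more specific models first, e.g. by sorting the registered models by subclass depth. A hypothetical helper, not a pydantic or webhook-library feature:

def by_specificity(models):
    # deeper subclasses have longer MROs, so they sort first
    return sorted(models, key=lambda m: len(m.__mro__), reverse=True)

# by_specificity([IssueEvent, IssueOpened]) -> [IssueOpened, IssueEvent]

That at least makes the result independent of registration order.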
The point is that by this stage you're into the specifics of your application, which don't belong in pydantic.
That's the point of my idea: don't tie it to application specifics, and don't hand-write checks like the following (these checks should be done in pydantic validation sooner or later anyway):
if data['type'] == "issue":
    if data["status"] == "opened":
        process_opened_issue(data)
    elif data["status"] == "closed":
        process_closed_issue(data)
elif data['type'] == "pull request":
    ...
Because:
If you really care about performance, you could do something like
I'm thinking about a general solution, where you describe the protocol (models) and subscribe to the interesting events (a model applicable to the JSON), and this works without any additional partial data parsing or custom routing logic.
I don't personally think Resolver is much use
If implemented for the general case it might be of use for any webhook implementation (and I think this is a common case where a service wants to integrate with a third-party service; it's also the case if you want real-time notifications instead of constant polling). Because almost always you will receive all events (different JSON types) through a single webhook call.
That's why I propose a Resolver somewhere in pydantic. It's different from the general BaseModel interface (where you know which Model you want to validate the data against), whereas with a Resolver you have to determine the model and validate the data against it simultaneously (because determining the model is itself validation). And the for-loop approach (as shown above) doesn't work in that case, when we are dealing with general and more specific inherited models (IssueEvent and IssueOpened).
How about using a custom data type that grabs the type name from values in the validator?
from enum import Enum
from pydantic import BaseModel

class ItemA(BaseModel):
    x: int
    y: int

class ItemB(BaseModel):
    i: str
    j: str

class ItemType(Enum):
    A = 'item-a'
    B = 'item-b'

class PolyItem:
    type_map = {
        ItemType.A: ItemA,
        ItemType.B: ItemB,
    }

    @classmethod
    def __get_validators__(cls):
        yield cls.validate

    @classmethod
    def validate(cls, v, values):
        # look up the concrete model from the already-validated type field
        item_type = values['type']
        ItemModel = cls.type_map[item_type]
        return ItemModel(**v)

class Record(BaseModel):
    type: ItemType
    item: PolyItem
>>> Record(type='item-a', item=dict(x=1, y=1))
Record(type=<ItemType.A: 'item-a'>, item=ItemA(x=1, y=1))
>>> Record(type='item-b', item=dict(x=1, y=1))
ValidationError: 2 validation errors for Record
item -> i
field required (type=value_error.missing)
item -> j
field required (type=value_error.missing)
The PolyItem custom data type implicitly requires a type field in values and that each ItemType has a model in the type_map, so the validate method will throw a KeyError in those scenarios. It also currently does not validate that v is a dict/mapping.
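A hedged sketch of how the validate method could be hardened against those caveats (same setup as above):

@classmethod
def validate(cls, v, values):
    if not isinstance(v, dict):
        raise TypeError('dict required')
    if 'type' not in values:
        # type itself failed validation, so pydantic dropped it from values
        raise ValueError('type field missing or invalid')
    try:
        ItemModel = cls.type_map[values['type']]
    except KeyError:
        raise ValueError(f'no model registered for {values["type"]}')
    return ItemModel(**v)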