First, thanks for Pydantic. "Just types". It's awesome. :tada: :taco: :cake:
I want to update the schema generation in JSON to implement the JSON Schema spec, which is in the process of becoming an IETF RFC.
But first, I want to discuss what would be the most appropriate way of doing it to keep it aligned with the project's direction.
Sorry for the long post. I want to be as explicit as possible and demonstrate the motivation clearly.
I know about the great work done in https://github.com/samuelcolvin/pydantic/pull/190, which implements the current schema generation in a JSON format. It was first discussed in https://github.com/samuelcolvin/pydantic/issues/129.
The current implementation generates a schema in JSON format, but it is not the same as the one described in the JSON Schema specification.
The changes/additions required are relatively small, as most of the work is already done, but they are spread across the project. The benefits could be quite big for the ecosystem around Pydantic.
First, some motivation for having the "JSON Schema" spec instead of the current schema generation.
There is already considerable support, tooling, and an "ecosystem" for "JSON Schema" (as defined in the spec).
As a simple example that is easy to test, Visual Studio Code has good support for JSON Schema, including auto-completion and type checking. Take the schema generated by the current Pydantic implementation for a small example model:
{
  "type": "object",
  "title": "Main",
  "description": "This is the description of the main model",
  "properties": {
    "foo_bar": {
      "type": "object",
      "title": "FooBar",
      "properties": {
        "count": {
          "type": "int",
          "title": "Count",
          "required": true
        },
        "size": {
          "type": "float",
          "title": "Size",
          "required": false
        }
      },
      "required": true
    },
    "Gender": {
      "type": "int",
      "title": "Gender",
      "required": false,
      "choices": [
        [1, "Male"],
        [2, "Female"],
        [3, "Other"],
        [4, "I'd rather not say"]
      ]
    },
    "snap": {
      "type": "int",
      "title": "The Snap",
      "required": false,
      "default": 42,
      "description": "this is the value of snap"
    }
  }
}
Saving that schema to a file schema.json, and adding a $schema field to the JSON document to declare it as a "JSON Schema", with: "$schema": "http://json-schema.org/draft-07/schema" (VS Code will auto-complete that $schema field declaration too). That field, added there only for this test, tells VS Code that this file is itself a JSON Schema (because the JSON Schema spec itself is declared using JSON Schema).
Then, going to one of the type values that has float, removing it, and hitting Ctrl + Space to trigger the auto-complete shows the types defined by the spec (float is not one of them; the spec uses number instead).
Here is the same schema, updated to conform to the spec:
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "type": "object",
  "title": "Main",
  "description": "This is the description of the main model",
  "required": [
    "foo_bar"
  ],
  "properties": {
    "foo_bar": {
      "type": "object",
      "title": "FooBar",
      "properties": {
        "count": {
          "type": "integer",
          "title": "Count"
        },
        "size": {
          "type": "number",
          "title": "Size"
        }
      },
      "required": [
        "count"
      ]
    },
    "Gender": {
      "type": "string",
      "title": "Gender",
      "enum": [
        "Male",
        "Female",
        "Other",
        "I'd rather not say"
      ]
    },
    "snap": {
      "type": "integer",
      "title": "The Snap",
      "default": 42,
      "description": "this is the value of snap"
    }
  }
}
Saving that updated schema as ./schema2.json, a new JSON document can then reference it with "$schema": "./schema2.json". That defines the new JSON file as using the schema declared in the previous file:
{
  "$schema": "./schema2.json"
}
VS Code is just an example of the current support for JSON Schema (as in the spec) given by tools and the ecosystem. There are many other examples, but this one seemed quite simple and explicit.
Additionally, OpenAPI (previously known as Swagger), which is now part of the Linux Foundation, is based on JSON Schema to some extent.
My ultimate plan is to use Pydantic to generate OpenAPI schemas from Python web APIs. Those schemas can then be used directly with systems like Swagger UI (https://petstore.swagger.io) to create interactive and explorable documentation for APIs, to generate OpenAPI clients in different programming languages, etc.
But all that would come afterwards, as my personal experiments. What I want to add to Pydantic now, which would be the first step for my future plans, is support for JSON Schema (as in the spec).
I see different ways to achieve what I want:

A. Update the current schema generation methods so they produce spec-compliant JSON Schema.
This would probably be the most straightforward way. It would minimize code duplication and confusion, as there are already schema_json methods.
It would be a breaking change for anyone that already relies on the currently generated schema, so I guess a version bump would be required.

B. Add new, separate methods that generate the spec-compliant schema, keeping the current ones.
This wouldn't disrupt implementations based on the current functionality and wouldn't require a version bump.
But it would add quite a bunch of code to maintain: instead of touching the existing methods in several places, it would duplicate all that functionality with small changes.

I could also implement it all as a set of functions, without touching the methods of the original classes. But, for example, I need to access BaseModel.__fields__, which is a private property. The public one I see available is BaseModel.fields, but that's an instance @property, not accessible at the class level, and all the schema generation is done at the class level, not the instance level. The remaining options build on that idea (see the rough sketch after this list):

C. Add small changes so the data is accessible, and process it with additional isolated functions in a separate module that is part of the same package.

D. Implement it as functions that use BaseModel.__fields__ and the related (mostly private) properties directly. It wouldn't be very clean, but it wouldn't touch the current code.

E. Add just the minimum changes needed to access the data, and create a new package, outside of Pydantic, that uses that data to generate JSON Schema.

F. Not change anything inside Pydantic, and just create an external package that accesses private properties in an ugly way...
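To make the function-based options more concrete, here is a rough sketch, assuming BaseModel.__fields__ maps field names to Field objects that expose name, type_, required and default (as the current code suggests); all names and details are illustrative only, not a proposed final API:

```python
# Sketch only: standalone schema functions that read the model's field metadata
# without touching the model classes themselves.
TYPE_MAP = {int: 'integer', float: 'number', str: 'string', bool: 'boolean'}


def field_schema(field):
    # Build the JSON Schema declaration for a single field.
    s = {'title': field.name.title().replace('_', ' ')}
    json_type = TYPE_MAP.get(field.type_)
    if json_type:
        s['type'] = json_type
    if field.default is not None:
        s['default'] = field.default
    return s


def model_schema(model):
    # Build the JSON Schema for a whole model class (not an instance).
    return {
        'title': model.__name__,
        'type': 'object',
        'properties': {
            name: field_schema(field) for name, field in model.__fields__.items()
        },
        'required': [
            name for name, field in model.__fields__.items() if field.required
        ],
    }
```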
I already started with some local tests and tried some of the possible paths above, but I can see that I'll end up touching a lot of code whichever way I implement it, and it might not be the preferred one. So I decided it would be better to ask first what seems most viable for the project and start the discussion here.
What do you think? How should I proceed?
First of all, sorry for the delayed response, I was on holiday, then life got in the way of code.
In general, my answer is very simple: yes. Let's go for it.
Let's do A. Frankly, I'm not that bothered by backwards compatibility on things like this; I'd rather get it right. I think @Gr1N uses schema a fair bit, would this be OK with you?
The reasons I shied away from JSON Schema :tm: are explained in #129; also, implementing a schema like that seemed pretty complicated at the time, and I wanted to start with something simple.
Questions/concerns:
- what happens about errors?
- do pydantic's types (and custom types) map sensibly to JSON Schema types? And what's the workaround when they don't?
- are you proposing to implement this yourself or asking me to?
[...] delayed response [...]
No worries, it happens to me all the time :sweat_smile: I hope you had a nice holiday.
[...] Let's go for it. [...]
Awesome!
Actually, I already started :smile: .
...I have been reading, re-reading, and studying the specifications for JSON Schema and OpenAPI; there are a couple of differences, but they are minimal. My plan is to update the implementation for JSON Schema in a way that is compatible with OpenAPI schemas too, taking the minor differences into account. That way it can be used with JSON Schema directly, but others (me included) can also develop OpenAPI tooling on top of it.
[...] Let's do A [...]
Awesome again!
The reason I shied away from JSON Schema [...]
Yep. I understand.
what happens about errors? [...]
In short: we don't change them.
There's no need to change the errors. JSON Schema actually doesn't stipulate how to return the errors.
OpenAPI allows defining error shapes using the same schema format. It's optional and doesn't force any shape on them.
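As an illustration of that last point, here is a sketch (hypothetical endpoint and schema names, OpenAPI 3 layout) of how an OpenAPI document can optionally describe an error body with the same schema format:

```python
# Sketch only: an OpenAPI "responses" entry whose error body is described by a
# regular schema, referenced from the shared components section.
error_response = {
    '422': {
        'description': 'Validation Error',
        'content': {
            'application/json': {
                'schema': {'$ref': '#/components/schemas/ValidationError'}
            }
        },
    }
}
```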
do pydantic's types (and custom types) map sensibly to JSON Schema types? [...]
Yes, in most cases. Pydantic maps very well from JSON, and serializes to JSON very well too. JSON Schema just declares JSON shapes, so I think it matches Pydantic very well. Especially custom types and sub-types/sub-models (what the docs refer to as "recursive models"): all of those get declared as sub-schemas, very similar in structure to the actual Pydantic-based code.
Limitations: JSON Schema defines shapes for JSON content, and JSON itself has limitations compared to all the Python data structures.
My plan is to update/implement JSON Schema support on top of Pydantic, of course, without imposing any limitation on the rest of Pydantic.
There will be some limitations for those (me) who want to have JSON Schema based on Pydantic, but it won't limit Pydantic's general usage.
Specific examples I've found so far (I don't think there are many more):
In JSON, an object's keys must be strings. So it's not possible to declare Python dicts with non-string keys as JSON, e.g. dicts with numbers, True, or False as keys (other objects or classes wouldn't even make much sense as keys in JSON). This is not a limitation of JSON Schema, but of JSON itself. In that case, we, the developers using Pydantic, should make sure not to declare dicts with keys other than str.
Python Enum would map to enum in JSON Schema, but in JSON Schema enum is a predefined list of allowed values, not a mapping of keys to values. It is a very straightforward, simple change from the currently generated schema.
...in most of the other cases it is just a change of names, e.g. from list to array, from set to array with uniqueItems set to true, etc. (a sketch follows below).
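A sketch of those intended mappings (hypothetical model; the exact generated output depends on the final implementation):

```python
from enum import Enum
from typing import List, Set

from pydantic import BaseModel


class Gender(Enum):
    male = 'Male'
    female = 'Female'
    other = 'Other'
    not_given = "I'd rather not say"


class Example(BaseModel):
    gender: Gender = None
    tags: List[str] = []
    ids: Set[int] = set()

# Roughly expected JSON Schema property declarations:
# "gender": {"enum": ["Male", "Female", "Other", "I'd rather not say"]}
# "tags":   {"type": "array", "items": {"type": "string"}}
# "ids":    {"type": "array", "uniqueItems": true, "items": {"type": "integer"}}
```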
[...] the workaround [...]
The "workaround" is mainly: those of us who want to use Pydantic to generate JSON Schema should not declare models with types that are not compatible (e.g. dicts with keys other than str). It doesn't affect any other use or functionality of Pydantic.
are you proposing to implement this yourself or asking me to?
Yes! Of course, I'll do it. I already started :nerd_face: ...and this is what I have planned for this whole weekend, at least to start :joy: .
Here are a couple things I have in mind and some questions:
While generating one schema, I also want to define sub-schemas as top-level schemas with references. The end result for JSON Schema consumers wouldn't change, but it would help, for example, automatic code generation tools re-use the code generated for one schema by referencing it from the others, avoiding code and schema duplication.
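For example (a sketch with a hypothetical layout; the exact key names, e.g. definitions, are up for discussion), the FooBar sub-model above could be declared once and referenced, instead of being inlined at every usage site:

```python
# Sketch only: sub-models declared as top-level sub-schemas and referenced
# with "$ref" from the properties that use them.
schema = {
    'title': 'Main',
    'type': 'object',
    'properties': {
        'foo_bar': {'$ref': '#/definitions/FooBar'},
    },
    'definitions': {
        'FooBar': {
            'title': 'FooBar',
            'type': 'object',
            'properties': {
                'count': {'title': 'Count', 'type': 'integer'},
                'size': {'title': 'Size', 'type': 'number'},
            },
            'required': ['count'],
        },
    },
}
```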
I had already started creating a set of functions to do the work, based on the current methods of the BaseModel and Field classes. Having the schema logic in separate functions in a separate module would also allow using them independently of a specific model class. Specifically, it would allow passing a set of classes and getting a single JSON Schema that defines them as several sub-schemas (which could then be re-used by other tools, like OpenAPI tooling). We could then call these functions from inside the existing methods of Field and BaseModel. What do you think?
Is there any reason why you wanted to have Field._schema as a "protected member" (because of the leading _)? I would like to change it from Field._schema to Field.schema, so I can use it from those functions (described above). What do you think?
Later, probably in a different PR after this one, I would like to extend the Schema class to receive as keyword arguments several (most) of the validation keywords from JSON Schema, and then use them both for the schema declaration and to set the corresponding validations at the same time.
When using constr as the type declaration, PyCharm gets the correct type hint from the function's return value. But VS Code with the Python "language server", or editors based on Jedi (VS Code too, Vim, Atom, Sublime, etc.), don't get the type hint from the function's return type (the str type returned by constr). By setting the equivalent of min_length=2, max_length=10 from constr as parameters of the Schema and declaring the type as plain str, these editors (and PyCharm too) would be able to see that the type is str and provide completion, etc. This is on top of the previous advantage. It would allow, instead of declaring:
short_str: constr(min_length=2, max_length=10) = Schema(None, minLength=2, maxLength=10)
...declaring:
short_str: str = Schema(None, min_length=2, max_length=10)
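For completeness, the JSON Schema fragment such a declaration would be expected to produce (a sketch; exact keys and title handling would depend on the implementation):

```python
# Roughly expected output for the short_str property (sketch only):
short_str_schema = {'type': 'string', 'minLength': 2, 'maxLength': 10}
```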
What do you think? Would it seem acceptable? ...although this is probably a question for a subsequent PR.
I'm using schema only for documentation, so I would be happy for any improvements on this!
While generating one schema, I want to also define sub-schemas as top-level schemas with references.
Not sure what you mean by this. Perhaps easiest to submit a PR and we can discuss from there.
I had already started creating a set of functions to do the work...
Yes, moving as much schema logic as possible to a separate module would be good.
Is there any reason why you wanted to have Field._schema as a "protected member"...
Because I assume that in general users wouldn't want to access _schema directly. Much of Python (e.g. _asdict on named tuples) uses _... as more of a warning than a hard barrier to external access. I don't mind changing it though.
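For reference, the namedtuple precedent mentioned above:

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])

# _asdict() is documented, supported API; the leading underscore mainly avoids
# clashes with user-defined field names rather than marking it as private.
print(Point(1, 2)._asdict())  # {'x': 1, 'y': 2} (an OrderedDict on older Pythons)
```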
Schema(None, minLength=2, maxLength=10)
I'm happy with that as long as constr(min_length=2, max_length=10) continues to work too.
[...] submit a PR and we can discuss from there.
Sure, I'm on it.
[...] moving as much schema logic to a separate module [...]
Cool.
[...] general users wouldn't want to access _schema directly. [...]
[...] _asdict on named tuples [...]
Good point, thanks for the reference to named tuples. I'll just leave it as is and ignore Pylint's complaints about it.
I'm happy with that as long as constr(min_length=2, max_length=10) continues to work too.
Awesome. I'll do that next.
I just submitted the first version of the PR. It's basically complete, but I would like you @samuelcolvin to check it and confirm that the changes seem acceptable. After that, I'll re-check and possibly augment the tests and update the docs.
Should we continue the conversation in the PR?
If @Gr1N has any feedback too, it would be more than welcome.
I'll close this issue as the PR directly related to it is already merged into master. :tada:
thank you @tiangolo