For Jupyter-based frontends, we would love to see a common tabular format in JSON that we can render (in addition to or in lieu of the current HTML). This would give us the flexibility to style and format according to data type, as well as better hooks for theming tabular data on frontends. Everyone has an opinion; let's give them the flexibility to apply it.
It's important to us to support a common JSON format so that R, Julia, and other languages can also display their DataFrames with similar formatting and styling out of the box.
The best one I've seen so far, with a great amount of discussion and collaboration, is the JSON Table Schema.
Update: In order to include both data + schema, we're using data resource, which has media type application/vnd.dataresource+json.
/cc @captainsafia @ellisonbg @jreback @TomAugspurger
xref #9146, #9166
I'll dig into the schema later, but just to make sure: the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?
More concretely, what changes do we need to make to Series / DataFrames / Indexes to support this? IIRC there isn't a _repr_json_ equivalent of _repr_html_.
Interesting - I just noticed they wrote a wrapper for pandas: https://github.com/frictionlessdata/jsontableschema-pandas-py
On the JupyterLab, notebook, and nteract side, we'd have https://github.com/frictionlessdata/jsontableschema-js to lean on.
the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?
Yes. The media type (mime type in Jupyter parlance) would be something like application/vnd.table-schema.v1+json.
While there's not a repr for arbitrary media types in IPython (we can evolve that as a result of this discussion), there is a way to display raw messages with IPython.display.display:
import IPython.display

IPython.display.display({
    'application/json': releases
}, raw=True)
Which shows up in nteract as:
Hi. I'm one of the authors of JSON Table Schema, and also part of the team working on reference implementations for this and the related family of specs. The JavaScript implementation is just a little behind the Python one, and probably also of relevance here.
Happy to help.
_edit_: added link to the JavaScript implementation, in addition to the Python one previously linked.
By the way, on the nteract and jupyterlab side, it's pretty easy for us to iterate with new renderers and media types.
I don't really see a reason not to add this in pandas; the additional code shouldn't be too much of a burden.
Would clients expect to receive the entire DataFrame, and do their own truncation? I worry a bit about the overhead of publishing huge DataFrames. I would say follow the options in pd.options.display.max_rows, etc. and only ship over some of the DataFrame (but need some way of saying that there's more...)
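A rough sketch of that truncation idea, using plain Python records rather than a real DataFrame. The `truncated` and `total_rows` keys are hypothetical names invented for this illustration; they are not part of JSON Table Schema or of pandas. The default of 60 mirrors the default value of pd.options.display.max_rows.

```python
import json


def truncate_records(records, max_rows=60):
    """Keep at most max_rows records and report whether rows were dropped."""
    total = len(records)
    shown = records[:max_rows]
    return {
        "data": shown,
        # Hypothetical keys (assumption): one possible way of telling the
        # frontend that there is more data than was shipped.
        "total_rows": total,
        "truncated": total > len(shown),
    }


records = [{"a": i, "b": i * 2} for i in range(100)]
payload = truncate_records(records, max_rows=5)

print(json.dumps(payload["data"]))
print(payload["truncated"], payload["total_rows"])
```

This keeps only the head of the table; a version closer to the HTML repr might keep both head and tail rows.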
A few things directly related to the spec that pandas might have trouble with:
_metadata should carry this, but IIRC we don't have a good story on propagating that through operations, so it's liable to be dropped
We are very happy to make any changes needed to https://github.com/frictionlessdata/jsontableschema-pandas-py in order to support this smoothly, and especially in reference to things like streaming data out of a DataFrame, or limiting the rows from a frame for preview, and so forth.
Started on this here: https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:json-schema (very early, one test, no docs :D)
Some design things I'd like to nail down before submitting a PR:
{'schema': schema, 'data': data} where schema is a valid JSON table schema and data is like pd.DataFrame.to_json(orient='records')
{
  "data": "[{\"a\":3,\"b\":3},{\"a\":0,\"b\":2},{\"a\":1,\"b\":1},{\"a\":3,\"b\":0},{\"a\":3,\"b\":1}]",
  "schema": {
    "fields": [
      {
        "type": "integer",
        "name": "a"
      },
      {
        "type": "integer",
        "name": "b"
      }
    ]
  }
}
Does that sound right?
pd.options.display.max_rows and only send that many rows; will need to think about if people have set their display.large_repr to be info...
Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?
@TomAugspurger
don't put this in core/generic.py (the actual table creation), instead pandas.formats.json might be appropriate (but make it clear this is an export only format).
so we already have all of the accessors, you can simply use your translation function.
In [5]: from pandas.types.common import is_integer_dtype, is_timedelta64_dtype, is_string_dtype
In [6]: is_integer_dtype(np.float)
Out[6]: False
In [7]: is_integer_dtype(np.integer)
Out[7]: True
In [8]: is_integer_dtype(np.dtype('m8[ns]'))
Out[8]: False
In [9]: is_timedelta64_dtype(np.dtype('m8[ns]'))
Out[9]: True
In [10]: is_string_dtype(np.dtype('O'))
Out[10]: True
In [11]: is_string_dtype(pandas.types.dtypes.CategoricalDtype())
Out[11]: True
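A minimal sketch of what such a translation function could look like. To stay free of dependencies it keys on NumPy's one-letter dtype.kind codes instead of calling the is_*_dtype helpers shown above; the mapping table and the function names are assumptions made for this illustration, not pandas API.

```python
# NumPy dtype.kind code -> JSON Table Schema field type.
# The mapping itself is an assumption for illustration.
KIND_TO_TABLE_SCHEMA = {
    "b": "boolean",   # bool
    "i": "integer",   # signed integers
    "u": "integer",   # unsigned integers
    "f": "number",    # floats
    "M": "datetime",  # datetime64
    "m": "duration",  # timedelta64
    "O": "string",    # object columns, treated as strings here
    "U": "string",    # fixed-width unicode
}


def table_schema_type(kind):
    """Translate a dtype kind code, falling back to 'any' when unknown."""
    return KIND_TO_TABLE_SCHEMA.get(kind, "any")


def build_fields(columns):
    """Build the 'fields' list from (column name, dtype kind) pairs."""
    return [
        {"name": name, "type": table_schema_type(kind)}
        for name, kind in columns
    ]


fields = build_fields([("a", "i"), ("b", "f"), ("when", "M"), ("label", "O")])
print({"fields": fields})
```

With a real DataFrame, the (name, kind) pairs could be taken from each column's dtype.kind; extension dtypes such as categoricals would need their own branch.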
Does the data field have to be double encoded? We can handle raw JSON across the Jupyter messaging spec.
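To make the difference concrete, here is a small sketch using only the standard library. It compares a payload whose data field is a JSON string (double encoded, as in the example above) with one whose data field is a plain nested list. The variable names are made up for the illustration.

```python
import json

records = [{"a": 3, "b": 3}, {"a": 0, "b": 2}]
schema = {
    "fields": [
        {"name": "a", "type": "integer"},
        {"name": "b", "type": "integer"},
    ]
}

# Double encoded: "data" is itself a JSON string inside the payload.
double_encoded = {"schema": schema, "data": json.dumps(records)}
# Nested: "data" is a plain list and is serialized once with the payload.
nested = {"schema": schema, "data": records}

wire_double = json.dumps(double_encoded)
wire_nested = json.dumps(nested)

# The double-encoded form needs two parses on the receiving side.
rows_double = json.loads(json.loads(wire_double)["data"])
# The nested form needs one.
rows_nested = json.loads(wire_nested)["data"]

print(rows_double == rows_nested)
```

The nested form lets a frontend read the rows straight from the parsed message without a second parsing step.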
Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?
_repr_json_ will tell the frontend to render application/json, which in nteract and in the soon to be released notebook provides a tree view of a JSON structure:
I'd like to see this table get published with a custom mimetype. To demonstrate, I took the liberty of taking parts of your function, a fake mimetype (not sure what the official is), and creating a little React component (style would get better after):
The mimetype I used is application/vnd.tableschema.v1+json and I published it via IPython.display rather than a repr function since we don't have a precedent for this table type yet.
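Roughly, publishing under a custom mimetype can be sketched as below. The media type is the placeholder named in this comment, not an official one, and build_bundle and publish are hypothetical helpers written for this illustration. The only IPython call used is IPython.display.display with raw=True, as in the earlier snippet.

```python
# Placeholder media type from this thread (assumption, not an official name).
MEDIA_TYPE = "application/vnd.tableschema.v1+json"


def build_bundle(schema, records):
    """Build a mime bundle: custom table payload plus a plain-text fallback."""
    return {
        MEDIA_TYPE: {"schema": schema, "data": records},
        # Fallback for frontends that have no renderer for MEDIA_TYPE.
        "text/plain": "<table: %d rows, %d fields>"
        % (len(records), len(schema["fields"])),
    }


def publish(bundle):
    """Send the bundle to the frontend; return False if IPython is missing."""
    try:
        from IPython.display import display
    except ImportError:
        return False
    # raw=True tells IPython the dict is already keyed by mime type.
    display(bundle, raw=True)
    return True


schema = {"fields": [{"name": "a", "type": "integer"}]}
bundle = build_bundle(schema, [{"a": 3}, {"a": 0}])
publish(bundle)
```

Including a text/plain entry means frontends without a renderer for the custom type still have something to show.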
/cc @minrk @takluyver
Hi @rgbkrk
Addressing some points above and raised in our Gitter channel
(I'm one of the authors of JSON Table Schema and related specs)
application/tableschema+json
I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically.
We do actually look for _repr_json_:
https://github.com/ipython/ipython/blob/5.1.0/IPython/core/formatters.py#L782
We currently only support a single method name:mime-type mapping. This doesn't extend to custom mime-types, though the protocol allows it. I've been planning to add a _repr_mime_, where the method returns the mime-keyed dict(s), but haven't gotten to it. I thought I opened an issue for it years ago, but maybe only in my brain. I just opened https://github.com/ipython/ipython/issues/10090 for this.
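As a sketch of what such a hook might look like on an object: the Table class below is a toy stand-in for a DataFrame, the media type is the placeholder used earlier in the thread, and _repr_mime_ is only the name proposed here. Later IPython releases provide a hook of this kind under the name _repr_mimebundle_, so the sketch aliases it.

```python
class Table:
    """Toy container standing in for a DataFrame (illustration only)."""

    # Placeholder media type from this thread (assumption).
    MEDIA_TYPE = "application/vnd.tableschema.v1+json"

    def __init__(self, schema, records):
        self.schema = schema
        self.records = records

    def _repr_mime_(self):
        """Hook name as proposed in this thread: return a mime-keyed dict."""
        return {
            self.MEDIA_TYPE: {"schema": self.schema, "data": self.records},
            "text/plain": repr(self.records),
        }

    def _repr_mimebundle_(self, include=None, exclude=None):
        """Name used by later IPython releases; same payload."""
        return self._repr_mime_()


table = Table(
    schema={"fields": [{"name": "a", "type": "integer"}]},
    records=[{"a": 3}, {"a": 0}],
)
bundle = table._repr_mime_()
print(sorted(bundle))
```

The appeal of a single mime-bundle method is that one object can offer several representations at once and let each frontend pick the richest one it can render.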
I did open a similarly worded issue in https://github.com/ipython/ipython/issues/10058. :wink: Either way, I would love to have the ability to return mime bundles for a repr.
@TomAugspurger I think this will close #9166 if you make build_table_schema accessible, e.g. pandas.io.json.table.build_schema, certainly not publicly broadcast, but accessible
Closed by #14904