Pandas: Support exporting to JSON Table Schema

Created on 10 Oct 2016 · 18 Comments · Source: pandas-dev/pandas

For Jupyter based frontends, we would love to see a common tabular format in JSON that we can render (in addition to or in lieu of the current HTML). This would provide us the flexibility to style and format according to data type, as well as have better hooks for theming of tabular data on frontends. Everyone has an opinion, let's give them flexibility to apply it.

It's important to us to support a common JSON format so that R, Julia, and other languages can also display their DataFrames with similar formatting and styling out of the box.

The best one I've seen so far, with a great amount of discussion and collaboration, is the JSON Table Schema.

Update: In order to include both data + schema, we're using the data resource format, which has the media type application/vnd.dataresource+json.

/cc @captainsafia @ellisonbg @jreback @TomAugspurger

Enhancement IO JSON

All 18 comments

xref #9146, #9166

I'll dig into the schema later, but just to make sure: the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?
More concretely, what changes do we need to make to Series / DataFrames / Indexes to support this? IIRC there isn't a _repr_json_ equivalent of _repr_html_.

Interesting - I just noticed they wrote a wrapper for pandas: https://github.com/frictionlessdata/jsontableschema-pandas-py

On the JupyterLab, notebook, and nteract side, we'd have https://github.com/frictionlessdata/jsontableschema-js to lean on.

the basic idea is for pandas to publish multiple outputs (application/html, application/json) wherever we publish just the HTML right now?

Yes. The media type (mime type in Jupyter parlance) would be something like application/vnd.table-schema.v1+json.

While there's not a repr for arbitrary media types in IPython (we can evolve that as a result of this discussion), there is a way to display raw messages with IPython.display.display:

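import IPython.display

# `releases` is assumed to be a JSON-serializable object built earlier (not shown)
IPython.display.display({
    'application/json': releases
}, raw=True)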

Which shows up in nteract as:

[screenshot, 2016-10-10 10:39 am]

Hi. I'm one of the authors of JSON Table Schema, and also part of the team working on reference implementations for this and the related family of specs. The JavaScript implementation is just a little behind the Python one, and probably also of relevance here.

Happy to help.

_edit_: added link to the JavaScript implementation, in addition to the Python one previously linked.

By the way, on the nteract and jupyterlab side, it's pretty easy for us to iterate with new renderers and media types.

I don't really see a reason not to add this in pandas; the additional code shouldn't be too much of a burden.

Would clients expect to receive the entire DataFrame, and do their own truncation? I worry a bit about the overhead of publishing huge DataFrames. I would say follow the options in pd.options.display.max_rows, etc. and only ship over some of the DataFrame (but need some way of saying that there's more...)
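A minimal sketch of the truncation idea, assuming a hypothetical helper named truncate_for_display (not part of pandas). It returns the preview together with a flag, which is one possible way of saying that there's more:

```python
import pandas as pd


def truncate_for_display(df, max_rows=None):
    """Return (preview, truncated); hypothetical helper, not a pandas API."""
    if max_rows is None:
        # Fall back to the user's display setting
        max_rows = pd.get_option("display.max_rows")
    # display.max_rows may itself be None, which means "no limit"
    if max_rows is None or len(df) <= max_rows:
        return df, False
    return df.head(max_rows), True


df = pd.DataFrame({"a": range(100)})
preview, truncated = truncate_for_display(df, max_rows=10)
small_preview, small_truncated = truncate_for_display(df.head(3), max_rows=10)
```

How the flag would travel to the frontend (an extra key in the message, for instance) is left open here.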

A few things directly related to the spec that pandas might have trouble with:

  • field descriptors: in principle _metadata should carry this, but IIRC we don't have a good story on propagating that through operations, so it's liable to be dropped
  • field types: shouldn't have any problems here
  • primary key: Typically this would be the (multi)Index, but we don't require uniqueness on that.
  • field names: Somewhat rare, but we can have MultiIndexes in the columns, so we could have "multiple rows" of field names; these can be collapsed down to tuples.
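To illustrate the last two points, here is a small sketch with made-up data: MultiIndex columns collapse to a list of tuples, and a non-unique index cannot serve as a primary key as-is.

```python
import pandas as pd

# Two-level column labels and a deliberately duplicated index value
columns = pd.MultiIndex.from_tuples([("x", "a"), ("x", "b")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=columns, index=[0, 0])

# Each column label becomes one tuple, one element per level
field_names = df.columns.tolist()

# A primary key must be unique; pandas does not enforce that on the index
index_is_valid_key = df.index.is_unique
```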

We are very happy to make any changes needed to https://github.com/frictionlessdata/jsontableschema-pandas-py in order to support this smoothly, and especially in reference to things like streaming data out of a DataFrame, or limiting the rows from a frame for preview, and so forth.

Started on this here: https://github.com/pandas-dev/pandas/compare/master...TomAugspurger:json-schema very early, one test, no docs :D

Some design things I'd like to nail down before submitting a PR:

  • The actual message published to the jupyter channel will be
{'schema': schema, 'data': data}

where schema is a valid JSON table schema and data is like pd.DataFrame.to_json(orient='records')

{
  "data": "[{\"a\":3,\"b\":3},{\"a\":0,\"b\":2},{\"a\":1,\"b\":1},{\"a\":3,\"b\":0},{\"a\":3,\"b\":1}]",
  "schema": {
    "fields": [
      {
        "type": "integer",
        "name": "a"
      },
      {
        "type": "integer",
        "name": "b"
      }
    ]
  }
}

Does that sound right?
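A sketch of how the message above could be assembled, assuming a later pandas release that ships pandas.io.json.build_table_schema (it did not exist when this comment was written). The data value is kept as the string that to_json returns, matching the example above:

```python
import pandas as pd
from pandas.io.json import build_table_schema

df = pd.DataFrame({"a": [3, 0, 1, 3, 3], "b": [3, 2, 1, 0, 1]})

# index=False leaves the index out of the schema;
# version=False omits the pandas_version key
schema = build_table_schema(df, index=False, version=False)

message = {
    "schema": schema,
    # to_json returns a str, so this value is a JSON string inside the message
    "data": df.to_json(orient="records"),
}
```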

  • Truncation: I think we'll follow pd.options.display.max_rows and only send that many rows; will need to think about what to do if people have set their display.large_repr to be info...
  • Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?
  • @jreback for all this stuff I'm doing here: do we already have a simpler way of going from a dtype to a "base" type? I don't want to have to worry about int16 vs int32, etc.
  • Speaking of types, pandas doesn't have a string type, so right now we send those over as "any". :( Do we want to do a bit of inference to maybe send those as strings, or leave that to the client? pandas 2 will have a string type, but that'll be a bit.
  • Indexes: When should we send them?

    1. Always

    2. When any (or all) of the levels are named
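Option 2 could be sketched roughly like this; index_is_named is a hypothetical helper name, and the "any" variant is shown (swap in all() for the stricter reading):

```python
import pandas as pd


def index_is_named(df):
    """True if at least one index level has a name (hypothetical helper)."""
    # .names works for both a flat Index and a MultiIndex
    return any(name is not None for name in df.index.names)


unnamed = pd.DataFrame({"a": [1, 2]})
named = pd.DataFrame({"a": [1, 2]}, index=pd.Index([10, 20], name="id"))

send_unnamed = index_is_named(unnamed)
send_named = index_is_named(named)
```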

@TomAugspurger

don't put this in core/generic.py (the actual table creation), instead pandas.formats.json might be appropriate (but make it clear this is an export only format).

so we already have all of the accessors, you can simply use your translation function.

In [5]: from pandas.types.common import is_integer_dtype, is_timedelta64_dtype, is_string_dtype

In [6]: is_integer_dtype(np.float)
Out[6]: False

In [7]: is_integer_dtype(np.integer)
Out[7]: True

In [8]: is_integer_dtype(np.dtype('m8[ns]'))
Out[8]: False

In [9]: is_timedelta64_dtype(np.dtype('m8[ns]'))
Out[9]: True

In [10]: is_string_dtype(np.dtype('O'))
Out[10]: True

In [11]: is_string_dtype(pandas.types.dtypes.CategoricalDtype())
Out[11]: True
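Building on those accessors, a translation function might look like the sketch below. It assumes the public pandas.api.types module from later releases rather than the internal pandas.types.common path used in the session above, and base_type is a hypothetical name. Object columns fall through to "any", as discussed earlier.

```python
import numpy as np
from pandas.api.types import (
    is_bool_dtype,
    is_datetime64_any_dtype,
    is_float_dtype,
    is_integer_dtype,
    is_timedelta64_dtype,
)


def base_type(dtype):
    """Map a dtype to a Table Schema type name (hypothetical helper)."""
    if is_bool_dtype(dtype):
        return "boolean"
    if is_integer_dtype(dtype):
        return "integer"  # int16, int32, int64, ... all collapse here
    if is_float_dtype(dtype):
        return "number"
    if is_datetime64_any_dtype(dtype):
        return "datetime"
    if is_timedelta64_dtype(dtype):
        return "duration"
    return "any"  # object columns and anything unrecognized


int_types = {base_type(np.dtype(t)) for t in ("int16", "int32", "int64")}
```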

Does the data field have to be double encoded? We can handle raw JSON across the jupyter messaging spec.
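For illustration, the difference between the double-encoded form and raw JSON, using only the standard library and a shortened copy of the example data:

```python
import json

# Double encoded: "data" is a JSON string nested inside the message
double_encoded = {
    "data": "[{\"a\":3,\"b\":3},{\"a\":0,\"b\":2}]",
    "schema": {"fields": [{"type": "integer", "name": "a"},
                          {"type": "integer", "name": "b"}]},
}

# Raw: decode the inner string once so "data" is a plain list of records
raw = dict(double_encoded, data=json.loads(double_encoded["data"]))

# The whole message still serializes in one pass, and a client
# can read the records without a second decode step
round_tripped = json.loads(json.dumps(raw))
```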

Name: I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically. We'll have to publish this ourselves, and we can choose the name?

_repr_json_ will tell the frontend to render application/json which in nteract and in the soon to be released notebook provides a tree view of a JSON structure:

[screenshot, 2016-12-12 5:31 pm]

I'd like to see this table get published with a custom mimetype. To demonstrate, I took the liberty of taking parts of your function, a fake mimetype (not sure what the official is), and creating a little React component (style would get better after):

[screenshot, 2016-12-12 5:29 pm]

The mimetype I used is application/vnd.tableschema.v1+json and I published it via IPython.display rather than a repr function since we don't have a precedent for this table type yet.

/cc @minrk @takluyver

Hi @rgbkrk

Addressing some points above and raised in our Gitter channel

(I'm one of the authors of JSON Table Schema and related specs)

  1. Mime types: See my notes in here. I'm working on this right now (meaning, making the submission for the new mime types today). We'll be submitting application/tableschema+json
  2. jsontableschema-js is npm installable, has feature parity with jsontableschema-py
  3. Just FYI, I'm currently on a sprint to close a range of issues and publish v1 of all our specs before end of year, and IETF RFC submissions follow immediately. There are other aspects there that are relevant here (e.g.: "Tabular Data Resource" specification), but I can go over them with you (if you like) after we release v1

I've called it _repr_json_ for now, thoughts on what that should be? IIUC this won't be special like _repr_html_ is and called automatically.

We do actually look for _repr_json_:

https://github.com/ipython/ipython/blob/5.1.0/IPython/core/formatters.py#L782

We currently only support single method name:mime-type mapping. This doesn't extend to custom mime-types, though the protocol allows it. I've been planning to add a _repr_mime_, where the method returns the mime-keyed dict(s), but haven't gotten to it. I thought I opened an issue for it years ago, but maybe only in my brain. I just opened https://github.com/ipython/ipython/issues/10090 for this.

I did open a similarly worded issue in https://github.com/ipython/ipython/issues/10058. :wink: Either way, I would love to have the ability to return mime bundles for a repr.
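For reference, later IPython releases added a _repr_mimebundle_ hook along these lines. The sketch below assumes such a release; TablePreview is a hypothetical class, and the media type is the one from the update at the top of this issue.

```python
class TablePreview:
    """Hypothetical wrapper that returns a mime bundle from its repr hook."""

    MIME = "application/vnd.dataresource+json"

    def __init__(self, schema, data):
        self.schema = schema
        self.data = data

    def _repr_mimebundle_(self, include=None, exclude=None):
        # One dict keyed by mime type; the frontend picks what it can render
        return {
            self.MIME: {"schema": self.schema, "data": self.data},
            "text/plain": repr(self.data),  # fallback for plain consoles
        }


preview = TablePreview(
    schema={"fields": [{"name": "a", "type": "integer"}]},
    data=[{"a": 3}, {"a": 0}],
)
bundle = preview._repr_mimebundle_()
```

Outside of IPython the method is just an ordinary method, so the bundle can be inspected directly as shown.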

@TomAugspurger I think this will close #9166 if you make build_table_schema accessible, e.g.

pandas.io.json.table.build_schema, certainly not publicly broadcast, but accessible

Closed by #14904
