Pandas: to_json of df with multi-index header, followed by read_json results in incorrectly formatted column headers

Created on 19 Sep 2013 · 18 comments · Source: pandas-dev/pandas

cc @Komnomnomnom

I'm using a recent Anaconda build on Windows, which includes pandas 0.12.0. I'm attempting to use a multi-index header, write the frame out to a JSON file, and read it back to get the same DataFrame. Instead, the column headers come out stacked vertically.

import numpy as np
import pandas as pd
pd.set_option('html', False)

index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])
df = pd.DataFrame(np.random.randn(10, 3), index=index,
                  columns=['A', 'B', 'C'])
dft = df.transpose()
dft
# raw strings here, otherwise '\x' in the path is an invalid escape sequence
dft.to_json(r'D:\xxmt.json', orient='split')
dftj = pd.read_json(r'D:\xxmt.json', orient='split')
dftj

Labels: Enhancement, IO JSON

All 18 comments

I should also point out that the JSON output omits the names info that would be required to restore the DataFrame completely (names=['foo', 'bar']).

OK, if this is going to be an enhancement request, let me make another suggestion: the stored JSON data should include all the info necessary to restore the same DataFrame, so the reader shouldn't have to specify the orient parameter when reading.

@jnorwood so you are using these for 'storage'? JSON is not traditionally a reflexive serialization format; in other words, it takes a schema to ensure that deserializing a serialization will work (types can be subtly different, ordering is not guaranteed, etc.). CSV is like this too, and SQL is partially like this. HDF5 can be reflexive because it stores the metadata.

The current JSON output does (almost) have a description of a DataFrame's structure, including support for multi-index headers. The current output with orient='split' already implements an implied schema based on the DataFrame structure; it is missing "names", and I think it could also add "orient". Then all that is left is for the reader to read it back and restore the info correctly. In general, if pandas is going to support multi-index headers fully, it needs to provide a way to handle them correctly in whatever format you export to and import from. If you just state what the schema is, users can generate the proper format and still take advantage of the multi-index, which is a nice feature of the DataFrame.

Here is a sample JSON output with orient='split':

{"columns":[["foo","one"],["foo","two"],["foo","three"],["bar","one"],["bar","two"],["baz","two"],["baz","three"],["qux","one"],["qux","two"],["qux","three"]],"index":["A","B","C"],"data":[[1.4363830373,-0.3037434653,-1.0501832047,0.3830191489,-0.6174135405,-1.2378511344,-0.6473196664,1.1436025073,0.2124401475,0.5732940926],[0.6103747676,-0.8948600162,0.5987307019,-0.3098951553,-1.659220121,-1.8047032839,1.006122262,-0.9470006317,-0.2870211491,-1.9018432636],[-0.6036721203,0.7321329545,-0.6730598856,0.0215133965,-0.422750222,-0.6237148299,-0.2726430101,-0.7192759505,-0.3399415151,1.6568583209]]}

@jnorwood FYI to_json doesn't really support MultiIndex or higher-dimensional containers like Panel and NDPanel right now. I'm hoping to get a PR in for 0.13 to improve the JSON docs.

Also, I don't think any metadata (like the orient) should be included to aid deserialisation. I'm thinking of the case where you are deserialising outside of pandas; it just causes confusion.

I think there is more confusion without including the orient option. If someone has generated a JSON document for you to read in pandas, currently you have to guess which orient option was used during generation so you can pass the correct reader options. If the writer includes the orient option in the JSON attributes, the reader can simply retrieve it as it does the other attributes. If some reader doesn't know what the attribute means, it can just ignore it, right?

It isn't clear to me what other reader you are trying to support. I'd guess the most common use case is pandas generating the JSON and then pandas reading it back. My particular use case: I'd like to generate a JSON file or string and have pandas read and display it as a multi-index table in some cases, or as a single-index table in others. Of course we could invent another layer of communication to tell pandas what type of table orientation the JSON file contains, but it seems more natural to just put the info in the JSON file itself.

Really, I think pandas should consistently support all of its features across all the readers and writers, in a straightforward, documented way. Someone has apparently made an attempt to support multi-index columns in the JSON writer, and it seems like a good start. Supporting a simple JSON format that other apps could generate would make it easy to describe a multi-index table, and it's a lot simpler than generating HDF5. I can see the potential importance of both options.

One point implied by @jnorwood stands out: what's the use case for to_json? I'd assume it's less space-efficient than other formats.

Well my main use case is (well, was more than is) interoperability. Specifically with a web frontend, deserialisation in JS, for plotting etc.

@Komnomnomnom progress on this?

@jreback as of right now my priorities for pandas JSON are MultiIndex support and Panel (+ higher-dimensional) support. No work has been done in this direction yet, but I think 0.14 is a good target to aim for.

So this issue to track MultiIndex support and #5428 for tracking Panel support?

(As stated above, I'm not a fan of the idea of adding extra metadata onto the JSON-encoded pandas object. I'd favour a 'clean' representation of the data so it's easily portable. I'd welcome others' opinions though.)

@Komnomnomnom sounds good on the priorities

I also agree with keeping the JSON clean.

I think this issue concerns some sort of auto-sniffing of the orientation; not sure how hard that would be (and it would probably be ambiguous in some cases).

Maybe accept orient='infer' at some point?

I like the idea of orient='infer' on decoding, although yes, it would be ambiguous for the index and columns formats.

In [24]: df
Out[24]: 
              Open    High     Low   Close   Volume  Adj Close
Date                                                          
2010-01-04  626.95  629.51  624.24  626.75  1956200     626.75
2010-01-05  627.18  627.84  621.54  623.99  3004700     623.99
2010-01-06  625.86  625.86  606.36  608.26  3978700     608.26
2010-01-07  609.40  610.00  592.65  594.10  6414300     594.10
2010-01-08  592.00  603.25  589.11  602.02  4724300     602.02
2010-01-11  604.46  604.46  594.04  601.11  7212900     601.11
2010-01-12  597.65  598.16  588.00  590.48  4853300     590.48
2010-01-13  576.49  588.38  573.90  587.09  6496600     587.09
2010-01-14  583.90  594.20  582.81  589.85  4240100     589.85
2010-01-15  593.34  593.56  578.04  580.00  5434500     580.00

[10 rows x 6 columns]

In [19]: df.to_json(orient='columns')
Out[19]: '{"Open":{"1262563200000":626.95,"1262649600000":627.18,"1262736000000":625.86,"1262822400000":609.4,"1262908800000":592.0,"1263168000000":604.46,"1263254400000":597.65,"1263340800000":576.49,"1263427200000":583.9,"1263513600000":593.34},"High":{"1262563200000":629.51,"1262649600000":627.84,"1262736000000":625.86,"1262822400000":610.0,"1262908800000":603.25,"1263168000000":604.46,"1263254400000":598.16,"1263340800000":588.38,"1263427200000":594.2,"1263513600000":593.56},"Low":{"1262563200000":624.24,"1262649600000":621.54,"1262736000000":606.36,"1262822400000":592.65,"1262908800000":589.11,"1263168000000":594.04,"1263254400000":588.0,"1263340800000":573.9,"1263427200000":582.81,"1263513600000":578.04},"Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0},"Volume":{"1262563200000":1956200.0,"1262649600000":3004700.0,"1262736000000":3978700.0,"1262822400000":6414300.0,"1262908800000":4724300.0,"1263168000000":7212900.0,"1263254400000":4853300.0,"1263340800000":6496600.0,"1263427200000":4240100.0,"1263513600000":5434500.0},"Adj Close":{"1262563200000":626.75,"1262649600000":623.99,"1262736000000":608.26,"1262822400000":594.1,"1262908800000":602.02,"1263168000000":601.11,"1263254400000":590.48,"1263340800000":587.09,"1263427200000":589.85,"1263513600000":580.0}}'

In [22]: df.to_json(orient='index')
Out[22]: '{"1262563200000":{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},"1262649600000":{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},"1262736000000":{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},"1262822400000":{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},"1262908800000":{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},"1263168000000":{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},"1263254400000":{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},"1263340800000":{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},"1263427200000":{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},"1263513600000":{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}}'

In [20]: df.to_json(orient='split')
Out[20]: '{"columns":["Open","High","Low","Close","Volume","Adj Close"],"index":[1262563200000,1262649600000,1262736000000,1262822400000,1262908800000,1263168000000,1263254400000,1263340800000,1263427200000,1263513600000],"data":[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]}'

In [21]: df.to_json(orient='records')
Out[21]: '[{"Open":626.95,"High":629.51,"Low":624.24,"Close":626.75,"Volume":1956200.0,"Adj Close":626.75},{"Open":627.18,"High":627.84,"Low":621.54,"Close":623.99,"Volume":3004700.0,"Adj Close":623.99},{"Open":625.86,"High":625.86,"Low":606.36,"Close":608.26,"Volume":3978700.0,"Adj Close":608.26},{"Open":609.4,"High":610.0,"Low":592.65,"Close":594.1,"Volume":6414300.0,"Adj Close":594.1},{"Open":592.0,"High":603.25,"Low":589.11,"Close":602.02,"Volume":4724300.0,"Adj Close":602.02},{"Open":604.46,"High":604.46,"Low":594.04,"Close":601.11,"Volume":7212900.0,"Adj Close":601.11},{"Open":597.65,"High":598.16,"Low":588.0,"Close":590.48,"Volume":4853300.0,"Adj Close":590.48},{"Open":576.49,"High":588.38,"Low":573.9,"Close":587.09,"Volume":6496600.0,"Adj Close":587.09},{"Open":583.9,"High":594.2,"Low":582.81,"Close":589.85,"Volume":4240100.0,"Adj Close":589.85},{"Open":593.34,"High":593.56,"Low":578.04,"Close":580.0,"Volume":5434500.0,"Adj Close":580.0}]'


In [23]: df.to_json(orient='values')
Out[23]: '[[626.95,629.51,624.24,626.75,1956200.0,626.75],[627.18,627.84,621.54,623.99,3004700.0,623.99],[625.86,625.86,606.36,608.26,3978700.0,608.26],[609.4,610.0,592.65,594.1,6414300.0,594.1],[592.0,603.25,589.11,602.02,4724300.0,602.02],[604.46,604.46,594.04,601.11,7212900.0,601.11],[597.65,598.16,588.0,590.48,4853300.0,590.48],[576.49,588.38,573.9,587.09,6496600.0,587.09],[583.9,594.2,582.81,589.85,4240100.0,589.85],[593.34,593.56,578.04,580.0,5434500.0,580.0]]'

Getting index and columns confused would amount to a transpose of the data, so either raise if one of those orients is inferred, or do some fancy thinking: date data usually lives on the index axis, the index is usually larger than the columns, etc. Unfortunately index is the default orientation.
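For what it's worth, a minimal sketch of what such sniffing could look like, keyed purely off the payload shapes shown above (hypothetical helper, not a pandas API; the index/columns ambiguity is left unresolved, as discussed):

import json

def sniff_orient(json_string):
    # Guess the to_json orient from the payload's shape
    obj = json.loads(json_string)
    if isinstance(obj, list):
        if obj and isinstance(obj[0], dict):
            return 'records'
        return 'values'
    if isinstance(obj, dict):
        if {'columns', 'index', 'data'} <= set(obj):
            return 'split'
        # 'index' and 'columns' both produce a dict of dicts, so without
        # extra heuristics the best we can do is flag them as ambiguous
        return 'index or columns'
    raise ValueError('unrecognised payload')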

The last update to this was 2 years ago; I'm assuming it's still open? I have a multi-index situation when I do:
query_df = pd.read_sql_query(sql_text, db_conn, index_col=['user_id', 'mobile_time'])
and then output via:
records_json = query_df.to_json(orient='split')

but then
pd.read_json(records_json, orient='split')
doesn't work
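For reference, a self-contained sketch of that failure, with a set_index call standing in for read_sql_query (the data here is made up; on affected versions the MultiIndex does not survive the round trip):

import pandas as pd

# Hypothetical stand-in for the read_sql_query result
query_df = pd.DataFrame({'user_id': [1, 1, 2],
                         'mobile_time': ['08:00', '09:00', '08:30'],
                         'value': [10, 20, 30]}).set_index(['user_id', 'mobile_time'])

records_json = query_df.to_json(orient='split')
# On affected versions this loses or mangles the MultiIndex
# instead of reproducing query_df
pd.read_json(records_json, orient='split')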

If you want a more round-trippable JSON format, this is in upcoming 0.20.0.: http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#whatsnew-0200-enhancements-table-schema

Or you can just write the JSON with orient='table':

df.to_json(path_or_buf='test.json', orient='table')

Read the MultiIndex JSON back:

pd.read_json('test.json', orient='table')
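For example, a minimal self-contained round trip (a sketch, assuming a pandas version recent enough to both write and read the table schema):

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['foo', 'bar'], ['one', 'two']],
                                   names=['first', 'second'])
df = pd.DataFrame(np.random.randn(4, 2), index=index, columns=['A', 'B'])

df.to_json(path_or_buf='test.json', orient='table')
roundtripped = pd.read_json('test.json', orient='table')

# Index levels and names survive, unlike orient='split'
assert roundtripped.index.equals(df.index)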

Here is a fancy hack if you have a good reason not to use orient='table' (in my case the table output is 3x bigger than orient='split').

Notes:

  • index level names are lost by orient='split', even with this
  • it will probably break as soon as you use non-basic types (think: dates).

import ast

import pandas as pd

def to_json_split_roundtrip(df):
    # Flatten a MultiIndex into the reprs of its tuples so that
    # orient='split' can serialise it as a plain string index
    if isinstance(df.index, pd.MultiIndex):
        values = [repr(v) for v in df.index.ravel()]
        df = df.set_index(pd.Index(values))
    return df.to_json(orient="split")

def from_json_split_roundtrip(j):
    df = pd.read_json(j, orient="split")
    # Heuristic: a stringified tuple marks a flattened MultiIndex
    if isinstance(df.index[0], str) and df.index[0].startswith('('):
        df.index = pd.MultiIndex.from_tuples(ast.literal_eval(t) for t in df.index)
    return df

Example use:

df = pd.DataFrame([['1', 2, 3, 4], ['1', 4, 5, 6], ['7', 4, 9, 10]],
                  columns=["a", "b", "c", "d"]).groupby(["a", "b"]).sum()

display(df)
print(df.index)

as_json = to_json_split_roundtrip(df)
# send it around
r = from_json_split_roundtrip(as_json)
display(r)
print(r.index)


So, a built-in solution (and/or a more efficient orient='table') is still sorely needed.

There seems to be a mix-up between how to_json writes MultiIndex columns and how read_json restores them. As a workaround, you can manually rearrange the MultiIndex column data in the JSON:

>>> import json
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.columns = pd.MultiIndex.from_tuples([('A','B',),('C','D',)])
>>> print(df)
   A  C
   B  D
0  1  3
1  2  4
>>> json_string = df.to_json(orient='split')
>>> print(json_string)
{"columns":[["A","B"],["C","D"]],"index":[0,1],"data":[[1,3],[2,4]]}
>>> wrong_df = pd.read_json(json_string,orient='split')
>>> print(wrong_df)
   A  B
   C  D
0  1  3
1  2  4
>>> correct_columns = pd.MultiIndex.from_tuples([tuple(c) for c in json.loads(json_string)['columns']])
>>> print(correct_columns)
MultiIndex(levels=[['A', 'C'], ['B', 'D']],
           labels=[[0, 1], [0, 1]])
>>> json_dict = json.loads(json_string)
>>> json_dict['columns'] = list(range(len(json_dict['columns'])))

>>> correct_df = pd.read_json(json.dumps(json_dict),orient='split')
>>> correct_df.columns = correct_columns
>>> print(correct_df)
   A  C
   B  D
0  1  3
1  2  4