Pandas: How should I store frames with multiindex columns in CSV?

Created on 19 Jul 2018  Â·  8Comments  Â·  Source: pandas-dev/pandas

Hello.

I tried to save a dataframe with MultiIndex used as columns to a CSV file and load it back, but I had no luck.

# Create a fame with multiindex columns
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4], ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})
# Make sure it was created as wanted.
frame
#   AAPL       MSFT     
#   CLOSE OPEN CLOSE OPEN
# 0     1    1     1    1
# 1     2    2     2    2
# 2     3    3     3    3
# 3     4    4     4    4

# Try to convert the frame to CSV
s1 = frame.to_csv()
s2 = frame.to_csv(tupleize_cols=True)
# FutureWarning displayed - tupleize_cols is deprecated.

print(s1)
# ,AAPL,AAPL,MSFT,MSFT
# ,CLOSE,OPEN,CLOSE,OPEN
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

print(s2)
# ,"('AAPL', 'CLOSE')","('AAPL', 'OPEN')","('MSFT', 'CLOSE')","('MSFT', 'OPEN')"
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4

# Read the CSV strings back to DataFrames
f1 = pd.read_csv(StringIO(s1))
f2 = pd.read_csv(StringIO(s2), tupleize_cols=True)
# Warning about tupleize_cols here

# Both frames does not look like the original one.
f1
#    Unnamed: 0   AAPL AAPL.1   MSFT MSFT.1
# 0         NaN  CLOSE   OPEN  CLOSE   OPEN
# 1         0.0      1      1      1      1
# 2         1.0      2      2      2      2
# 3         2.0      3      3      3      3
# 4         3.0      4      4      4      4

f2
#    Unnamed: 0  ('AAPL', 'CLOSE')  ('AAPL', 'OPEN')  ('MSFT', 'CLOSE')  ('MSFT', 'OPEN')
# 0           0                  1                 1                  1                 1
# 1           1                  2                 2                  2                 2
# 2           2                  3                 3                  3                 3
# 3           3                  4                 4                  4                 4

As you see, both frames don't have multiindexed columns as original one. So, how should I save a DataFrame with multiindexed columns to CSV file and load it back to get a frame same to the original one?

I also tried to save as JSON, but also encountered problems. Here is what the frame shown above is converted to.

frame.to_json()
'{"["AAPL","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["AAPL","OPEN"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","OPEN"]":{"0":1,"1":2,"2":3,"3":4}}'

So, tupleized multiindexed column names are obviously incorrectly quoted.

With best regards,

Alex.

INSTALLED VERSIONS

commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-6-686-pae
machine: i686
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+318.g272bbdc
pytest: 3.6.3
pip: 1.5.6
setuptools: 5.5.1
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

IO Data

Most helpful comment

read_csv can handle hierarchical columns - but they must be specified on the way in with a list to the header arg.

```python
from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)

In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]:
AAPL MSFT
CLOSE OPEN CLOSE OPEN
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
```

All 8 comments

I tried to experiment more with JSON, but found the only way to produce a valid JSON output is to specify orient='split', but it is still read incorrectly.

s = frame.to_json(orient='split')
# s = '{"columns":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"index":[0,1,2,3],"data":[[1,1,1,1],[2,2,2,2],[3,3,3,3],[4,4,4,4]]}'

pd.read_json(StringIO(s), orient='split')
#  AAPL CLOSE
#  AAPL  OPEN
#  MSFT CLOSE
#  MSFT  OPEN
# 0    1     1
# 1    2     2
# 2    3     3
# 3    4     4

# This is still not what I wanted. Here is what this frame uses as columns.
pd.read_json(StringIO(s), orient='split').columns
# MultiIndex(levels=[['AAPL', 'CLOSE'], ['AAPL', 'OPEN'], ['CLOSE', 'MSFT'], ['MSFT', 'OPEN']],
#           labels=[[0, 1], [0, 1], [1, 0], [0, 1]])

# While this is the original one.
frame.columns
# MultiIndex(levels=[['AAPL', 'MSFT'], ['CLOSE', 'OPEN']],
#           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

So far I don't know any way to save a dataframe with multiindexed columns to either CSV or JSON and load it back properly...

I suspect you'll have trouble with this in most storage formats, since hierarchical columns are somewhat unique to pandas. You may be best of manually flattening your columns before and after IO.

The best be here may be orient="table" though I'm not sure what the JSON Table Schema specification says about hierarchical columns (indices are fine). Any investigation or PRs there are certainly welcome

Here's a workaround to maintain those via transposition. You need a non-numeric index to account for the fact that numeric column labels are not valid in the JSON Table schema:

>>> frame.index = list('abcd')
>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T
  AAPL       MSFT      
  OPEN CLOSE OPEN CLOSE
a    1     1    1     1
b    2     2    2     2
c    3     3    3     3
d    4     4    4     4

IIRC JSON Table Schema requires string column labels, so hierarchical
columns are probably out of scope for orient='table'.

On Thu, Jul 19, 2018 at 10:21 AM William Ayd notifications@github.com
wrote:

The best be here may be orient="table" though I'm not sure what the
JSON Table Schema specification says about hierarchical columns (indices
are fine). Any investigation or PRs there are certainly welcome

Here's a workaround to maintain those via transposition. You need a
non-numeric index to account for the fact that numeric column numbers are
not valid in the JSON Table schema:

frame.index = list('abcd')>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T
AAPL MSFT
OPEN CLOSE OPEN CLOSE
a 1 1 1 1
b 2 2 2 2
c 3 3 3 3
d 4 4 4 4

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/pandas-dev/pandas/issues/21976#issuecomment-406314820,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ABQHIssvUIDVN1FULaNUytXbUXsk-HELks5uIKPlgaJpZM4VWM9O
.

read_csv can handle hierarchical columns - but they must be specified on the way in with a list to the header arg.

```python
from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)

In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]:
AAPL MSFT
CLOSE OPEN CLOSE OPEN
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
```

Could we use some trickery like using a \xa0 to denote what “columns
Are actually part of an MI?

Thank you, CSV version works fine, while JSON version still requires frame index to be a string list and crashes if timestamps or integers are used as index.

Additionally, it's looking like even frames with multi-level row indexes are stored in JSON a bit incorrectly if orient='split' used.

In [30]: frame
Out[30]: 
            0  1  2  3
AAPL CLOSE  1  2  3  4
     OPEN   1  2  3  4
MSFT CLOSE  1  2  3  4
     OPEN   1  2  3  4

In [31]: pd.read_json(StringIO(frame.to_json(orient='split')), orient='split')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
...
ValueError: Shape of passed values is (4, 4), indices imply (4, 2)

In [32]: frame.to_json(orient='split')
Out[32]: '{"columns":[0,1,2,3],"index":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"data":[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]}'

Can you open a separate bug for the JSON orient="split" issue? That does seem off.

As a side note on orient="table":

  • Timestamp support is being added in #21827
  • Integers are not valid column labels but should be fine as an index; again if you have an example you can provide please open as a bug in a separate issue
Was this page helpful?
0 / 5 - 0 ratings

Related issues

Ashutosh-Srivastav picture Ashutosh-Srivastav  Â·  3Comments

andreas-thomik picture andreas-thomik  Â·  3Comments

ebran picture ebran  Â·  3Comments

venuktan picture venuktan  Â·  3Comments

nathanielatom picture nathanielatom  Â·  3Comments