Hello.
I tried to save a DataFrame that uses a MultiIndex for its columns to a CSV file and load it back, but had no luck.
import pandas as pd
from io import StringIO

# Create a frame with MultiIndex columns
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4],
                      ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})
# Make sure it was created as intended.
frame
#     AAPL        MSFT
#    CLOSE OPEN  CLOSE OPEN
# 0      1    1      1    1
# 1      2    2      2    2
# 2      3    3      3    3
# 3      4    4      4    4
# Try to convert the frame to CSV
s1 = frame.to_csv()
s2 = frame.to_csv(tupleize_cols=True)
# FutureWarning displayed - tupleize_cols is deprecated.
print(s1)
# ,AAPL,AAPL,MSFT,MSFT
# ,CLOSE,OPEN,CLOSE,OPEN
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4
print(s2)
# ,"('AAPL', 'CLOSE')","('AAPL', 'OPEN')","('MSFT', 'CLOSE')","('MSFT', 'OPEN')"
# 0,1,1,1,1
# 1,2,2,2,2
# 2,3,3,3,3
# 3,4,4,4,4
# Read the CSV strings back to DataFrames
f1 = pd.read_csv(StringIO(s1))
f2 = pd.read_csv(StringIO(s2), tupleize_cols=True)
# FutureWarning about tupleize_cols displayed here as well.
# Neither frame looks like the original one.
f1
# Unnamed: 0 AAPL AAPL.1 MSFT MSFT.1
# 0 NaN CLOSE OPEN CLOSE OPEN
# 1 0.0 1 1 1 1
# 2 1.0 2 2 2 2
# 3 2.0 3 3 3 3
# 4 3.0 4 4 4 4
f2
# Unnamed: 0 ('AAPL', 'CLOSE') ('AAPL', 'OPEN') ('MSFT', 'CLOSE') ('MSFT', 'OPEN')
# 0 0 1 1 1 1
# 1 1 2 2 2 2
# 2 2 3 3 3 3
# 3 3 4 4 4 4
As you can see, neither frame has the MultiIndex columns of the original one. So, how should I save a DataFrame with MultiIndex columns to a CSV file and load it back so that the result is identical to the original frame?
I also tried saving as JSON, but ran into problems there as well. Here is what the frame shown above is converted to:
frame.to_json()
'{"["AAPL","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["AAPL","OPEN"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","CLOSE"]":{"0":1,"1":2,"2":3,"3":4},"["MSFT","OPEN"]":{"0":1,"1":2,"2":3,"3":4}}'
So the tupleized MultiIndex column names are clearly quoted incorrectly, which makes the output invalid JSON.
With best regards,
Alex.
commit: None
python: 3.4.2.final.0
python-bits: 32
OS: Linux
OS-release: 3.16.0-6-686-pae
machine: i686
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0.dev0+318.g272bbdc
pytest: 3.6.3
pip: 1.5.6
setuptools: 5.5.1
Cython: 0.28.4
numpy: 1.14.5
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
I experimented more with JSON and found that the only way to produce valid JSON output is to specify orient='split', but even that is read back incorrectly.
s = frame.to_json(orient='split')
# s = '{"columns":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"index":[0,1,2,3],"data":[[1,1,1,1],[2,2,2,2],[3,3,3,3],[4,4,4,4]]}'
pd.read_json(StringIO(s), orient='split')
#    AAPL  CLOSE
#    AAPL   OPEN
#    MSFT  CLOSE
#    MSFT   OPEN
# 0     1      1
# 1     2      2
# 2     3      3
# 3     4      4
# Note: the four column pairs came back as just two columns, each labelled by a
# four-level tuple, instead of four columns with two levels.
# This is still not what I wanted. Here is what this frame uses as columns.
pd.read_json(StringIO(s), orient='split').columns
# MultiIndex(levels=[['AAPL', 'CLOSE'], ['AAPL', 'OPEN'], ['CLOSE', 'MSFT'], ['MSFT', 'OPEN']],
# labels=[[0, 1], [0, 1], [1, 0], [0, 1]])
# While this is the original one.
frame.columns
# MultiIndex(levels=[['AAPL', 'MSFT'], ['CLOSE', 'OPEN']],
# labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
So far I don't know of any way to save a DataFrame with MultiIndex columns to either CSV or JSON and load it back properly...
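For what it's worth, one manual way around the orient='split' mangling is to parse the payload with the standard json module and re-tupleize the column lists yourself; this is only a sketch of that idea, not a pandas feature, and it reuses the `s` string from above.

```python
import json
import pandas as pd

# Sketch: bypass read_json and rebuild the frame from the 'split' payload,
# turning each two-element column list back into a tuple for the MultiIndex.
payload = json.loads(s)   # s = frame.to_json(orient='split') from above
rebuilt = pd.DataFrame(
    payload['data'],
    index=payload['index'],
    columns=pd.MultiIndex.from_tuples([tuple(c) for c in payload['columns']]),
)
```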
I suspect you'll have trouble with this in most storage formats, since hierarchical columns are somewhat unique to pandas. You may be best off manually flattening your columns before and after IO.
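In case it helps, here is a minimal sketch of that flattening approach for the CSV case; the '|' separator is an arbitrary assumption and only works if it never appears in a real column label.

```python
import pandas as pd
from io import StringIO

# A minimal sketch of the "flatten before IO, rebuild after IO" idea.
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4],
                      ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})

flat = frame.copy()
flat.columns = ['|'.join(col) for col in frame.columns]   # ('AAPL', 'OPEN') -> 'AAPL|OPEN'
csv_text = flat.to_csv()

loaded = pd.read_csv(StringIO(csv_text), index_col=0)
loaded.columns = pd.MultiIndex.from_tuples([tuple(c.split('|')) for c in loaded.columns])
```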
The best bet here may be orient="table", though I'm not sure what the JSON Table Schema specification says about hierarchical columns (indices are fine). Any investigation or PRs there are certainly welcome.
Here's a workaround to maintain those via transposition. You need a non-numeric index to account for the fact that numeric column labels are not valid in the JSON Table schema:
>>> frame.index = list('abcd')
>>> pd.read_json(frame.T.to_json(orient="table"), orient="table").T
  AAPL        MSFT
  OPEN CLOSE  OPEN CLOSE
a    1     1     1     1
b    2     2     2     2
c    3     3     3     3
d    4     4     4     4
IIRC JSON Table Schema requires string column labels, so hierarchical columns are probably out of scope for orient='table'.
read_csv can handle hierarchical columns, but they must be specified on the way in with a list passed to the header arg.
```python
from io import StringIO
buf = StringIO()
frame.to_csv(buf)
buf.seek(0)
In [109]: pd.read_csv(buf, header=[0,1], index_col=0)
Out[109]:
    AAPL        MSFT
   CLOSE OPEN  CLOSE OPEN
0      1    1      1    1
1      2    2      2    2
2      3    3      3    3
3      4    4      4    4
```
Could we use some trickery, like using a \xa0, to denote which columns are actually part of a MultiIndex?
Thank you, the CSV version works fine, while the JSON version still requires the frame index to be a list of strings and crashes if timestamps or integers are used as the index.
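In the meantime, one possible workaround for the string-index requirement is to stringify the index before the transposed orient='table' round-trip and restore it afterwards; this is only a sketch and assumes the labels and row order survive the round-trip.

```python
import pandas as pd

# Sketch: same transposition trick as above, but with an integer index
# stringified first so the labels are valid column names in the table schema.
frame = pd.DataFrame({('AAPL', 'OPEN'): [1, 2, 3, 4], ('AAPL', 'CLOSE'): [1, 2, 3, 4],
                      ('MSFT', 'OPEN'): [1, 2, 3, 4], ('MSFT', 'CLOSE'): [1, 2, 3, 4]})

original_index = frame.index                 # e.g. an integer or datetime index
frame.index = frame.index.map(str)           # string labels are valid in the table schema
restored = pd.read_json(frame.T.to_json(orient='table'), orient='table').T
restored.index = original_index              # put the original index back
```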
Additionally, it looks like even frames with a multi-level row index are stored incorrectly in JSON when orient='split' is used.
In [30]: frame
Out[30]:
            0  1  2  3
AAPL CLOSE  1  2  3  4
     OPEN   1  2  3  4
MSFT CLOSE  1  2  3  4
     OPEN   1  2  3  4
In [31]: pd.read_json(StringIO(frame.to_json(orient='split')), orient='split')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Shape of passed values is (4, 4), indices imply (4, 2)
In [32]: frame.to_json(orient='split')
Out[32]: '{"columns":[0,1,2,3],"index":[["AAPL","CLOSE"],["AAPL","OPEN"],["MSFT","CLOSE"],["MSFT","OPEN"]],"data":[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]}'
Can you open a separate bug for the JSON orient="split" issue? That does seem off.
As a side note on orient="table":