Pandas: BUG: to_excel to throw IndexError for dataframe with empty (row) MultiIndex

Created on 5 Feb 2018 · 4Comments · Source: pandas-dev/pandas

mi = pd.MultiIndex.from_arrays([
        np.array(['a', 'a', 'b', 'b']), 
        np.array(['1', '2', '2', '3']), 
        np.array(['alpha', 'beta', 'alpha', 'beta']),
    ], names=['one', 'two', 'three'])
df = pd.DataFrame(np.random.rand(4, 3), index=mi)
df2 = df.loc[('b', '1', slice(None)), :]
print(df2.index)

Output:

MultiIndex(levels=[['a', 'b'], ['1', '2', '3'], ['alpha', 'beta']],
           labels=[[], [], []],
           names=['one', 'two', 'three'])

The next line causes an IndexError:

df2.to_excel('test.xlsx')

Notice that the 'empty' MultiIndex does not give any indication of its presence when the entire dataframe is printed, looking like an ordinary empty dataframe instead.

Empty DataFrame
Columns: [0, 1, 2]
Index: []

Problem description

I'm not quite sure which behaviour is the problematic one here: the fact that the first block of code creates an 'empty' MultiIndex that has values in levels, or that to_excel doesn't write a DataFrame with such an index to file properly. The 'empty' MultiIndex behaviour is a little bit anomalous by itself, because a similar block of code (below) will cause a KeyError. The KeyError is 'suppressed' by the fact that there is a third level in the MultiIndex.

mi = pd.MultiIndex.from_arrays([
        np.array(['a', 'a', 'b', 'b']), 
        np.array(['1', '2', '2', '3']), 
    ], names=['one', 'two'])
df = pd.DataFrame(np.random.rand(4, 3), index=mi)
df2 = df.loc[('b', '1'), :]

I initially ran into this behaviour (the IndexError) when using xs, but since xs seems to be headed for the deprecation guillotine, I figured I'd rewrite this with loc instead.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-43-Microsoft
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: None
openpyxl: 2.4.9
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Bug Error Reporting IO Excel

Source

lauhayden

👍1

Most helpful comment

OK, I edited the title to reflect this. Probably the most sensible thing to do is to be consistent with empty dataframes in general (writing only the column headers to excel)

PR welcome!

jorisvandenbossche on 6 Feb 2018

👍3

All 4 comments

I'm not quite sure which behaviour is the problematic one here: the fact that the first block of code creates an 'empty' MultiIndex that has values in levels,

This is normal and expected behaviour. When taking a subset of a MultiIndex, the levels are not re-computed but kept intact (the fact that it is empty is expressed in its empty labels).

or that to_excel doesn't write a DataFrame with such an index to file properly.

This could be considered a bug, given that for an empty dataframe without multiindex, this works. Or at least it should give a better error message.

The fact that df.loc[('b', '1', slice(None)), :] does not raise an error while df.loc[('b', '1'), :] does, I am not fully sure about the explanation. But I suppose it is because of having the slice(None) makes this seen as "indexing an MI with multiple indexers (one for each level)", while in the other case you are selecting only single labels (up to a certain level). But I agree this is a confusing case.
If we want to discuss this further, let's open a separate issue for it.

jorisvandenbossche on 6 Feb 2018

Alright, so this issue should be focused on to_excel throwing IndexError. I'll open a separate issue for the MultiIndex slicing behaviour.

lauhayden on 6 Feb 2018

OK, I edited the title to reflect this. Probably the most sensible thing to do is to be consistent with empty dataframes in general (writing only the column headers to excel)

PR welcome!

jorisvandenbossche on 6 Feb 2018

👍3

Hello,
Not a lot of added value ... but the code below helps to easy reproduce this:
pd.DataFrame([[],[],[]]).T.set_index([0, 1]).to_excel('test.xlsx')