Pandas: Groupby + sum by multiple columns on an empty DataFrame drops list of columns

Created on 11 Jan 2017 · 8 Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd

print pd.DataFrame(data=[[1,2,3]], columns=['A', 'B', 'C'])\
    .groupby(['A', 'B'])\
    .sum()\
    .reset_index()\
    .columns\
    .tolist()
# ['A', 'B', 'C']

print pd.DataFrame(data=[], columns=['A', 'B', 'C'])\
    .groupby(['A'])\
    .sum()\
    .reset_index()\
    .columns\
    .tolist()
# ['A', 'B', 'C']

print pd.DataFrame(data=[], columns=['A', 'B', 'C'])\
    .groupby(['A', 'B'])\
    .sum()\
    .reset_index()\
    .columns\
    .tolist()
# ['index']

Problem description

As the original list of columns is lost in the last case (a multi-column groupby on an empty DataFrame), I have to either handle empty DataFrames separately or add the columns back myself, both of which are inconvenient.
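A minimal workaround sketch (my own, assuming the goal is simply to get the original columns back on the empty result): reindex the grouped frame to the source columns after reset_index.

import pandas as pd

df = pd.DataFrame(data=[], columns=['A', 'B', 'C'])
out = df.groupby(['A', 'B']).sum().reset_index()
out = out.reindex(columns=df.columns)  # restores ['A', 'B', 'C'] on the empty result
print(out.columns.tolist())
# ['A', 'B', 'C']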

Expected Output

The list of columns should match the original columns of the DataFrame.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-57-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.2
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.1.4
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None
pandas_datareader: None

Labels: Bug, Groupby, Indexing, Reshaping

All 8 comments

so to simplify

# this is ok
In [18]: pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A','B']).C.sum()
Out[18]: Series([], Name: C, dtype: float64)

# this should be Index(['C'])
In [19]: pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A','B']).sum().columns
Out[19]: Index([], dtype='object')

if you would like to trace this and submit a PR to fix it, that would be great!

As a pointer, this is probably related to our automatically dropping nuisance columns (non-numeric columns, e.g. object dtype) in numeric aggregations. Explicitly setting the dtypes works:

In [74]: pd.DataFrame([], columns=["A", "B", "C"]).astype(np.int64).groupby(['A', 'B']).sum().reset_index().columns.tolist()
Out[74]: ['A', 'B', 'C']

I wonder if this means it's not actually a bug? We are "correctly" dropping an object column after all.
But the original examples 2 and 3 do seem inconsistent.

@TomAugspurger

so this is a special case on sum

In [4]: pd.DataFrame([], columns=["A", "B", "C"]).groupby(['A','B']).mean()
DataError: No numeric types to aggregate

which 'works' on object dtypes, so they are in fact NOT nuisance columns.

The reason for this is the fallback to np.sum, again which 'works' on object dtypes.

So this is 'correct'. Though not helpful.
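A minimal side-by-side of that difference, assuming the pandas versions discussed in this thread (where sum() still falls back to np.sum):

import pandas as pd

df = pd.DataFrame(data=[], columns=['A', 'B', 'C'])

# On an empty frame all three columns are object dtype, so the fast cython
# aggregation has nothing numeric to work with.
print(df.dtypes)

# sum() falls back to applying np.sum per group; with zero groups that fallback
# yields a frame with no value columns, so 'C' silently disappears.
print(df.groupby(['A', 'B']).sum().columns.tolist())  # [] instead of ['C']

# mean() has no object-dtype fallback and raises instead of dropping anything.
try:
    df.groupby(['A', 'B']).mean()
except Exception as exc:
    print('%s: %s' % (type(exc).__name__, exc))  # DataError: No numeric types to aggregate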

Agreed with @jreback. mean() catches GroupByError and re-raises:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1031

sum() does not:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L123

DataError inherits from GroupByError, which in turn inherits from Exception. So sum() invokes the fall-back np.sum() while mean() exits immediately. Should _groupby_function() explicitly catch and re-raise GroupByError as well?
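To make the hierarchy point concrete, a standalone sketch; the classes below only mirror the names of the real pandas exceptions and the try/except shapes described above, they are not the actual pandas source:

# Stand-ins for pandas' exception hierarchy (illustration only).
class GroupByError(Exception):
    pass

class DataError(GroupByError):
    pass

def mean_style():
    # mean(): the specific except clause re-raises, so the caller sees DataError.
    try:
        raise DataError("No numeric types to aggregate")
    except GroupByError:
        raise

def sum_style():
    # sum(): the broad except clause swallows DataError and the fallback runs.
    try:
        raise DataError("No numeric types to aggregate")
    except Exception:
        return "fall back to np.sum"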

@chrisaycock yeah I tried to fix this at some point. The issue is that we rely on np.sum in some cases. So this should be re-engineered a bit.
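For context, one kind of case where that np.sum fallback is relied on (my illustration; behaviour as observed on the versions discussed here): object columns that numpy can still reduce, e.g. strings, which get concatenated per group instead of raising.

import pandas as pd

df = pd.DataFrame({'key': [1, 1, 2], 'val': ['x', 'y', 'z']})  # 'val' is object dtype
# The cython aggregation cannot handle object dtype, so sum() falls back to
# np.sum per group, which concatenates the strings rather than raising.
print(df.groupby('key')['val'].sum())
# key
# 1    xy
# 2     z
# Name: val, dtype: object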

More generally, this is also inconsistent with apply (my pandas version is 0.23.4):

>>> pd.DataFrame(data=[[1,2,3]], columns=['A', 'B', 'C']).groupby(['A', 'B']).apply(lambda x:x)
   A  B  C
0  1  2  3
>>> pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A', 'B']).apply(lambda x:x)
Empty DataFrame
Columns: []
Index: []

Any news? I have @valentas-kurauskas's issue too.

The problem @valentas-kurauskas and @kuraga reported seems to be that, since the DataFrame is empty, the function passed to apply is never called.
apply cannot know the function's return type (which could be a scalar, a Series, or a DataFrame), so it can only drop the structure.
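A quick way to confirm that the applied function is indeed never invoked on the empty frame (my own check, on the versions mentioned above):

import pandas as pd

calls = []

def probe(group):
    calls.append(1)  # record every invocation
    return group

pd.DataFrame(data=[], columns=['A', 'B', 'C']).groupby(['A', 'B']).apply(probe)
print(len(calls))  # 0 -- apply never sees a group, so it has no columns to infer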
