import pandas as pd

df = pd.DataFrame([['A1', 'B1', 0, 3],
                   ['A1', 'B1', 1, 2],
                   ['A2', 'B1', 2, 6],
                   ['A2', 'B1', 3, 9],
                   ['A2', 'B1', 4, 2]
                   ], columns=['A', 'B', 'C', 'D'])
# Convert column B to a categorical dtype before grouping
df['B'] = pd.Categorical(df.B)
df.groupby(['A', 'B', 'C']).sum()
I would expect the above code to produce a result of shape (5, 1), the same as when column B is not converted to categorical. Instead, it returns a DataFrame of shape (10, 1): one row for every combination of the 2 values of A, the 1 category of B, and the 5 values of C.
This is a serious problem if your data is large and you are grouping by multiple categorical keys. The resulting DataFrame can grow too large and eventually raise a MemoryError.
I am getting this:

However, I expect to get this:

Output of pd.show_versions()
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-327.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: en_IE.UTF-8
pandas: 0.20.3
pytest: 3.2.1
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: None
xlwt: None
xlsxwriter: 0.9.8
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
@cissy7125 you can simply use this:
df.groupby(['A', 'B', 'C']).sum().dropna()
If you have a big dataframe with several keys, you can easily use up your memory before dropna gets a chance to run. Say you have 10 keys, each with 10 different categories: you will end up with 10^10 rows.
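To put a rough number on that, here is a back-of-the-envelope sketch. The 8 bytes per value is my assumption (a single float64 result column) and it ignores the index entirely, so the real footprint would be larger.

```python
import math

# Hypothetical scenario from the comment above: 10 grouping keys,
# each with 10 distinct categories.
n_keys = 10
categories_per_key = 10

# The grouped result has one row per combination of categories.
rows = math.prod([categories_per_key] * n_keys)

# Assumption: one float64 result column (8 bytes per value), index not counted.
bytes_per_value = 8
gigabytes = rows * bytes_per_value / 1e9

print(f"{rows:,} rows")        # 10,000,000,000 rows
print(f"{gigabytes:.1f} GB")   # 80.0 GB
```

So even a single aggregated column would need about 80 GB before dropna could discard anything.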
Isn't this behavior corrected in 0.23?
You are still running 0.20.3.
Yes, you can pass observed=True on 0.23.0.
See http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#new-observed-keyword-for-excluding-unobserved-categories-in-groupby
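For reference, a minimal sketch of what that looks like with the dataframe from the original report. It assumes pandas 0.23.0 or later, where groupby accepts the observed keyword; the variable name observed_only is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [["A1", "B1", 0, 3],
     ["A1", "B1", 1, 2],
     ["A2", "B1", 2, 6],
     ["A2", "B1", 3, 9],
     ["A2", "B1", 4, 2]],
    columns=["A", "B", "C", "D"],
)
df["B"] = pd.Categorical(df["B"])

# observed=True keeps only the key combinations that actually occur in the
# data, instead of building every combination of the grouping levels.
observed_only = df.groupby(["A", "B", "C"], observed=True).sum()

print(observed_only.shape)  # (5, 1)
```

Because the unobserved combinations are never materialized, this avoids the memory blow-up rather than cleaning it up afterwards the way dropna would.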