Pandas: .groupby() .value_counts() incompatible with .reset_index() in 0.18.1

Created on 16 Aug 2016 · 2 Comments · Source: pandas-dev/pandas

Code Sample

import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1], [0, 2], [1, 1]], columns=['a', 'b'])
df
   a  b
0  0  1
1  0  1
2  0  2
3  1  1

df.groupby('a').b.value_counts().reset_index()

ValueError: cannot insert b, already exists

#### Expected Output

In version 0.18.0, the output was:

   a  b  0
0  0  1  2
1  0  2  1
2  1  1  1

The difference is that the groupby() value_counts() operation now returns a Series named after the column on which value_counts() was computed, so reset_index() tries to insert the index level b next to a values column that is also called b.

df.groupby('a').b.value_counts()

0.18.0

a  b
0  1    2
   2    1
1  1    1
dtype: int64

0.18.1 (including 0.18.1+367.g6b7857b)

a  b
0  1    2
   2    1
1  1    1
Name: b, dtype: int64

This change in behavior is not completely unexpected, given that outside of groupby(), value_counts() has historically returned a Series named after the column the operation was performed on:

df.a.value_counts()
0    3
1    1
Name: a, dtype: int64
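
To make the mechanism concrete: reset_index() turns each index level into a column and labels the values column with the Series name (or 0 when the Series is unnamed), so a Series named b whose index also has a level b would need two columns called b. The sketch below builds such a Series by hand instead of calling groupby(), so it should not depend on which pandas version is installed; the variable names are only illustrative.

```python
import pandas as pd

# Hand-built stand-in for the groupby().value_counts() result (illustrative).
index = pd.MultiIndex.from_tuples([(0, 1), (0, 2), (1, 1)], names=["a", "b"])
unnamed = pd.Series([2, 1, 1], index=index)            # like 0.18.0: no name
named_b = pd.Series([2, 1, 1], index=index, name="b")  # like 0.18.1: name 'b'

# Unnamed: the values column falls back to the label 0.
unnamed_columns = list(unnamed.reset_index().columns)

# Named 'b': the index level 'b' and the values column would share a label.
try:
    named_b.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(unnamed_columns)
print(error_message)
```

The first call succeeds because the fallback label 0 does not clash with a or b; the second raises the ValueError reported above.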

A manual workaround would be to rename the Series before reset_index() as follows:

g = df.groupby('a').b.value_counts()
g.name = 0
g.reset_index()
   a  b  0
0  0  1  2
1  0  2  1
2  1  1  1

However, the one-line functionality was much appreciated. Could the ability to pass a new name to value_counts() solve this issue?

#### output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1  (also verified with 0.18.1+367.g6b7857b)
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Usage Question

dcroote

All 2 comments

Probably a result of https://github.com/pydata/pandas/issues/12363 fixing groupby sometimes losing the name.

In this case I'd say that

In [37]: df.groupby('a').b.value_counts().reset_index(name='counts')
Out[37]:
   a  b  counts
0  0  1       2
1  0  2       1
2  1  1       1

is even clearer than your original. Thoughts?

TomAugspurger on 16 Aug 2016

👍6

Even better, thanks!

dcroote on 16 Aug 2016
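
For reference, the suggestion above can be run as a self-contained script. This is a minimal sketch, assuming a pandas version in which Series.reset_index() accepts a name argument and Series.rename() accepts a scalar; the column label counts and the variable names are arbitrary choices. It also shows a chained rename as an alternative one-liner to the manual g.name = 0 workaround.

```python
import pandas as pd

df = pd.DataFrame([[0, 1], [0, 1], [0, 2], [1, 1]], columns=["a", "b"])
counts = df.groupby("a")["b"].value_counts()

# Name the values column while resetting the index (the suggestion above).
via_reset = counts.reset_index(name="counts")

# Alternative: rename the Series first, then reset the index.
via_rename = counts.rename("counts").reset_index()

print(via_reset)
```

Both forms replace whatever name the Series carries, so neither should depend on how a given pandas release names the value_counts() result.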
