import pandas
# Works great
df1 = pandas.DataFrame([[20,'A'],[20,'B'],[10,'C']])
gb1 = df1.groupby(0).agg(pandas.Series.mode)
display(gb1)
# Exception: Must produce aggregated value
df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
gb2 = df2.groupby(0).agg(pandas.Series.mode)
display(gb2)
As shown, the above code works for df1, returning the following result:
0
10 C
20 [A, B]
(where C is a str, and [A, B] is a numpy.ndarray)
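The cell types can be checked directly (a small sketch, assuming the gb1 result above; the value column keeps its integer label 1):
print(type(gb1.loc[10, 1]))  # <class 'str'>, the single mode collapsed to a scalar
print(type(gb1.loc[20, 1]))  # <class 'numpy.ndarray'>, the two-way tie stayed an array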
However, it doesn't work for df2, throwing the following exception:
...
C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby\generic.py in _aggregate_named(self, func, *args, **kwargs)
907 output = func(group, *args, **kwargs)
908 if isinstance(output, (Series, Index, np.ndarray)):
--> 909 raise Exception('Must produce aggregated value')
910 result[name] = self._try_cast(output, group)
911
Exception: Must produce aggregated value
It looks like the order in which agg processes the groups affects pandas' error checking: if the first result (GroupBy row) is a numpy.ndarray, the above exception is thrown, but if the first result is a str/scalar, processing continues and further such cases no longer trigger the exception.
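If that reading is right, the same data should fail once the array-producing group comes first. A minimal sketch (untested assumption: with sort=False the groups are processed in order of appearance, so group 20 goes first):
# Default sort=True processes group 10 first; its mode is the scalar 'C',
# so the check passes and the array result for group 20 slips through.
df1.groupby(0).agg(pandas.Series.mode)               # works, as shown above
# With sort=False, group 20 (mode ['A', 'B'], an ndarray) is presumably
# processed first, so the same "Must produce aggregated value" error is expected.
df1.groupby(0, sort=False).agg(pandas.Series.mode)   # expected to raise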
In my opinion, this issue may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes produces a scalar and sometimes a numpy.ndarray in the aggregated result, the behavior is more elusive and more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).
For reference, the expected output for df2 would be:
0
20 [A, B]
30 C
Output of pd.show_versions():
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
Yeah, I see why this is confusing, but generally we don't support agg reducing to anything but a scalar, so I would think the first example returning something is pure coincidence and, if anything, the second example is what we should be doing here.
Others may disagree though, so let's see. FWIW, apply is a more general function, so you'd be safer to do something like:
df2.groupby(0)[1].apply(lambda x: list(x.mode()))
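For completeness, a rough sketch of that workaround applied to both example frames (wrapping each group's modes in a list so the result type no longer depends on group order):
import pandas
df1 = pandas.DataFrame([[20, 'A'], [20, 'B'], [10, 'C']])
df2 = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']])
# apply() does not require a scalar per group, so both frames go through
# and every group yields a list, even when there is only a single mode.
print(df1.groupby(0)[1].apply(lambda x: list(x.mode())))  # 10 -> ['C'], 20 -> ['A', 'B']
print(df2.groupby(0)[1].apply(lambda x: list(x.mode())))  # 20 -> ['A', 'B'], 30 -> ['C']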
As explained in https://github.com/pandas-dev/pandas/issues/19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is very often used for categorical data, so I am surprised that it still has not been implemented in pandas.
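Until something like groupby.mode() exists, one way to avoid the slow per-group Python call is a sketch along these lines (the column names 'key' and 'val' are illustrative): count each (key, value) pair once, then keep the values tied for the per-key maximum.
import pandas
df = pandas.DataFrame([[20, 'A'], [20, 'B'], [30, 'C']], columns=['key', 'val'])
# Count each (key, val) pair, then keep the value(s) whose count equals the
# per-key maximum; this mirrors the multi-modal behaviour of Series.mode
# while leaving the heavy lifting to vectorised group operations.
counts = df.groupby(['key', 'val']).size().rename('n').reset_index()
is_mode = counts['n'] == counts.groupby('key')['n'].transform('max')
modes = counts[is_mode].groupby('key')['val'].apply(list)
print(modes)  # 20 -> ['A', 'B'], 30 -> ['C']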