import pandas as pd
import numpy as np
import sys
pd.options.display.max_rows = 10
print('pandas version', pd.__version__)
print('python version', sys.version)
#pandas version 0.22.0
#python version 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
msgs = pd.DataFrame({'orderid': np.random.random_sample(size=100000),
                     'qty': np.random.random_sample(size=100000)})
msgs['date'] = '1900-01-01'
msgs['textcol'] = 'lorem ipsum etc'
msgs.info()
# without textcol in the data: takes 59 ms
g = msgs[['date','orderid','qty']].groupby(['date','orderid'])
%time orders = g.first()
orders.info(null_counts=True)
# with textcol in the data: takes 10.6 s
g = msgs.groupby(['date','orderid'])
%time orders = g.first()
orders.info(null_counts=True)
I find that the presence of a text column in a DataFrame's data (i.e. not among the groupby keys) slows down groupby.first() by two orders of magnitude in version 0.22 (but not in 0.21.1): the operation takes 59 ms without a text column present in the data and 10.6 s when one is. (The problem is not limited to this kind of made-up data; I discovered it in my own work after upgrading pandas.)
When I run the same code under 0.21.1 the times are 55 ms and 67 ms.
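For anyone timing this outside IPython, here is a plain-Python equivalent of the %time cells above (a sketch: it reuses the msgs frame defined earlier, and the absolute numbers will vary by machine):
import timeit

# Time groupby.first() without and with the object (text) column in the data.
without_text = timeit.timeit(
    lambda: msgs[['date', 'orderid', 'qty']].groupby(['date', 'orderid']).first(),
    number=1)
with_text = timeit.timeit(
    lambda: msgs.groupby(['date', 'orderid']).first(),
    number=1)
print('without textcol: %.3f s' % without_text)
print('with textcol:    %.3f s' % with_text)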
pd.show_versions()
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
~~I'm unable to reproduce your slowdown~~:
On 0.21.1:
In [6]: %timeit g1.first()
2.63 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit g2.first()
7.96 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: pd.__version__
Out[8]: '0.21.1'
On 0.22.0:
In [8]: g1 = msgs[['date','orderid','qty']].groupby(['date','orderid'])
In [9]: g2 = msgs.groupby(['date', 'orderid'])
In [10]: %timeit g1.first()
2.49 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit g2.first()
5.07 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: pd.__version__
Out[12]: '0.22.0'
Are there any other differences in your environments? 0.22.0 contained only one change relative to 0.21.1, and I'd be surprised if it had an impact on GroupBy.first.
Edit: Whoops, I'm apparently unable to read. I missed the ms vs. s in the second groupby. Consider me surprised then.
It might be worth removing your line "I'm unable to reproduce your slowdown". People like me might stop reading right there assuming the OP was a crackpot. ;-)
Thanks @jdmarino - it seems the min_count changes in #18921 are causing the groupby dispatch to the right Cython method to fail. Fortunately/unfortunately there's a fallback, so tests didn't pick it up. As a much faster workaround you could use .nth(0).
In [18]: g2 = msgs.groupby(['date', 'orderid'])
In [19]: %timeit g2.nth(0)
64.2 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: from pandas.testing import assert_frame_equal
In [17]: assert_frame_equal(g2.nth(0), g2.first())
@TomAugspurger - looks like the `1` in this line is the issue?
https://github.com/pandas-dev/pandas/blob/65d1b62dfe46cd5cfd80d8b1aafd633575c72ee6/pandas/core/groupby.py#L2147
def fail():
    raise Exception

g2 = msgs.groupby(['date', 'orderid'])
g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-220c58451090> in <module>()
----> 1 g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
3607 min_count=-1):
3608 new_items, new_blocks = self._cython_agg_blocks(
-> 3609 how, alt=alt, numeric_only=numeric_only, min_count=min_count)
3610 return self._wrap_agged_blocks(new_items, new_blocks)
3611
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
3651 try:
3652 result, _ = self.grouper.aggregate(
-> 3653 block.values, how, axis=agg_axis, min_count=min_count)
3654 except NotImplementedError:
3655 # generally if we have numeric_only=False
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in aggregate(self, values, how, axis, min_count)
2312 def aggregate(self, values, how, axis=0, min_count=-1):
2313 return self._cython_operation('aggregate', values, how, axis,
-> 2314 min_count=min_count)
2315
2316 def transform(self, values, how, axis=0):
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_operation(self, kind, values, how, axis, min_count)
2268 result = self._aggregate(
2269 result, counts, values, labels, func, is_numeric,
-> 2270 is_datetimelike, min_count)
2271 elif kind == 'transform':
2272 result = _maybe_fill(np.empty_like(values, dtype=out_dtype),
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _aggregate(self, result, counts, values, comp_ids, agg_func, is_numeric, is_datetimelike, min_count)
2330 min_count)
2331 else:
-> 2332 agg_func(result, counts, values, comp_ids, min_count)
2333
2334 return result
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in wrapper(*args, **kwargs)
2169
2170 def wrapper(*args, **kwargs):
-> 2171 return f(afunc, *args, **kwargs)
2172
2173 # need to curry our sub-function
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in <lambda>(func, a, b, c, d, e)
2113 'first': {
2114 'name': 'group_nth',
-> 2115 'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)
2116 },
2117 'last': 'group_last',
pandas/_libs/groupby.pyx in pandas._libs.groupby.group_nth_object()
TypeError: group_nth_object() takes exactly 5 positional arguments (6 given)
Thanks @chris-b1. I added a min_count to group_nth_ in groupby_helper, but group_nth_object is defined separately, and so lost it. Adding in an unused min_count to group_nth_object will fix things.
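To make the failure mode concrete, here is a minimal pure-Python sketch of the mismatch and of the fix just described (illustrative stand-ins only: the real kernels are Cython functions in pandas/_libs/groupby.pyx, and these signatures are simplified):
# The dispatch table curries 'first' as group_nth with rank=1 and min_count=-1,
# so the wrapper always passes six positional arguments to the kernel.
dispatch = {'first': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)}

# Numeric kernels gained a min_count parameter in #18921 ...
def group_nth_float64(out, counts, values, labels, rank, min_count):
    pass

# ... but group_nth_object, defined separately, kept the old 5-argument signature.
def group_nth_object(out, counts, values, labels, rank):
    pass

args = (None, None, None, None, None)
dispatch['first'](group_nth_float64, *args)  # fine
try:
    dispatch['first'](group_nth_object, *args)
except TypeError as exc:
    # group_nth_object() takes 5 positional arguments but 6 were given;
    # in the real code this is caught and object columns fall back to a
    # much slower non-Cython path.
    print(exc)

# The fix: accept (and ignore) min_count in the object kernel too.
def group_nth_object_fixed(out, counts, values, labels, rank, min_count=-1):
    assert min_count == -1  # min_count is only meaningful for sum/prod
    pass

dispatch['first'](group_nth_object_fixed, *args)  # now dispatches cleanly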
Hi,
I would like to add that we are having the same issue. We have a block of code that groups by a column and then calls first, as follows:
>>> groups = self.df.groupby(['col_name'])
>>> firsts = groups.first()
Under pandas 0.21 it takes less than 3 seconds.
Under pandas 0.22 it takes between 2 and 3 minutes.
The same performance issue also affects GroupBy.last.
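Until the fix lands, here is a sketch of the .nth workaround mentioned above applied to this pattern (made-up column names and data; note the caveat shown later in this thread that .nth returns the literal row, so results differ from .first()/.last() when groups contain NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': np.random.randint(0, 1000, size=100000),
                   'val': np.random.random_sample(size=100000),
                   'txt': 'lorem ipsum'})

groups = df.groupby(['col_name'])
firsts = groups.nth(0)   # fast substitute for groups.first() on 0.22.0
lasts = groups.nth(-1)   # analogous substitute for groups.last()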
Anybody interested in submitting a fix? It'll involve touching some Cython, but the only change is adding a min_count parameter to https://github.com/pandas-dev/pandas/blob/238499ab0a48a0ad4a2011e2ce1c6a02c86124eb/pandas/_libs/groupby.pyx#L35, and asserting that it's always -1.
A friendly question:
Has this problem been resolved, or does the recommended solution remain to use .nth(0)?
Tears fall down my face due to how slow .first() and .last() are.
I have Anaconda 5.1 installed, and pd.__version__ returns '0.22.0'.
I could use .nth(0) instead of .first(), but I am not quite sure what to do with .last(), apart from doing the computations on DataFrame.as_matrix() with numpy.
This is fixed in the soon-ish to be released 0.23.0.
Opened an issue here: #20657
I also find that .last() does not behave as expected (might be related; not sure whether I need to open a new issue):
Example:
df = pd.DataFrame([[179293473, '2016-06-01 00:00:03.549745', 'http://www.dr.dk/nyheder/', 39169523],
                   [179293473, '2016-06-01 00:04:22.346018', 'https://www.information.dk/indland/2016/05/hvert-tredje-offer-naar-anmelde-voldtaegt-tide', 39125224],
                   [179773461, '2016-06-01 22:13:16.588146', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:14:04.059781', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:16:37.230587', np.nan, 31658124],
                   [179773461, '2016-06-01 22:23:09.847149', 'https://www.google.dk', 32718401],
                   [179773461, '2016-06-01 22:23:55.158929', np.nan, 32718401],
                   [179773461, '2016-06-01 22:27:00.857224', np.nan, 32718401]],
                  columns=['SessionID', 'PageTime', 'ReferrerURL', 'PageID'])
Now, when I run:
df.groupby('SessionID').last()
I get:
| SessionID | PageTime | ReferrerURL | PageID |
| --- | --- | --- | --- |
| 179293473 | 2016-06-01 00:04:22.346018 | https://www.information.dk/indland/2016/05/hve... | 39125224 |
| 179773461 | 2016-06-01 22:27:00.857224 | https://www.google.dk | 32718401 |
When, in fact, I was expecting the same result as obtained from:
df.groupby('SessionID').nth(-1)
| SessionID | PageID | PageTime | ReferrerURL |
| --- | --- | --- | --- |
| 179293473 | 39125224 | 2016-06-01 00:04:22.346018 | https://www.information.dk/indland/2016/05/hve... |
| 179773461 | 32718401 | 2016-06-01 22:27:00.857224 | NaN |
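As far as I can tell, the .last() result above reflects NaN-skipping semantics rather than the regression in this issue: GroupBy.last() returns the last non-null value in each column, while .nth(-1) returns the literal last row, NaNs included. A minimal sketch of the difference:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 1],
                   'url': ['a', 'b', np.nan]})
print(df.groupby('key').last())   # url == 'b'  (last non-null value per column)
print(df.groupby('key').nth(-1))  # url == NaN  (the literal last row)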
And while we are at .nth(), why does it mix up my column order?
I think there's already an open issue for that, if not please make one.