import pandas as pd
import numpy as np
import sys
pd.options.display.max_rows = 10
print('pandas version', pd.__version__)
print('python version', sys.version)
#pandas version 0.22.0
#python version 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
msgs = pd.DataFrame({'orderid': np.random.random_sample(size=100000),
                     'qty': np.random.random_sample(size=100000)})
msgs['date'] = '1900-01-01'
msgs['textcol'] = 'lorem ipsum etc'
msgs.info()
# without textcol in the data: takes 59 ms
g = msgs[['date','orderid','qty']].groupby(['date','orderid'])
%time orders = g.first()
orders.info(null_counts=True)
# with textcol in the data: takes 10.6 s
g = msgs.groupby(['date','orderid'])
%time orders = g.first()
orders.info(null_counts=True)
I find that the presence of a text column in a DataFrame's data (i.e. not among the groupby keys) slows down groupby.first() by two orders of magnitude in version 0.22 (but not in 0.21.1): the operation takes 59 ms without a text column present in the data and 10.6 s when one is. (The problem is not limited to this kind of made-up data; I discovered it in my own work after upgrading pandas.)
When I run the same code under 0.21.1 the times are 55 ms and 67 ms.
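For anyone timing this outside IPython, here is a plain-Python equivalent of the %time cells above (a sketch: it reuses the msgs frame defined earlier, and the absolute numbers will vary by machine):
import timeit

# Time groupby.first() without and with the object (text) column in the data.
without_text = timeit.timeit(
    lambda: msgs[['date', 'orderid', 'qty']].groupby(['date', 'orderid']).first(),
    number=1)
with_text = timeit.timeit(
    lambda: msgs.groupby(['date', 'orderid']).first(),
    number=1)
print('without textcol: %.3f s' % without_text)
print('with textcol:    %.3f s' % with_text)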
pd.show_versions()
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
~~I'm unable to reproduce your slowdown~~:
On 0.21.1:
In [6]: %timeit g1.first()
2.63 ms ± 51.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit g2.first()
7.96 ms ± 71.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: pd.__version__
Out[8]: '0.21.1'
On 0.22.0:
In [8]: g1 = msgs[['date','orderid','qty']].groupby(['date','orderid'])
In [9]: g2 = msgs.groupby(['date', 'orderid'])
In [10]: %timeit g1.first()
2.49 ms ± 70.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit g2.first()
5.07 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [12]: pd.__version__
Out[12]: '0.22.0'
Are there any other differences in your environments? 0.22.0 contained only one change relative to 0.21.1, and I'd be surprised if it had an impact on GroupBy.first.
Edit: Whoops, I'm apparently unable to read. I missed the ms vs. s in the second groupby. Consider me surprised then.
It might be worth removing your line "I'm unable to reproduce your slowdown". People like me might stop reading right there assuming the OP was a crackpot. ;-)
Thanks @jdmarino - it seems the min_count changes in #18921 are causing the groupby dispatch to the right Cython method to fail. Fortunately/unfortunately there's a fallback, so tests didn't pick it up. As a much faster workaround you could use .nth(0).
In [18]: g2 = msgs.groupby(['date', 'orderid'])
In [19]: %timeit g2.nth(0)
64.2 ms ± 463 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [16]: from pandas.testing import assert_frame_equal
In [17]: assert_frame_equal(g2.nth(0), g2.first())
@TomAugspurger - looks like the `1` in this line is the issue?
https://github.com/pandas-dev/pandas/blob/65d1b62dfe46cd5cfd80d8b1aafd633575c72ee6/pandas/core/groupby.py#L2147
def fail():
    raise Exception

g2 = msgs.groupby(['date', 'orderid'])
g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-12-220c58451090> in <module>()
----> 1 g2._cython_agg_general('first', alt=fail, numeric_only=False, min_count=1)
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_general(self, how, alt, numeric_only, min_count)
3607 min_count=-1):
3608 new_items, new_blocks = self._cython_agg_blocks(
-> 3609 how, alt=alt, numeric_only=numeric_only, min_count=min_count)
3610 return self._wrap_agged_blocks(new_items, new_blocks)
3611
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_agg_blocks(self, how, alt, numeric_only, min_count)
3651 try:
3652 result, _ = self.grouper.aggregate(
-> 3653 block.values, how, axis=agg_axis, min_count=min_count)
3654 except NotImplementedError:
3655 # generally if we have numeric_only=False
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in aggregate(self, values, how, axis, min_count)
2312 def aggregate(self, values, how, axis=0, min_count=-1):
2313 return self._cython_operation('aggregate', values, how, axis,
-> 2314 min_count=min_count)
2315
2316 def transform(self, values, how, axis=0):
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _cython_operation(self, kind, values, how, axis, min_count)
2268 result = self._aggregate(
2269 result, counts, values, labels, func, is_numeric,
-> 2270 is_datetimelike, min_count)
2271 elif kind == 'transform':
2272 result = _maybe_fill(np.empty_like(values, dtype=out_dtype),
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in _aggregate(self, result, counts, values, comp_ids, agg_func, is_numeric, is_datetimelike, min_count)
2330 min_count)
2331 else:
-> 2332 agg_func(result, counts, values, comp_ids, min_count)
2333
2334 return result
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in wrapper(*args, **kwargs)
2169
2170 def wrapper(*args, **kwargs):
-> 2171 return f(afunc, *args, **kwargs)
2172
2173 # need to curry our sub-function
~\AppData\Local\Continuum\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby.py in <lambda>(func, a, b, c, d, e)
2113 'first': {
2114 'name': 'group_nth',
-> 2115 'f': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)
2116 },
2117 'last': 'group_last',
pandas/_libs/groupby.pyx in pandas._libs.groupby.group_nth_object()
TypeError: group_nth_object() takes exactly 5 positional arguments (6 given)
Thanks @chris-b1. I added a min_count to group_nth_ in groupby_helper, but group_nth_object is defined separately, and so lost it. Adding in an unused min_count to group_nth_object will fix things.
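To make the failure mode concrete, here is a minimal pure-Python sketch of the mismatch and of the fix just described (illustrative stand-ins only: the real kernels are Cython functions in pandas/_libs/groupby.pyx, and these signatures are simplified):
# The dispatch table curries 'first' as group_nth with rank=1 and min_count=-1,
# so the wrapper always passes six positional arguments to the kernel.
dispatch = {'first': lambda func, a, b, c, d, e: func(a, b, c, d, 1, -1)}

# Numeric kernels gained a min_count parameter in #18921 ...
def group_nth_float64(out, counts, values, labels, rank, min_count):
    pass

# ... but group_nth_object, defined separately, kept the old 5-argument signature.
def group_nth_object(out, counts, values, labels, rank):
    pass

args = (None, None, None, None, None)
dispatch['first'](group_nth_float64, *args)  # fine
try:
    dispatch['first'](group_nth_object, *args)
except TypeError as exc:
    # group_nth_object() takes 5 positional arguments but 6 were given;
    # in the real code this is caught and object columns fall back to a
    # much slower non-Cython path.
    print(exc)

# The fix: accept (and ignore) min_count in the object kernel too.
def group_nth_object_fixed(out, counts, values, labels, rank, min_count=-1):
    assert min_count == -1  # min_count is only meaningful for sum/prod
    pass

dispatch['first'](group_nth_object_fixed, *args)  # now dispatches cleanly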
Hi,
I would like to add that we are having the same issue. We have a block of code that groups by a column and then calls first, as follows:
>>> groups = self.df.groupby(['col_name'])
>>> firsts = groups.first()
Under pandas 0.21 it takes less than 3 seconds.
Under pandas 0.22 it takes between 2 and 3 minutes.
The same performance issue also affects GroupBy.last.
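Until the fix lands, here is a sketch of the .nth workaround mentioned above applied to this pattern (made-up column names and data; note the caveat shown later in this thread that .nth returns the literal row, so results differ from .first()/.last() when groups contain NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_name': np.random.randint(0, 1000, size=100000),
                   'val': np.random.random_sample(size=100000),
                   'txt': 'lorem ipsum'})

groups = df.groupby(['col_name'])
firsts = groups.nth(0)   # fast substitute for groups.first() on 0.22.0
lasts = groups.nth(-1)   # analogous substitute for groups.last()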
Anybody interested in submitting a fix? It'll involve touching some Cython, but the only change is adding a min_count parameter to https://github.com/pandas-dev/pandas/blob/238499ab0a48a0ad4a2011e2ce1c6a02c86124eb/pandas/_libs/groupby.pyx#L35, and asserting that it's always -1.
A friendly question:
Has this problem been resolved, or does the recommended solution remain to use .nth(0)?
Tears fall down my face due to how slow .first() and .last() are.
I have Anaconda 5.1 installed, and pd.__version__ returns '0.22.0'.
I could use .nth(0) instead of .first(), but I am not quite sure what to do with .last(), apart from doing the computations on DataFrame.as_matrix() with numpy.
This is fixed in the soon-ish to be released 0.23.0.
Opened an issue here: #20657
I also find that .last() does not behave as expected (might be related; not sure whether I need to open a new issue):
Example:
df = pd.DataFrame([[179293473, '2016-06-01 00:00:03.549745', 'http://www.dr.dk/nyheder/', 39169523],
                   [179293473, '2016-06-01 00:04:22.346018', 'https://www.information.dk/indland/2016/05/hvert-tredje-offer-naar-anmelde-voldtaegt-tide', 39125224],
                   [179773461, '2016-06-01 22:13:16.588146', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:14:04.059781', 'https://www.google.dk', 31658124],
                   [179773461, '2016-06-01 22:16:37.230587', np.nan, 31658124],
                   [179773461, '2016-06-01 22:23:09.847149', 'https://www.google.dk', 32718401],
                   [179773461, '2016-06-01 22:23:55.158929', np.nan, 32718401],
                   [179773461, '2016-06-01 22:27:00.857224', np.nan, 32718401]],
                  columns=['SessionID', 'PageTime', 'ReferrerURL', 'PageID'])
Now, when I run:
df.groupby('SessionID').last()
I get:
| SessionID | PageTime | ReferrerURL | PageID |
| --- | --- | --- | --- |
| 179293473 | 2016-06-01 00:04:22.346018 | https://www.information.dk/indland/2016/05/hve... | 39125224 |
| 179773461 | 2016-06-01 22:27:00.857224 | https://www.google.dk | 32718401 |
When, in fact, I was expecting the same result as obtained from:
df.groupby('SessionID').nth(-1)
| SessionID | PageID | PageTime | ReferrerURL |
| --- | --- | --- | --- |
| 179293473 | 39125224 | 2016-06-01 00:04:22.346018 | https://www.information.dk/indland/2016/05/hve... |
| 179773461 | 32718401 | 2016-06-01 22:27:00.857224 | NaN |
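As far as I can tell, the .last() result above reflects NaN-skipping semantics rather than the regression in this issue: GroupBy.last() returns the last non-null value in each column, while .nth(-1) returns the literal last row, NaNs included. A minimal sketch of the difference:
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [1, 1, 1],
                   'url': ['a', 'b', np.nan]})
print(df.groupby('key').last())   # url == 'b'  (last non-null value per column)
print(df.groupby('key').nth(-1))  # url == NaN  (the literal last row)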
And while we are at .nth(), why does it mix up my column order?
I think there's already an open issue for that, if not please make one.