>>> from pandas import DataFrame
>>> def applym(df):
...     print(df.iloc[0]['a'])
...     return DataFrame({'a': [1], 'b': [1]})
>>> df = DataFrame({'a':[1,1,1,2,3],'b':['a','a','a','b','c']})
>>> df.groupby('b').apply(applym)
1
1
2
3
a b
b
a 0 1 1
b 0 1 1
c 0 1 1
>>>
applym is called twice on the first group.
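For reference, a small counter makes the duplicate call visible (a sketch; `count_calls` is an illustrative name, and recent pandas versions have since removed the extra inference call, so the list may show three or four entries depending on your installation):

```python
import pandas as pd

calls = []

def count_calls(group):
    # record each invocation; on affected pandas versions the first
    # group appears twice because apply probes it to infer the output
    # shape before choosing the fast or slow path
    calls.append(group.name)
    return pd.DataFrame({'a': [1], 'b': [1]})

df = pd.DataFrame({'a': [1, 1, 1, 2, 3], 'b': ['a', 'a', 'a', 'b', 'c']})
df.groupby('b').apply(count_calls)
print(calls)
```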
I believe this is a duplicate of #2656 and not a bug...
The function is called twice for internal implementation reasons but the results are correct.
Possibly the docs should be updated to indicate that func passed to apply should not have side-effects.
I see. I agree with the suggestion on the other thread that there should be a way I can tell apply that there are no side effects, so that it does not have to run the first group twice.
At the very least, it shouldn't run the first group twice if there is only one group.
hmm...well, what do you actually need guaranteed and why? just that the function is only called once per group? (or do you need a specific ordering, too?)
there's no way to add an option "i want to take the fast path always" because the fast path makes certain assumptions about memory aliasing and such and can cause segfault if they're incorrect. plus, the fast path implementation could change or an even faster path could be implemented, all of which are internal implementation details.
so it only makes sense to allow an option to force it to take the slow path, and the only use case for that would be if you depended on the order or number of calls (which would only matter if your function had side-effects, not vice-versa, unless your apply function is very expensive and you're worried about the CPU cycles...). is there a specific reason you need that?
The case I came across is an expensive apply function with few groups. My apply takes 30 seconds, so running the first group twice adds 30 seconds to the runtime. Now that I know about this double run, I have changed my code to do the split-apply-concat step manually. Next time I may try using a global variable to simply skip the duplicate first run.
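The manual split-apply-concat step mentioned above is easy to sketch (`expensive_apply` is a hypothetical stand-in for the 30-second function); iterating over the groups yourself guarantees exactly one call per group:

```python
import pandas as pd

def expensive_apply(group):
    # stand-in for a costly per-group computation
    return pd.DataFrame({'a': [group['a'].sum()]})

df = pd.DataFrame({'a': [1, 1, 1, 2, 3], 'b': ['a', 'a', 'a', 'b', 'c']})

# split-apply-concat by hand: one call per group, no inference pass
pieces = {name: expensive_apply(group) for name, group in df.groupby('b')}
result = pd.concat(pieces)
```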
ok, if your apply is that expensive that makes sense then. (so maybe it should be an option to force the basic path)
Marked as enhancement for someday. Just have to get a groupby parameter to flow through
Why, if I use a lambda expression, doesn't the first iteration run twice?
>>> df = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
>>> identity = lambda row: print(tuple(row))
>>> df2 = df.apply(identity, axis=1)
('x', 1)
('y', 2)
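Whether row-wise `DataFrame.apply` makes an extra inference call depends on the pandas version; a counter shows what your installation does (a sketch, `probe` is an illustrative name):

```python
import pandas as pd

count = {'n': 0}

def probe(row):
    # count how many times apply invokes the function
    count['n'] += 1
    return tuple(row)

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})
df.apply(probe, axis=1)
print(count['n'])  # number of rows, plus one on versions that probe
```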
I'd like to consider adopting dask's behavior here, where the user provides a meta keyword to disable any inference.
An empty ``pd.DataFrame`` or ``pd.Series`` that matches the dtypes and
column names of the output. This metadata is necessary for many algorithms
in dask dataframe to work. For ease of use, some alternative inputs are
also available. Instead of a ``DataFrame``, a ``dict`` of ``{name: dtype}``
or iterable of ``(name, dtype)`` can be provided. Instead of a series, a
tuple of ``(name, dtype)`` can be used. If not provided, dask will try to
infer the metadata. This may lead to unexpected results, so providing
``meta`` is recommended. For more information, see
``dask.dataframe.utils.make_meta``.
This has worked quite well for dask.
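To illustrate the idea (a hypothetical sketch, not pandas or dask API: `apply_once` and its `meta` argument are made up here), a `meta`-style keyword lets the caller declare the output dtypes so that no inference call on the first group is needed:

```python
import pandas as pd

def apply_once(grouped, func, meta=None):
    # hypothetical helper: the caller supplies `meta` (a {name: dtype}
    # mapping describing the output), so `func` can be called exactly
    # once per group with no inference pass
    pieces = {name: func(group) for name, group in grouped}
    out = pd.concat(pieces)
    return out.astype(meta) if meta is not None else out

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
result = apply_once(df.groupby('b'),
                    lambda g: g[['a']].sum().to_frame().T,
                    meta={'a': 'int64'})
```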
I don't think we need to do this at all; rather, just compute the path once and use it.
I'm also using groupby.apply() on a small number of groups, which can sometimes be just 1 depending on user inputs. In that case the code is 2x slower.
An option to force the path would be much appreciated.