>>> from pandas import DataFrame
>>> def applym(df):
...     print(df.iloc[0]['a'])
...     return DataFrame({'a': [1], 'b': [1]})
>>> df = DataFrame({'a':[1,1,1,2,3],'b':['a','a','a','b','c']})
>>> df.groupby('b').apply(applym)
1
1
2
3
a b
b
a 0 1 1
b 0 1 1
c 0 1 1
>>>
applym is called twice on the first group.
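For reference, a small counter makes the duplicate call visible (a sketch; `count_calls` is an illustrative name, and recent pandas versions have since removed the extra inference call, so the list may show three or four entries depending on your installation):

```python
import pandas as pd

calls = []

def count_calls(group):
    # record each invocation; on affected pandas versions the first
    # group appears twice because apply probes it to infer the output
    # shape before choosing the fast or slow path
    calls.append(group.name)
    return pd.DataFrame({'a': [1], 'b': [1]})

df = pd.DataFrame({'a': [1, 1, 1, 2, 3], 'b': ['a', 'a', 'a', 'b', 'c']})
df.groupby('b').apply(count_calls)
print(calls)
```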
I believe this is a duplicate of #2656 and not a bug...
The function is called twice for internal implementation reasons but the results are correct.
Possibly the docs should be updated to indicate that func passed to apply should not have side-effects.
I see. I agree with the suggestion on the other thread that there should be a way I can tell apply that there are no side effects, so that it does not have to run the first group twice.
At the very least, it shouldn't run the first group twice if there is only one group.
hmm...well, what do you actually need guaranteed and why? just that the function is only called once per group? (or do you need a specific ordering, too?)
there's no way to add an option "i want to take the fast path always" because the fast path makes certain assumptions about memory aliasing and such and can cause segfault if they're incorrect. plus, the fast path implementation could change or an even faster path could be implemented, all of which are internal implementation details.
so it only makes sense to allow an option to force it to take the slow path, and the only use case for that would be if you depended on the order or number of calls (which would only matter if your function had side-effects, not vice-versa, unless your apply function is very expensive and you're worried about the CPU cycles...). is there a specific reason you need that?
The case I came across is an expensive apply function with few groups. My apply takes 30 seconds, so running the first group twice adds 30 seconds to the runtime. Now that I know about this double run, I have changed my code to do the split-apply-concat step manually. Next time I may try using a global variable to simply skip the duplicate first run.
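The manual split-apply-concat step mentioned above is easy to sketch (`expensive_apply` is a hypothetical stand-in for the 30-second function); iterating over the groups yourself guarantees exactly one call per group:

```python
import pandas as pd

def expensive_apply(group):
    # stand-in for a costly per-group computation
    return pd.DataFrame({'a': [group['a'].sum()]})

df = pd.DataFrame({'a': [1, 1, 1, 2, 3], 'b': ['a', 'a', 'a', 'b', 'c']})

# split-apply-concat by hand: one call per group, no inference pass
pieces = {name: expensive_apply(group) for name, group in df.groupby('b')}
result = pd.concat(pieces)
```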
ok, if your apply is that expensive that makes sense then. (so maybe it should be an option to force the basic path)
Marked as enhancement for someday. Just have to get a groupby parameter to flow through
Why, if I use a lambda expression, doesn't the first iteration run twice?
>>> df = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
>>> identity = lambda row: print(tuple(row))
>>> df2 = df.apply(identity, axis=1)
('x', 1)
('y', 2)
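Whether row-wise `DataFrame.apply` makes an extra inference call depends on the pandas version; a counter shows what your installation does (a sketch, `probe` is an illustrative name):

```python
import pandas as pd

count = {'n': 0}

def probe(row):
    # count how many times apply invokes the function
    count['n'] += 1
    return tuple(row)

df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})
df.apply(probe, axis=1)
print(count['n'])  # number of rows, plus one on versions that probe
```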
I'd like to consider adopting dask's behavior here, where the user provides a meta keyword to disable any inference.
An empty ``pd.DataFrame`` or ``pd.Series`` that matches the dtypes and
column names of the output. This metadata is necessary for many algorithms
in dask dataframe to work. For ease of use, some alternative inputs are
also available. Instead of a ``DataFrame``, a ``dict`` of ``{name: dtype}``
or iterable of ``(name, dtype)`` can be provided. Instead of a series, a
tuple of ``(name, dtype)`` can be used. If not provided, dask will try to
infer the metadata. This may lead to unexpected results, so providing
``meta`` is recommended. For more information, see
``dask.dataframe.utils.make_meta``.
This has worked quite well for dask.
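To illustrate the idea (a hypothetical sketch, not pandas or dask API: `apply_once` and its `meta` argument are made up here), a `meta`-style keyword lets the caller declare the output dtypes so that no inference call on the first group is needed:

```python
import pandas as pd

def apply_once(grouped, func, meta=None):
    # hypothetical helper: the caller supplies `meta` (a {name: dtype}
    # mapping describing the output), so `func` can be called exactly
    # once per group with no inference pass
    pieces = {name: func(group) for name, group in grouped}
    out = pd.concat(pieces)
    return out.astype(meta) if meta is not None else out

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']})
result = apply_once(df.groupby('b'),
                    lambda g: g[['a']].sum().to_frame().T,
                    meta={'a': 'int64'})
```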
I don't think we need to do this at all; rather, just compute the path once and use it.
I'm also using groupby.apply() on a small number of groups, which can sometimes be just 1 depending on user inputs. In that case the code is 2x slower.
An option to force the path would be much appreciated.