I have several groupby problems that crop up regularly at work**, and I almost always have to write custom functions that operate on individual groups - which are then very slow to execute.
Looking for solutions, I came across
http://esantorella.com/2016/06/16/groupby/
and the update
https://github.com/esantorella/hdfe/blob/master/groupby.py
I had to adapt my function a little to be compatible with numpy indexing instead of pandas indexing, but in one case I tested today, this sped up my code by more than a factor of 100, and that wasn't even for the full dataset. In fact (for ~5 million rows), the pandas-native version didn't finish within 2h, whereas the adaptation based on the code linked above ran in ~35 sec.
It would be very nice if such optimisations were taken care of directly by pandas, instead of having to work around them like this.
** one example among many: process data from a source, deduplicate based on a given signature, process further, keep the deduplicated record. This record needs to be recalculated any time one of its constituent records changes.
Not sure what you are asking for here. By definition an .apply will be slow, as it's essentially a python for loop.
Note that some applies can be written to be much more performant. Please give a specific example of what you think is slow.
@jreback The code I linked to is still a python for-loop, but it is much more efficient because it uses integers to represent the groups, and index slices if the groups are already ordered.
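For illustration, the core idea is roughly the following - a minimal sketch of my own (the helper name apply_by_codes is hypothetical, this is not the linked implementation): factorize the group keys into dense integer codes, argsort once, and then hand each group to the custom function as a contiguous slice of the underlying numpy array.

import numpy as np
import pandas as pd

def apply_by_codes(keys, values, func):
    # toy sketch: dense integer codes per group, one argsort,
    # then each group is a contiguous slice of the sorted order
    codes, _ = pd.factorize(keys)
    order = np.argsort(codes, kind='stable')
    bounds = np.flatnonzero(np.diff(codes[order])) + 1
    out = np.empty(len(codes))
    for idx in np.split(order, bounds):
        out[idx] = func(values[idx])   # broadcast the per-group result back
    return out

The per-group function then only ever sees a plain numpy slice, which is what makes the sorted case in the timings below so much faster.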
My situation is that I have to maintain a deduplicated, processed dataset, say tgt, while the source data stays duplicated (and changes over time).
So I have a bridge-table redup that maintains a mapping of the indices (dedup_ind) in tgt back to their original indices (redup_ind) in the source. This is easy to calculate, and if any of the source records change, I need to recalculate the corresponding deduplicated record. Since this is expensive, I only want to do it where necessary.
So, after loading the saved previous bridge table and joining with the new one, I have a DataFrame along the lines of:
#            dedup_ind  dedup_ind_old
# redup_ind
# 0                  0            0.0
# 1                  0            0.0
# 2                  0            4.0
# 3                  1            1.0
# 4                  1            1.0
# 5                  2            2.0
# 6                  2            NaN
# 7                  3            3.0
# 8                  4            4.0
# 9                  4            4.0
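For reference, a toy frame like the one shown above can be built directly:

import numpy as np
import pandas as pd

# reconstruction of the example frame printed above
df = pd.DataFrame(
    {'dedup_ind':     [0, 0, 0, 1, 1, 2, 2,      3, 4, 4],
     'dedup_ind_old': [0, 0, 4, 1, 1, 2, np.nan, 3, 4, 4]},
    index=pd.Index(range(10), name='redup_ind'))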
Then, a simplified version of what I want to do is:
def update_dedup2redup(gr):
    # if any record involved in a deduplicate has changed, we recalculate the deduplicate;
    # this means we need all the involved redup indices!
    if (gr.dedup_ind != gr.dedup_ind_old).any():
        return pd.Series(True, index=gr.index, name='updated')
    return pd.Series(False, index=gr.index, name='updated')
df.groupby('dedup_ind').apply(update_dedup2redup).reset_index(level=0).updated
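On the toy frame above, this should give roughly the following (groups 0 and 2 each contain a changed row, so all of their members get flagged):

# redup_ind
# 0     True
# 1     True
# 2     True
# 3    False
# 4    False
# 5     True
# 6     True
# 7    False
# 8    False
# 9    False
# Name: updated, dtype: bool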
Since this was extremely slow, I ended up looking for something faster and found the Groupby class linked in the OP.
def update_dedup2redup_np(gr):
    # optimized version for use with the Groupby class
    if (gr[:, 0] != gr[:, 1]).any():
        return 1
    return 0

# same as above (but this is ~100-1000x faster)
Groupby(df.dedup_ind).apply(update_dedup2redup_np,
                            df[['dedup_ind', 'dedup_ind_old']].values,
                            broadcast=True).astype(bool)
Now, because I simplified my case so much, there actually exists a fully vectorised form, which is obviously the fastest, but that's not the point. One interesting difference, for example, is whether the data being grouped is already sorted.
import timeit
import numpy as np
import pandas as pd

N = 100000
ddi = np.random.randint(0, N // 2, (N,))
err = np.random.choice([-1, 0, 1], N, p=[0.1, 0.8, 0.1])
df = pd.DataFrame({'dedup_ind': ddi, 'dedup_ind_old': ddi + err},
                  index=pd.Index(range(N), name='redup_ind'))
tic = timeit.default_timer()
df.groupby('dedup_ind').apply(update_dedup2redup).reset_index(level=0)
toc = timeit.default_timer()
print(f'naive: {toc-tic:.2f} sec.')
tic = timeit.default_timer()
Groupby(df.dedup_ind).apply(update_dedup2redup_np, df[['dedup_ind', 'dedup_ind_old']].values, broadcast=True).astype(bool)
toc = timeit.default_timer()
print(f'Groupby (unsorted): {toc-tic:.2f} sec.')
df = df.sort_values('dedup_ind')
tic = timeit.default_timer()
Groupby(df.dedup_ind).apply(update_dedup2redup_np, df[['dedup_ind', 'dedup_ind_old']].values, broadcast=True).astype(bool)
toc = timeit.default_timer()
print(f'Groupby (sorted): {toc-tic:.2f} sec.')
tic = timeit.default_timer()
df.dedup_ind.isin(df.dedup_ind.loc[df.dedup_ind != df.dedup_ind_old])
toc = timeit.default_timer()
print(f'vectorised: {toc-tic:.2f} sec.')
# naive: 42.94 sec.
# Groupby (unsorted): 34.02 sec.
# Groupby (sorted): 0.39 sec.
# vectorised: 0.01 sec.
So my point is: can any of these tricks be used to accelerate custom groupbys? In particular, exploiting the case where the data is already sorted by the group key seems to offer large gains.
this is just a special case of an ordered group by - there is an issue somewhere about this
imho this is not that common a case
@jreback
How is this a special case? IMHO it's the most basic case: a groupby on a single column. I also tested the "naive" apply with pre-sorted data, and there was no time difference.
But even if it is a special case, a speed-up factor of >100 is not a small improvement.
you are welcome to try
Could you adapt the tags to reflect the situation, please? Cheers
@h-vetinari Many groupby ops require _not_ sorting the data, e.g. df.groupby(...).head(). While that puts a bit more onus on the user, it seems to me that's where the control should lie. Internally, groupby does use integer codes and not dictionary lookups.
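For completeness, those per-row integer codes are also reachable from user code, e.g. via the public GroupBy.ngroup() method, so a sorted-slices loop like the sketch further up could be built on top of them:

# one way to get per-row integer group codes from user code
codes = df.groupby('dedup_ind', sort=False).ngroup().values
# these can then drive an argsort + contiguous-slice loop, as sketched earlier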
I don't see any other ideas here for optimization, so closing this issue.