I have several groupby problems that crop up regularly at work**, and I almost always have to write custom functions that operate on individual groups - which are then very slow to execute.
Looking for solutions, I came across
http://esantorella.com/2016/06/16/groupby/
and the update
https://github.com/esantorella/hdfe/blob/master/groupby.py
I had to adapt my function a little to be compatible with numpy indexing instead of pandas indexing, but in one case I tested today, this sped up my code by more than a factor of 100, and that wasn't even for the full dataset. In fact (for ~5 million rows), the pandas-native version didn't finish within 2h, whereas the adaptation based on the code linked above ran in ~35 sec.
It would be very nice if such optimisations were taken care of directly by pandas, instead of having to work around them like this.
** one example among many: process data from a source, deduplicate based on a given signature, process further, keep the deduplicated record. This record needs to be recalculated any time one of its constituent records changes.
Not sure what you are asking for here. By definition an .apply will be slow, as it's essentially a python for loop.
Note that some applies can be written to be much more performant. Please give a specific example of what you think is slow.
@jreback The code I linked to is still a python for-loop, but it is much more efficient because it uses integers to represent the groups, and index slices if the groups are already ordered.
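For illustration, the core idea is roughly the following - a minimal sketch of my own (the helper name apply_by_codes is hypothetical, this is not the linked implementation): factorize the group keys into dense integer codes, argsort once, and then hand each group to the custom function as a contiguous slice of the underlying numpy array.

import numpy as np
import pandas as pd

def apply_by_codes(keys, values, func):
    # toy sketch: dense integer codes per group, one argsort,
    # then each group is a contiguous slice of the sorted order
    codes, _ = pd.factorize(keys)
    order = np.argsort(codes, kind='stable')
    bounds = np.flatnonzero(np.diff(codes[order])) + 1
    out = np.empty(len(codes))
    for idx in np.split(order, bounds):
        out[idx] = func(values[idx])   # broadcast the per-group result back
    return out

The per-group function then only ever sees a plain numpy slice, which is what makes the sorted case in the timings below so much faster.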
My situation is that I have to maintain a deduplicated, processed dataset, say tgt, while the source data stays duplicated (and changes over time).
So I have a bridge-table redup that maintains a mapping of the indices (dedup_ind) in tgt back to their original indices (redup_ind) in the source. This is easy to calculate, and if any of the source records change, I need to recalculate the corresponding deduplicated record. Since this is expensive, I only want to do it where necessary.
So, after loading the saved previous bridge table and joining with the new one, I have a DataFrame along the lines of:
#            dedup_ind  dedup_ind_old
# redup_ind
# 0                  0            0.0
# 1                  0            0.0
# 2                  0            4.0
# 3                  1            1.0
# 4                  1            1.0
# 5                  2            2.0
# 6                  2            NaN
# 7                  3            3.0
# 8                  4            4.0
# 9                  4            4.0
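For reference, a toy frame like the one shown above can be built directly:

import numpy as np
import pandas as pd

# reconstruction of the example frame printed above
df = pd.DataFrame(
    {'dedup_ind':     [0, 0, 0, 1, 1, 2, 2,      3, 4, 4],
     'dedup_ind_old': [0, 0, 4, 1, 1, 2, np.nan, 3, 4, 4]},
    index=pd.Index(range(10), name='redup_ind'))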
Then, a simplified version of what I want to do is:
def update_dedup2redup(gr):
    # if any record involved in a deduplicate has changed, we recalculate the deduplicate;
    # this means we need all the involved redup indices!
    if (gr.dedup_ind != gr.dedup_ind_old).any():
        return pd.Series(True, index=gr.index, name='updated')
    return pd.Series(False, index=gr.index, name='updated')
df.groupby('dedup_ind').apply(update_dedup2redup).reset_index(level=0).updated
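On the toy frame above, this should give roughly the following (groups 0 and 2 each contain a changed row, so all of their members get flagged):

# redup_ind
# 0     True
# 1     True
# 2     True
# 3    False
# 4    False
# 5     True
# 6     True
# 7    False
# 8    False
# 9    False
# Name: updated, dtype: bool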
Since this was extremely slow, I ended up looking for something faster and found the Groupby class linked in the OP.
def update_dedup2redup_np(gr):
    # optimized version for use with the Groupby class
    if (gr[:, 0] != gr[:, 1]).any():
        return 1
    return 0

# same as above (but this is ~100-1000x faster)
Groupby(df.dedup_ind).apply(update_dedup2redup_np,
                            df[['dedup_ind', 'dedup_ind_old']].values,
                            broadcast=True).astype(bool)
Now, because I simplified my case so much, there actually exists a fully vectorised form, which is obviously the fastest, but that's not the point. One interesting difference, for example, is whether the data being grouped is already sorted.
import timeit
import numpy as np
import pandas as pd

N = 100000
ddi = np.random.randint(0, N // 2, (N,))
err = np.random.choice([-1, 0, 1], N, p=[0.1, 0.8, 0.1])
df = pd.DataFrame({'dedup_ind': ddi, 'dedup_ind_old': ddi + err},
                  index=pd.Index(range(N), name='redup_ind'))
tic = timeit.default_timer()
df.groupby('dedup_ind').apply(update_dedup2redup).reset_index(level=0)
toc = timeit.default_timer()
print(f'naive: {toc-tic:.2f} sec.')
tic = timeit.default_timer()
Groupby(df.dedup_ind).apply(update_dedup2redup_np, df[['dedup_ind', 'dedup_ind_old']].values, broadcast=True).astype(bool)
toc = timeit.default_timer()
print(f'Groupby (unsorted): {toc-tic:.2f} sec.')
df = df.sort_values('dedup_ind')
tic = timeit.default_timer()
Groupby(df.dedup_ind).apply(update_dedup2redup_np, df[['dedup_ind', 'dedup_ind_old']].values, broadcast=True).astype(bool)
toc = timeit.default_timer()
print(f'Groupby (sorted): {toc-tic:.2f} sec.')
tic = timeit.default_timer()
df.dedup_ind.isin(df.dedup_ind.loc[df.dedup_ind != df.dedup_ind_old])
toc = timeit.default_timer()
print(f'vectorised: {toc-tic:.2f} sec.')
# naive: 42.94 sec.
# Groupby (unsorted): 34.02 sec.
# Groupby (sorted): 0.39 sec.
# vectorised: 0.01 sec.
So my point is: can any of these tricks be used to accelerate custom groupbys? In particular, exploiting the case where the data is already sorted by the group key seems to offer large gains.
this is just a special case of an ordered group by - there is an issue somewhere about this
imho this is not that common a case
@jreback
How is this a special case? IMHO it's the most basic case: a groupby on a single column. I also tested the "naive" apply with pre-sorted data, and there was no time difference.
But even if it is a special case, a speed-up factor of >100 is not a small improvement.
you are welcome to try
Could you adapt the tags to reflect the situation, please? Cheers
@h-vetinari Many groupby ops require _not_ sorting the data, e.g. df.groupby(...).head(). While that puts a bit more onus on the user, it seems to me that's where the control should lie. Internally, groupby does use integer codes and not dictionary lookups.
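For completeness, those per-row integer codes are also reachable from user code, e.g. via the public GroupBy.ngroup() method, so a sorted-slices loop like the sketch further up could be built on top of them:

# one way to get per-row integer group codes from user code
codes = df.groupby('dedup_ind', sort=False).ngroup().values
# these can then drive an argsort + contiguous-slice loop, as sketched earlier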
I don't see any other ideas here for optimization, so closing this issue.