@ml31415 and I have just created/updated an aggregation package which has multiple equivalent implementations: pure python, numpy, pandas, and scipy.weave. As shown on the readme, pandas is slower than a careful numpy implementation for most aggregation functions, and slower than scipy.weave by a fairly wide margin in all cases. The worst functions appear to be any and all. Note that the times shown include construction of the dataframe, although whatever overhead this incurs cannot be more than the minimum time shown (11ms for groupby.first()).
Surely pandas should be able to more or less match at least the numpy implementation, and maybe even the weave version (since it looks like pandas also gets down to low-level C via Cython)?
Note that although pandas may end up faster when performing multiple aggregations of the same grouping (e.g. first, last and max), the other implementations also have this potential (admittedly this optimization is yet to be implemented though).
You should be clear that you are NOT comparing apples to apples
In [10]: np.random.seed(1234)
In [11]: df = DataFrame({'A' : np.random.randint(0,1000,size=500000),'B' : np.random.randn(500000) })
In [12]: g = df.groupby('A').B
In [14]: %timeit g.sum()
100 loops, best of 3: 2 ms per loop
In [15]: %timeit g.mean()
100 loops, best of 3: 2.06 ms per loop
In [16]: %timeit g.max()
100 loops, best of 3: 2.12 ms per loop
I already alluded to the apples-to-apples problem. Despite that, there are two clear perf issues.
Firstly, all and any give very poor performance:
In: %timeit g.max()
1000 loops, best of 3: 1.97 ms per loop # for comparison on my computer
In: %timeit g.any()
10 loops, best of 3: 104 ms per loop
In: %timeit g.all()
10 loops, best of 3: 104 ms per loop
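For reference, a grouped any/all for this flat-integer-label case can be written as a couple of np.bincount calls, which avoids the slow path entirely. A minimal sketch (the labels and values arrays are made up for illustration):

```python
import numpy as np

# Made-up example: integer group labels and a value column
labels = np.array([0, 0, 1, 1, 2])
values = np.array([1.0, 0.0, 2.0, 3.0, 0.0])

counts = np.bincount(labels)                          # size of each group
nonzero = np.bincount(labels, weights=(values != 0))  # truthy entries per group

grouped_any = nonzero > 0        # like g.any(): at least one truthy value
grouped_all = nonzero == counts  # like g.all(): every value truthy
```

This is roughly the kind of single-pass approach a numpy implementation can use, with no per-group Python-level iteration.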
Secondly, the first time you execute anything on a groupby instance you get a large overhead:
In: g = df.groupby('A').B
In: %timeit g.sum()
The slowest run took 10.67 times longer than the fastest.
This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.9 ms per loop
presumably that large overhead is due to a sort, which would be helpful in some cases, but often is unnecessary.
.all & .any are not currently implemented in cython, so they fall back to a slower path. I guess they could be implemented, but IMHO not very useful on anything but bool columns.
Of course the initial creation of a groupby object is important (and cached).
Further, numpy has a hard time dealing with lots of edge cases and dtypes. That's why pandas exists in the first place.
all and any are common grouping operations, e.g. just to support some grouped allnan or anynan functionality. Naturally they interpret their inputs as bool, but it may also make sense to run them on any int or float column. The decision whether this use case is frequent enough to justify some tweaking there is of course up to you.
As for the apples-to-apples story: I haven't looked into the pandas code details, but I suppose what happens there is that some sort index is created and cached. For benchmarking it seems fair to me to include this index building in the benchmark. Whenever the data changes (NAs thrown out, data added or removed), the index needs to be recreated. And of course it has to be created on any first run, or when the aggregation takes place over different columns. So it's not that uncommon a case.
Given the caching, this performance issue may hit only some pandas users, depending on their use case. It may be fair enough to say the current implementation is fast enough. But even if the creation of the sort index is not taken into account, pandas is not faster than the aggregate C implementation, which does not use any precached index.
The sorting is the main time waster for the grouping operations, and it can be avoided for many functions by using indirection when accessing the data. If the grouping operations are built accordingly, the sorting is pure overhead. I learned this the hard way when I tried to match MATLAB's accumarray speeds with a similar sort-based approach. Whatever I tried, the sort alone was already slower than the whole grouping operation in MATLAB. To stress this point: even when the pre-sorting is not taken into account, pandas is not faster in the benchmarking, just about comparable.
you are missing the point, .all and .any are reducing operations, NOT aggregating operations in this case. Of course you GROUP by them, rarely would you actually AGGREGATE by them.
are your numpy routines also calculating the indexer?
we don't pre-cache anything. The point, however, is that measuring a SINGLE aggregate operation is generally not that useful. You often use these group indexers many times, e.g.
g = df.groupby(....)
g.max()-g.min()
et cetera.
The numpy implementations use np.bincount or fancy indexing wherever possible. This is also not building any kind of sorted index.
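As a minimal sketch of that approach (reusing the group_idx and a arrays from the usage example quoted later in the thread), a grouped sum is a single scatter pass with no sorting:

```python
import numpy as np

group_idx = np.array([3, 0, 0, 1, 0, 3, 5, 5, 0, 4])
a = np.array([13.2, 3.5, 3.5, -8.2, 3.0, 13.4, 99.2, -7.1, 0.0, 53.7])

# Each value is scattered straight into its group's slot:
# no sort, no factorize step, no index construction.
sums = np.bincount(group_idx, weights=a)
```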
Precaching was maybe ambiguous, but I guess we both know what I was talking about: The caching of a sort index on the first groupby instruction.
I agree that groupby operations often come in sequences. Nevertheless, the sorting remains overhead, and the fewer subsequent grouped operations there are, the bigger this overhead. For the worst case, a single grouped instruction as in the benchmark, it looks like the overhead adds up to about a 10-fold execution time compared to an unsorted algorithm. And even worse: despite this heavy initial work, it never gets faster than the unsorted algorithm.
FYI, you can simply pass sort=False to not sort.
Hmm, I hadn't spotted that, although simply doing:
g = df.groupby('A',sort=False).B
%timeit g.sum()
doesn't seem to help; you still get the same message: "The slowest run took 9.59 times longer than the fastest. This could mean that an intermediate result is being cached." Have I done it wrong?
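One way to see the first-call overhead directly, instead of relying on %timeit's warning, is to time the calls by hand. A rough sketch (absolute numbers will of course vary by machine):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.randint(0, 1000, size=500000),
                   'B': np.random.randn(500000)})
g = df.groupby('A', sort=False)['B']

t0 = time.perf_counter()
g.sum()                      # first call on this groupby object
t1 = time.perf_counter()
g.sum()                      # repeat call on the same object
t2 = time.perf_counter()

first, repeat = t1 - t0, t2 - t1
```

If the first call is consistently slower than the repeats, the difference is the one-off cost of building the group codes.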
I am unclear what your goal is here. pandas creates a groupby object which has nice properties for returning objects with correct indices, handling multiple dtypes, many different types of input objects, and so on.
If you would like to dig into the actual groupby implementation, that would be ok. On an apples-to-apples basis I am not sure what your point is, though (about sorting or otherwise).
You are trying to demonstrate a very very narrow point which IMHO doesn't have much utility.
Yes, pandas is good at all those things, and that's its main job, but I kind of hoped that it would also be getting close to optimal performance, which appears not to be the case here.
Maybe perf is not really that important to most people for the groupby machinery, but I thought it was worth pointing out that there is room for improvement - even if only for some subset of cases.
@d1manson you are still missing the point. You are not comparing apples to apples, so how does it matter if something is faster than something that IS NOT COMPARABLE???
The point is not about sorting, it's about the time it consumes, even though it's unnecessary in many cases. The sort flag seems to be of no help here, as the sort probably has to happen anyway when the algorithms expect sorted input. If you're totally happy with the current groupby solution, we're totally fine with that.
To me it just looks like there is some major room for speed improvements. Even if you run 3 grouped operations in a sequence, they're still several times (roughly (10+3) / 3 times) slower compared to three grouped operations with an algorithm not requiring sorting. How many operations do you want to chain for an apples-to-apples comparison? Overhead is overhead.
Again, if you're happy with the current solution, just let it be then. It's surely not an urgent issue.
You are passing an indexer to the numpy code??? how is that apples-to-apples?
Where do we pass an indexer to the numpy code?
why don't you post actual code here
Because the link to all the code is already given in the first post.
So this is your code?
import numpy as np
from aggregate import aggregate
group_idx = np.array([3,0,0,1,0,3,5,5,0,4])
a = np.array([13.2,3.5,3.5,-8.2,3.0,13.4,99.2,-7.1,0.0,53.7])
aggregate(group_idx, a, func='sum', fill_value=0) # see below for further examples
So you are pre-computing the group_idx, which takes ALL the time. Hence this IS not comparable.
group_idx is not computing anything, it's given. Just like the column you're grouping by.
I am all for performance comparisons and improvements and such. But it IS extremely important to profile and make sure you are actually comparing things properly, under different scenarios. E.g. whether you have 2000 or 2 groups makes a lot of difference.
Since pandas does extra work to compute an indexer, this is a very important point. This is what takes almost ALL of the time.
To me it just looks like there is some major room for speed improvements. Even if you run 3 grouped operations in a sequence, they're still several times (roughly (10+3) / 3 times) slower compared to three
This statement is not backed up by ANY evidence.
See our benchmarking or run the benchmarks yourself. I modified the benchmarks to reuse the sort index. When the index is present, the C implementation and pandas are about comparable. Doing the math, you get the above estimate of the overhead.
In case I have not been clear
In [165]: N = 500000
In [166]: df = DataFrame({'key' : np.random.randint(0,1000,size=N), 'value' : np.random.randn(N) })
In [169]: %timeit pd.factorize(df['key'])[0]
100 loops, best of 3: 7.57 ms per loop
In [170]: %timeit df.groupby('key').value.sum()
100 loops, best of 3: 10.9 ms per loop
In my example, this takes 3.5 ms to actually compute the sum, NOT 10.9. 7.6 ms is spent computing what you are calling the group_idx.
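That split can be made explicit: factorize produces the integer codes (the expensive step), and the reduction over those codes is comparatively cheap. A small sketch with made-up data, where a bincount stands in for the cython reduction:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': [2, 1, 1, 2, 0],
                   'value': [1.0, 2.0, 3.0, 4.0, 10.0]})

# Step 1 (expensive): map arbitrary keys to integer codes
codes, uniques = pd.factorize(df['key'])

# Step 2 (cheap): the actual grouped sum over those codes
sums = np.bincount(codes, weights=df['value'].values)
result = pd.Series(sums, index=uniques)
```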
You _rarely_ actually already have a group indexer done for you.
Yes, that's true. I didn't get what you meant initially; I knew there was overhead somewhere, but it took a while to realize exactly why it has to be there.
Although, why is it still true if you use the index for grouping [edit: using int64index copy=False]:
In: df = pd.DataFrame({'B' : np.random.randn(500000) },
index=pd.Int64Index(np.random.randint(0,1000,size=500000),copy=False))
In: df.index
Out: Int64Index([755, 940, ...], dtype='int64')
# index already exists at this point, so there should be no need to sort
In: g = df.groupby(level=0)
In: %timeit g.sum()
paraphrasing: "~9x slower on first run, 2ms best"
In: g = df.groupby(level=0, sort=False)
In: %timeit g.sum()
paraphrasing: "~7x slower on first run, 2ms best"
Ok, so interestingly the sort arg is doing something here, but it's still not optimal
It is not set up to take an indexer directly ATM, because you can simply use a groupby object; there is never a need for this. It would be a trivial change to allow an already-factorized grouper if that was really wanted.
I don't get what you mean by "you can simply use a groupby object".
So are you saying that whenever you use groupby with a level rather than a column, you get an unnecessary factorize operation which results in a ~7-9x perf penalty? If so, then I feel like this issue is a reasonable one to raise.
Note that the aggregate function we have implemented supports multi-dimensional indexing, which is equivalent here to grouping by multiple levels, and it's dealt with using numpy.ravel_multi_index...although this may not be quite so simple for pandas.
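For completeness, that multi-level trick looks roughly like this (a sketch with made-up two-level labels, each level having 2 distinct values):

```python
import numpy as np

level0 = np.array([0, 0, 1, 1, 0])   # first grouping level
level1 = np.array([0, 1, 0, 1, 1])   # second grouping level
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Collapse the two levels into one flat group index...
flat = np.ravel_multi_index((level0, level1), dims=(2, 2))

# ...then a single bincount aggregates over both levels at once
sums = np.bincount(flat, weights=vals, minlength=4).reshape(2, 2)
```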
no, what I mean is that I have never seen an actual indexer passed, you are passing a column or a level or whatever
In [177]: df = DataFrame({'key' : np.random.randint(0,3,size=10), 'value' : np.random.randn(10) })
In [178]: df
Out[178]:
key value
0 2 1.541759
1 1 -1.312925
2 1 -1.108687
3 2 0.208367
4 0 0.436229
5 2 -0.404643
6 2 -0.125485
7 1 -0.422631
8 1 0.593898
9 2 1.650386
In [179]: g = df.groupby('key')
In [180]: g.groups
Out[180]: {0: [4], 1: [1, 2, 7, 8], 2: [0, 3, 5, 6, 9]}
I will reiterate you rarely have an actual indexer even in a column. E.g. say you are grouping on a date or an object or whatever. (and don't even get started when you have multiple-indexers). For an actual conforming int64 index you can prob take some shortcuts I suppose.
but groupby accepts level as an argument, so surely there is an expectation that people do want to group by things that already have an index?
perhaps I am missing something: at what point is the indexer actually built/cached? In my example I had the comment # index already exists at this point, so there should be no need to sort, but perhaps the indexer wasn't cached so had to be recreated for the grouping operation?
Well, I want to have actual group labels for what I am doing, which are not necessarily (in fact rarely) equal to the actual values that I am grouping by. In numpy I suppose this doesn't matter at all. But I care about the labels of the resultant index. Doing simple numpy operations is pretty trivial. Doing the same operations while propagating labels is not. That's why the groupby objects exist.
So yes, you are missing the difference between all of the work a user has to do to have proper label mapping in numpy vs. a nice API with pandas code that is very close performance-wise (and handles dtypes, nan, etc.).
I certainly welcome performance comparisons and criticism, however, again I will reiterate that you are not providing the same API so you are not comparing the same things.
Have a look at this issue, where a similar topic was discussed https://github.com/pydata/pandas/issues/8881
I'm still confused. I get the benefit of pandas as I use it often in roughly the way it is intended. But the particular point we are discussing here is groupby using levels, where all the index values and labels already exist. So all I'm saying is that there's no need to recompute them, if you care about perf.