**Is your feature request related to a problem? Please describe.**
As a user, I'd like to be able to get the last non-null value of each group for every non-grouping column in a groupby object.
**Describe the solution you'd like**
I'd like to be able to do the equivalent of the following pandas code:
```python
import pandas as pd

ser1 = [1, 1, 2, 2, 33, 33]
ser2 = [1, 1, 2, 19, 4, None]
pdf = pd.DataFrame({'a': ser1, 'b': ser2})
print(pdf)
last_row = pdf.groupby(['a']).last()
last_row
```

The printed input frame:

```
    a     b
0   1   1.0
1   1   1.0
2   2   2.0
3   2  19.0
4  33   4.0
5  33   NaN
```

And the result of `last_row`:

```
       b
a
1    1.0
2   19.0
33   4.0
```
**Describe alternatives you've considered**
I don't think there are any simple alternatives at the cudf level for doing this on groupby objects.
**Additional context**
This will help us downstream with dask-cudf for the cumulative operations (cumsum, max, min, product).
In pandas, it appears to be coming from here.
I believe this could only work for sort-based groupbys, as a hash-based groupby would have non-useful and/or non-deterministic ordering.
@kkraus14 I'm also working on a workflow using dask-cudf that utilizes the `first` and `last` methods. If this may not be possible with a hash-based groupby, any suggestions on how we could use the sort-based groupby with Dask's groupby and then utilize these aggregations?
Perhaps this could now be implemented in the Python layer as a wrapper around `groupby.nth`, with some care around how nulls are handled?

```python
df.groupby(col).nth(n=-1)
```
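One subtlety such a wrapper would need to handle (a hedged pandas sketch, not cudf behavior): pandas `last()` returns the last *non-null* value per group, while `nth(-1)` is purely positional and keeps nulls, so the two differ whenever a group ends in a null:

```python
import pandas as pd

# Hypothetical data: group 1's positionally-last 'b' value is null.
pdf = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, None, 2.0, 3.0]})

# last() skips nulls within each group: group 1 -> 1.0, group 2 -> 3.0.
print(pdf.groupby('a').last())

# nth(-1) is purely positional, so group 1's result is null.
print(pdf.groupby('a').nth(-1))
```

Note that the exact return shape of `nth` has varied across pandas versions; this is only meant to illustrate the null-handling difference.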
Even in `nth` I believe we make no guarantee on the order of the elements within a group. So say you had an input table:
| a | b |
| - | - |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
And ran `df.groupby(['a']).nth(0)`, it can effectively return any value of `b`.
For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?
> For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?
A sort-based groupby only sorts by the grouping key, which is `a` in the above example. Within each group the order is either non-deterministic (unstable sort) or an arbitrary order (stable sort), but is neither sorted nor order-maintaining.
@devavret @karthikeyann would a groupby `nth` with `n == -1` work for a Pandas groupby `last`?
> @devavret @karthikeyann would a groupby `nth` with `n == -1` work for a Pandas groupby `last`?

> For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?

> A sort-based groupby only sorts by the grouping key, which is `a` in the above example. Within each group the order is either non-deterministic (unstable sort) or an arbitrary order (stable sort), but is neither sorted nor order-maintaining.
Actually, due to the very nature of aggs like `nth` and `quantiles`, sort-groupby has the ability to sort values within the groups as well.
However, this does not seem to be in use by `nth` right now.
To enable this, it should use `get_sorted_values()` like `quantiles` does: https://github.com/rapidsai/cudf/blob/66a77e35f517d9b52f26a688ee7ef7b9d4693c66/cpp/src/groupby/sort/groupby.cu#L324
I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.
```python
In [2]: import pandas as pd

In [3]: s = pd.Series([1, 1, 1, 2, 2])

In [4]: v = pd.Series([5, 1, 2, 4, 3])

In [6]: g = v.groupby(s)

In [7]: g.last()
Out[7]:
1    2
2    3
dtype: int64
```
> I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.
They guarantee the order within groups is the same order of rows from the input table, which I imagine is an extremely expensive / difficult guarantee for us to make.
> I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.

> They guarantee the order within groups is the same order of rows from the input table, which I imagine is an extremely expensive / difficult guarantee for us to make.
Actually, code-change wise, we can ensure that by substituting `sorted_order` with `stable_sorted_order` in the key sorting code. In an old benchmark I ran, I found that the performance difference wasn't too much.
Is `last` a fairly common operation? I'd look into making it with hash groupby as it seems possible.
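The stable-sort idea above can be sketched in plain NumPy (an illustration, not the libcudf implementation; null handling is omitted for brevity): a stable argsort on the keys preserves each group's input order, so the last element of every run of equal keys is that group's last value.

```python
import numpy as np

keys = np.array([1, 1, 1, 2, 2])
vals = np.array([5, 1, 2, 4, 3])

# Stable sort on keys: rows within each group keep their input order.
order = np.argsort(keys, kind='stable')
sorted_keys = keys[order]
sorted_vals = vals[order]

# The last index of every run of equal keys is that group's last row.
last_idx = np.r_[np.nonzero(np.diff(sorted_keys))[0], len(sorted_keys) - 1]
print(dict(zip(sorted_keys[last_idx].tolist(), sorted_vals[last_idx].tolist())))
# -> {1: 2, 2: 3}, matching pandas' g.last() in the example above.
# With an unstable sort, within-group order (and hence "last") is not guaranteed.
```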
> Actually, code-change wise, we can ensure that by substituting `sorted_order` with `stable_sorted_order` in the key sorting code. In an old benchmark I ran, I found that the performance difference wasn't too much.
It's not about key order, it's the values order. For example:
| a | b |
| - | - |
| 1 | 3 |
| 1 | 1 |
| 1 | 4 |
| 1 | 5 |
| 1 | 2 |
We'd need `df.groupby(['a']).last()` to return `2` for `b` and `df.groupby(['a']).first()` to return `3` for `b`.
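For reference, pandas on that exact table behaves this way (here `first`/`last` also skip nulls, though none appear in this data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 1], 'b': [3, 1, 4, 5, 2]})
print(df.groupby(['a']).first())  # b == 3 for group a == 1
print(df.groupby(['a']).last())   # b == 2 for group a == 1
```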
Yes. I understand. And I think this is possible.
Awesome discussion in this thread. Just to add some additional context, this feature request originally came from the fact that Dask needed `groupby.last` for non-groupby cumulative aggregation operations.
We instead opted to refactor the cumulative aggregation implementation in https://github.com/dask/dask/pull/4736 .