**Is your feature request related to a problem? Please describe.**
As a user, I'd like to be able to get the last non-null value of each group for every non-grouping column in a groupby object.
**Describe the solution you'd like**
I'd like to be able to do the equivalent of the following pandas code:
```python
import pandas as pd

ser1 = [1, 1, 2, 2, 33, 33]
ser2 = [1, 1, 2, 19, 4, None]
pdf = pd.DataFrame({'a': ser1, 'b': ser2})
print(pdf)
last_row = pdf.groupby(['a']).last()
last_row
```

The printed input frame:

```
    a     b
0   1   1.0
1   1   1.0
2   2   2.0
3   2  19.0
4  33   4.0
5  33   NaN
```

And the result of `last_row`:

```
       b
a
1    1.0
2   19.0
33   4.0
```
**Describe alternatives you've considered**
I don't think there are any simple alternatives at the cudf level for doing this on groupby objects.
**Additional context**
This will help us downstream with dask-cudf for the cumulative operations (cumsum, max, min, product).
In pandas, it appears to be coming from here.
I believe this could only work for sort-based groupbys, as a hash-based groupby would have non-useful and/or non-deterministic ordering.
@kkraus14 I'm also working on a workflow using dask-cudf that utilizes the `first` and `last` methods. If this may not be possible with a hash-based groupby, any suggestions on how we could use the sort-based groupby with Dask's groupby and then utilize these aggregations?
Perhaps this could now be implemented in the Python layer as a wrapper around `groupby.nth`, with some care around how nulls are handled?

```python
df.groupby(col).nth(n=-1)
```
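One subtlety such a wrapper would need to handle (a hedged pandas sketch, not cudf behavior): pandas `last()` returns the last *non-null* value per group, while `nth(-1)` is purely positional and keeps nulls, so the two differ whenever a group ends in a null:

```python
import pandas as pd

# Hypothetical data: group 1's positionally-last 'b' value is null.
pdf = pd.DataFrame({'a': [1, 1, 2, 2], 'b': [1.0, None, 2.0, 3.0]})

# last() skips nulls within each group: group 1 -> 1.0, group 2 -> 3.0.
print(pdf.groupby('a').last())

# nth(-1) is purely positional, so group 1's result is null.
print(pdf.groupby('a').nth(-1))
```

Note that the exact return shape of `nth` has varied across pandas versions; this is only meant to illustrate the null-handling difference.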
Even in `nth` I believe we make no guarantee on the order of the elements within a group. So say you had an input table:
| a | b |
| - | - |
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
And ran `df.groupby(['a']).nth(0)`, it can effectively return any value of `b`.
For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?
> For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?
A sort-based groupby only sorts by the grouping key, which is `a` in the above example. Within each group the order is either non-deterministic (unstable sort) or an arbitrary order (stable sort), but is neither sorted nor order-maintaining.
@devavret @karthikeyann would a groupby `nth` with `n == -1` work for a Pandas groupby `last`?
> @devavret @karthikeyann would a groupby `nth` with `n == -1` work for a Pandas groupby `last`?

> For sort-based groupbys this would be possible, though, right? Do you think it's worth including such an approach in the API surface if it would work for sort- but not hash-based groupbys?

> A sort-based groupby only sorts by the grouping key, which is `a` in the above example. Within each group the order is either non-deterministic (unstable sort) or an arbitrary order (stable sort), but is neither sorted nor order-maintaining.
Actually, due to the very nature of aggs like `nth` and `quantiles`, sort-groupby has the ability to sort values within the groups as well.
However, this does not seem to be in use by `nth` right now.
To enable this, it should use `get_sorted_values()` like `quantiles` does: https://github.com/rapidsai/cudf/blob/66a77e35f517d9b52f26a688ee7ef7b9d4693c66/cpp/src/groupby/sort/groupby.cu#L324
I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.
```python
In [2]: import pandas as pd

In [3]: s = pd.Series([1, 1, 1, 2, 2])

In [4]: v = pd.Series([5, 1, 2, 4, 3])

In [6]: g = v.groupby(s)

In [7]: g.last()
Out[7]:
1    2
2    3
dtype: int64
```
> I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.
They guarantee the order within groups is the same order of rows from the input table, which I imagine is an extremely expensive / difficult guarantee for us to make.
> I presume it wasn't done this way because pandas' `Groupby.last` uses stable ordering rather than sorting.

> They guarantee the order within groups is the same order of rows from the input table, which I imagine is an extremely expensive / difficult guarantee for us to make.
Actually, code-change wise, we can ensure that by substituting `sorted_order` with `stable_sorted_order` in the key sorting code. In an old benchmark I ran, I found that the performance difference wasn't too much.
Is `last` a fairly common operation? I'd look into making it with hash groupby as it seems possible.
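The stable-sort idea above can be sketched in plain NumPy (an illustration, not the libcudf implementation; null handling is omitted for brevity): a stable argsort on the keys preserves each group's input order, so the last element of every run of equal keys is that group's last value.

```python
import numpy as np

keys = np.array([1, 1, 1, 2, 2])
vals = np.array([5, 1, 2, 4, 3])

# Stable sort on keys: rows within each group keep their input order.
order = np.argsort(keys, kind='stable')
sorted_keys = keys[order]
sorted_vals = vals[order]

# The last index of every run of equal keys is that group's last row.
last_idx = np.r_[np.nonzero(np.diff(sorted_keys))[0], len(sorted_keys) - 1]
print(dict(zip(sorted_keys[last_idx].tolist(), sorted_vals[last_idx].tolist())))
# -> {1: 2, 2: 3}, matching pandas' g.last() in the example above.
# With an unstable sort, within-group order (and hence "last") is not guaranteed.
```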
> Actually, code-change wise, we can ensure that by substituting `sorted_order` with `stable_sorted_order` in the key sorting code. In an old benchmark I ran, I found that the performance difference wasn't too much.
It's not about key order, it's the values order. For example:
| a | b |
| - | - |
| 1 | 3 |
| 1 | 1 |
| 1 | 4 |
| 1 | 5 |
| 1 | 2 |
We'd need `df.groupby(['a']).last()` to return `2` for `b` and `df.groupby(['a']).first()` to return `3` for `b`.
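For reference, pandas on that exact table behaves this way (here `first`/`last` also skip nulls, though none appear in this data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 1, 1], 'b': [3, 1, 4, 5, 2]})
print(df.groupby(['a']).first())  # b == 3 for group a == 1
print(df.groupby(['a']).last())   # b == 2 for group a == 1
```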
Yes. I understand. And I think this is possible.
Awesome discussion in this thread. Just to add some additional context, this feature request originally came from the fact that Dask needed `groupby.last` for non-groupby cumulative aggregation operations.
We instead opted to refactor the cumulative aggregation implementation in https://github.com/dask/dask/pull/4736 .