Cudf: Allow User access to groupby groups like pandas

Created on 14 Oct 2019 · 20 comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.
Would like to be able to access individual groups from groupby operation.

Pandas:
grouped_df = df.groupby('some_value')
print(grouped_df.groups.keys())

Describe the solution you'd like
Would like a solution similar to pandas method illustrated above.

Describe alternatives you've considered
Tried using cuDF's groupby, but could not access the keys for each group.
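For reference, a minimal sketch of the pandas behaviour this request asks cuDF to mirror (pandas used here only because the cuDF equivalent is what's being requested):

```python
import pandas as pd

# Example data; column names here are illustrative.
df = pd.DataFrame({"some_value": ["x", "x", "y"], "n": [1, 2, 3]})
grouped_df = df.groupby("some_value")

# .groups maps each group key to the row labels belonging to that group.
keys = list(grouped_df.groups.keys())   # the group keys, e.g. ['x', 'y']

# .get_group() materialises a single group as a DataFrame.
x_group = grouped_df.get_group("x")     # rows where some_value == 'x'
```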

Labels: cuDF (Python), feature request, libcudf

Most helpful comment

Just to clarify: accepting functions that operate on one group as one DataFrame would mean that we would never be able to optimize it in the future. Pandas' documentation, while warning against using apply() for performance reasons, recommends transform(), which accepts Series-level functions. And while Pandas doesn't mention a performance problem with that, we still have the same issue of never being able to optimize it. Therefore, in the docs, would you kindly advise against generic functions in transform() in favour of named ones? By named functions, I mean like this: df.groupby('A')["B"].transform('sum')

All 20 comments

@jrhemstad @devavret @kkraus14

As part of the refactor, do we want to support this feature? To be clear it enables something like this:

In [17]: a                                                                                                                                                                                    
Out[17]: 
   a  b  c
0  1  1  1
1  1  1  2
2  2  1  3
3  1  2  4
4  3  3  5

In [18]: for name, group in a.groupby(['a', 'b']): 
    ...:     print('') 
    ...:     print(group) 
   a  b  c
0  1  1  1
1  1  1  2   

   a  b  c
3  1  2  4   

   a  b  c
2  2  1  3  

   a  b  c
4  3  3  5

Pros

  • We can drop legacy_groupby entirely -- it currently supports iterating over groups.
  • We can potentially support a large number of groupby operations that may not be supported by libcudf in the short term, e.g., groupby.transform(), groupby.filter(), groupby.rolling().

Cons

  • It's going to be extremely slow for a large number of small groups. It can even be slow for a small number of not-so-large groups.

Personally, I think it's fine to support operations that can be fast for a small number of important cases, and slow (even very slow) for others. Importantly, we'll have tests ready for whenever libcudf implementations become available.

I'm a strong +1 for implementing this. Especially when we enable CUDA streams support, either directly or via per-thread default streams, we can also use a CPU / Python thread per group to overlap kernel execution under the hood.

I can't comment on whether we should or not. (also, waiting for @jrhemstad to publicly weigh in on this)

But to achieve this, we only need to expose group_offsets(). Then, using slice(), we can get all the table_views. One caveat though: libcudf groupby() does not store values. It will only give you slices of keys. So it will be the Python side's task to call these methods to get sliced frames that include column c.
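A rough sketch of what that Python-side slicing could look like. This is hypothetical: NumPy arrays stand in for libcudf columns, and the offsets computation stands in for what group_offsets() would return on an already-grouped table.

```python
import numpy as np

keys = np.array([1, 1, 1, 2, 3])          # grouped (sorted) key column
values = np.array([10, 20, 30, 40, 50])   # payload column carried alongside

# Stand-in for group_offsets(): boundaries where the key changes,
# plus both ends of the table.
offsets = np.concatenate(
    ([0], np.flatnonzero(np.diff(keys)) + 1, [len(keys)])
)

# Stand-in for slice(): zero-copy views per group
# (table_views on the libcudf side).
groups = [(keys[s], values[s:e]) for s, e in zip(offsets[:-1], offsets[1:])]
```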

A slightly more optimized (but not zero-copy) method: if you want per-row group memberships as an index vector, I can produce that with the hash map. Then you'd have to copy_if for each group ID into a new table.

This has the potential to be very dangerous. It enables users to do things that can be very slow and then wonder why it's so slow. Consider a groupby that produces 10,000 groups of size 10 and then tries to do some cuDF operations on each group. That's going to be dreadful.

Consider that it may be better to "encourage"/force users to use APIs to express their problem in ways that we can actually parallelize.

We can potentially support a large number of groupby operations that may not be supported by libcudf in the short term, e.g., groupby.transform(), groupby.filter(), groupby.rolling()

If we enable everything all at once (but very slowly), then we may not get concrete feature requests that we can spend the time to create proper implementations for. Instead, a user might see that a particular feature is supported, but extremely slow, and then think the library is slow.

I think it's better to wait until we have proper support for these features where we actually have a chance of being somewhat performant (groupby.rolling comes to mind).

Especially when we enable CUDA streams support, either directly or via per-thread default streams, we can also use a CPU / Python thread per group to overlap kernel execution under the hood.

Streams aren't going to save you if someone tries to do operations on 10,000 groups all of size 10.

a user might see that a particular feature is supported, but extremely slow, and then think the library is slow

How about we think about the operations we want to support and filter out all those we think we can never support in libcudf natively. Meaning, assume libcudf groupby.rolling won't be implemented for a while but we know right now that we can support it and it will eventually be performant.

So then we don't publicly expose groups right now, but do it internally for python team to be able to build features quickly, and then optimize later?

If we enable everything all at once (but very slowly), then we may not get concrete feature requests that we can spend the time to create proper implementations for. Instead, a user might see that a particular feature is supported, but extremely slow, and then think the library is slow.

  1. We can still be fast for the small number of large groups case. It will likely be faster than doing those operations on the host and then copying back to the device, which is the only option users might have in some cases.

  2. We should be explicit about what is slow and what is fast and why in our documentation. That will help users make better decisions about how to work with cuDF efficiently.

This has the potential to be very dangerous. It enables users to do things that can be very slow and then wonder why it's so slow. Consider a groupby that produces 10,000 groups of size 10 and then tries to do some cuDF operations on each group. That's going to be dreadful.

Yeah, I understand, and we can have some heuristics around when we should throw loud warnings for this. I.e., if you have a large number of groups, it can trigger a relatively inexpensive check to calculate the size of each group and then throw a warning if the min / max / mean / etc. size of the groups is smaller than some threshold.

Until we have streams we could also just always warn in general that hey this is going to execute serially per group.
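The heuristic described above could look something like this sketch. The thresholds and the function name are assumptions, not anything cuDF ships; they would need tuning empirically.

```python
import warnings
import numpy as np

# Assumed cut-offs, purely illustrative.
MANY_GROUPS_THRESHOLD = 100    # "large number of groups"
SMALL_GROUP_THRESHOLD = 1000   # "groups smaller than some threshold"

def warn_if_many_small_groups(group_sizes):
    """Cheap check before iterating groups serially: warn when there
    are many groups and their mean size is small."""
    sizes = np.asarray(group_sizes)
    if len(sizes) > MANY_GROUPS_THRESHOLD and sizes.mean() < SMALL_GROUP_THRESHOLD:
        warnings.warn(
            "Iterating over {} groups of mean size {:.0f}; each group is "
            "processed serially, which may be very slow.".format(
                len(sizes), sizes.mean()
            )
        )
```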

If we enable everything all at once (but very slowly), then we may not get concrete feature requests that we can spend the time to create proper implementations for. Instead, a user might see that a particular feature is supported, but extremely slow, and then think the library is slow.

I think it's better to wait until we have proper support for these features where we actually have a chance of being somewhat performant (groupby.rolling comes to mind).

We will 100% get issues surrounding performance, especially from @randerzander's team 😄. Documentation will also help to set expectations here.

Streams aren't going to save you if someone tries to do operations on 10,000 groups all of size 10.

Yup, I totally get it. This is also an area users know to avoid in Pandas today, because it slows down there as well.

Maybe I'm just tilting at windmills here, but I see this situation as very similar to the multitude of issues we've already seen about people who claim "groupby is broken" because they get different floating point results from one invocation to the next.

Groupby obviously isn't broken, they're just used to implicit behavior in Pandas because it's an inherently sequential library. It's better to force them to change their algorithm to be tolerant to floating point imprecision than to force libcudf to try and have 100% reproducible results at the cost of performance.

In the same way, consider that it may be better to force users to express their algorithms in a way that we can actually parallelize.

In the same way, consider that it may be better to force users to express their algorithms in a way that we can actually parallelize.

The problem is they can't express their algorithms currently. There may be some opportunity for parallelism within each group, it may not be the best but it's better than just blocking the functionality entirely.

In the same way, consider that it may be better to force users to express their algorithms in a way that we can actually parallelize.

The problem is they can't express their algorithms currently. There may be some opportunity for parallelism within each group, it may not be the best but it's better than just blocking the functionality entirely.

All of the examples @shwina gave can be parallelized, but not if the user writes their algorithm where they're getting each group and then operating on it directly.

groupby.transform(), groupby.filter(), groupby.rolling()

All of the examples @shwina gave can be parallelized, but not if the user writes their algorithm where they're getting each group and then operating on it directly.

I'd like to repeat my stance which might be a compromise.

So then we don't publicly expose groups right now, but do it internally for python team to be able to build features quickly, and then optimize later?

So then we don't publicly expose groups right now, but do it internally for python team to be able to build features quickly, and then optimize later?

If a user writes an algorithm that gets the groups and iterates over them instead of calling a cuDF primitive that does the same thing (but in parallel), we can't optimize that. They would have to re-write their algorithm.

All of the examples @shwina gave can be parallelized, but not if the user writes their algorithm where they're getting each group and then operating on it directly.
groupby.transform(), groupby.filter(), groupby.rolling()

All of those shouldn't necessarily need the offsets / ability to iterate over groups, 100% agreed. I was referring specifically to the case of groupby.apply() which is specifically to allow running DataFrame / Series functions against groups.

All of the examples @shwina gave can be parallelized, but not if the user writes their algorithm where they're getting each group and then operating on it directly.

To be clear, the user won't do that. They provide a function (often a lambda) that is meant to act on each group. It's up to us how we want to apply that function to the groups (in parallel or via iteration).

  • For .transform(), they provide a function that returns a result of the same size as the group.
  • For .filter(), they provide a function that returns True or False.
  • .apply() is more general and handles some special cases.

I agree all of these can be parallelized, but until libcudf can support the above, I'm proposing we implement the APIs via iteration over the groups. We can issue a warning when users call those APIs.
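As an illustration of the "implement via iteration" proposal, here is a pure-Python sketch of a groupby filter built by looping over groups. Pandas stands in for cuDF, and groupby_filter is a hypothetical helper, not an existing API:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3], "c": [1, 2, 3, 4, 5]})

def groupby_filter(df, by, func):
    """Keep the rows of every group for which func(group) is True,
    by iterating groups one at a time (the slow fallback path)."""
    kept = [g for _, g in df.groupby(by) if func(g)]
    return pd.concat(kept) if kept else df.iloc[0:0]

# Keep groups whose 'c' column sums to more than 3.
out = groupby_filter(df, "a", lambda g: g["c"].sum() > 3)
```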

I agree all of these can be parallelized, but until libcudf can support the above, I'm proposing we implement the APIs via iteration over the groups. We can issue a warning when users call those APIs.

My understanding of the original suggestion was to give Python end users direct access to the groups. What you're suggesting is different than that.

If the access to the groups is simply an implementation detail hidden behind APIs like transform/filter/apply, then that's fine with me. That gives us more of an opportunity to actually parallelize the computation in the future.

If the access to the groups is simply an implementation detail hidden behind APIs like transform/filter/apply, then that's fine with me. That gives us more of an opportunity to actually parallelize the computation in the future.

Great - sorry for not being clearer about this earlier!

give Python end users direct access to the groups

We can still do that (and Pandas does). But most users will use the .transform(), .filter() and .apply() APIs to actually compute on those groups.

But most users will use the .transform(), .filter() and .apply() APIs to actually compute on those groups.

Hmmm, we had @hummingtree ready to work on these features (groupby UDFs) and I believe we were told they aren't high enough priority right now so we should have him work on something else.

BTW, in the above conversation, I think there are different ideas about what it means to "parallelize". I think the libcudf team means launching a single kernel that processes all the work (e.g., one thread per element; neighboring threads may operate on different groups). The cuDF Python team seems to be referring to launching a kernel per group on separate streams to get overlap. The former is possible and should be much more efficient in all of the cases I've heard, and we don't have any plans to expose streams in the public API of libcudf.

All of this said, I am (as always) firmly in the pragmatic camp. I don't feel strongly opposed to exposing groupby groups for internal use, as long as we clearly document (and warn where possible) the performance implications and best practices.

To be fair, the only place we want to expose grabbing a specific group or iterating over groups is specifically the case where it's extremely non-trivial to have libcudf parallelize the operation for us. I.e., in .apply() where someone passes arbitrary DataFrame-level APIs, or if someone explicitly asks for a group.

Even the Pandas documentation warns that this exists to allow flexibility, but will slow things down significantly, and that you should take a fast path whenever possible: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html

Just to clarify: accepting functions that operate on one group as one DataFrame would mean that we would never be able to optimize it in the future. Pandas' documentation, while warning against using apply() for performance reasons, recommends transform(), which accepts Series-level functions. And while Pandas doesn't mention a performance problem with that, we still have the same issue of never being able to optimize it. Therefore, in the docs, would you kindly advise against generic functions in transform() in favour of named ones? By named functions, I mean like this: df.groupby('A')["B"].transform('sum')

Therefore, in the docs, would you kindly advise against generic functions in transform(), in favour of named ones. By named functions, I mean like this: df.groupby('A')["B"].transform('sum')

Yes, we can strongly warn / advise that while we provide these for compatibility and will make a best effort to make them "fast", you'll be infinitely better off using built-in functions.
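To make the named-vs-generic distinction concrete, a small sketch (pandas used for illustration; the same contrast would apply to cuDF). The named string form can be dispatched to an optimized parallel path, while the lambda is an opaque callable that forces group-at-a-time execution:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

# Named function: the library knows it is 'sum' and can optimize it.
fast = df.groupby("A")["B"].transform("sum")

# Generic callable: same result, but can never be optimized, because
# the library cannot see inside the function.
slow = df.groupby("A")["B"].transform(lambda s: s.sum())
```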
