cuDF: [FEA] Support datetime columns in multi-column groupby aggregations

Created on 26 Jun 2019 · 10 comments · Source: rapidsai/cudf

Is your feature request related to a problem? Please describe.
cuDF should support datetime columns in multi-column groupby aggregations

Describe the solution you'd like
Suppose we have a dataset that looks like

                timestamp  a  b  c  score
0 2019-06-24T14:05:54.894  0  0  0     10
1 2019-06-25T23:41:58.494  0  0  1     20
2 2019-06-24T06:17:30.909  0  1  0     30
3 2019-06-27T21:00:31.250  0  1  1     40
4 2019-06-23T14:02:41.869  1  0  0     50
5 2019-06-27T16:10:31.940  1  0  1     60
6 2019-06-24T16:33:20.057  1  1  0     70
7 2019-06-26T19:57:00.787  1  1  1     80

We want to perform a multi-groupby on the timestamp and another column, then aggregate over the score. This should look like

data.groupby(['timestamp', 'a']).agg({'score': ['mean', 'count']}).compute().to_pandas()

But this gives

AttributeError: 'DatetimeColumn' object has no attribute 'unique'

This is a common use case that is supported with non-datetime columns, e.g.

data.groupby(['b', 'a']).agg({'score': ['mean', 'count']}).compute().to_pandas()

gives

     (score, mean)  (score, count)
b a
0 0           15.0               2
  1           55.0               2
1 0           35.0               2
  1           75.0               2

Describe alternatives you've considered
A current workaround is to convert the timestamps to int64, or only use part of the timestamp, such as

data['timestamp'] = data['timestamp'].dt.day
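The other workaround mentioned above, casting to int64, can be sketched as follows. The example uses pandas for illustration since cuDF mirrors this part of the pandas API; the DataFrame contents are a small subset of the sample data above.

```python
import pandas as pd  # pandas shown for illustration; cuDF mirrors this API

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-06-24T14:05:54.894",
        "2019-06-25T23:41:58.494",
    ]),
    "a": [0, 0],
    "score": [10, 20],
})

# Workaround: replace the datetime column with its int64 representation
# (nanoseconds since the epoch), which groupby treats as an ordinary
# numeric key, avoiding the DatetimeColumn.unique error.
df["timestamp"] = df["timestamp"].astype("int64")
result = df.groupby(["timestamp", "a"]).agg({"score": ["mean", "count"]})
```

The downside is that the resulting index holds raw nanosecond integers rather than readable timestamps, so it usually needs to be converted back for display.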
Labels: cuDF (Python), feature request

All 10 comments

This looks to be purely in the python layer. There's nothing in the C++ layer that prevents using a date32 or date64 as a key.

@RFinkelberg Can you check if passing as_index=False to the groupby call resolves the issue? I believe this is isolated to the MultiIndex code and should hopefully be a relatively easy fix.
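For reference, `as_index=False` makes groupby return the keys as ordinary columns instead of building a (Multi)Index, which is why it would sidestep the MultiIndex code path. A minimal sketch, again using pandas since cuDF mirrors this interface:

```python
import pandas as pd  # pandas shown for illustration; cuDF mirrors this API

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-06-24", "2019-06-24", "2019-06-25"]),
    "a": [0, 1, 0],
    "score": [10, 30, 20],
})

# as_index=False keeps 'timestamp' and 'a' as regular columns in the
# result, so no MultiIndex is constructed from the datetime key.
result = df.groupby(["timestamp", "a"], as_index=False).agg({"score": "mean"})
```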

@RFinkelberg based on a quick look, I actually think the implementation for unique_segments/unique/unique_count in NumericalColumn would work on datetime columns if you copied it verbatim to the DateTimeColumn class. It looks like all of our column types also use the same unique_segments and unique_count function, and there is also a lot of repetition between unique, unique_counts, and value_counts.

Rather than duplicating this code a third time for DatetimeColumn, we should consider factoring it out.
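The factoring-out suggested above might look like the sketch below. All class and method names here are hypothetical, and NumPy stands in for the libcudf calls a real column implementation would make; the point is only that the unique logic lives in one shared place that any column type can inherit.

```python
import numpy as np


class UniqueOpsMixin:
    """Hypothetical shared home for unique/unique_count, so that
    NumericalColumn, StringColumn, and DatetimeColumn do not each
    carry a verbatim copy of the same logic."""

    def unique(self):
        # Delegates to the column's backing array; real cuDF would
        # dispatch to libcudf rather than NumPy.
        return np.unique(self._values)

    def unique_count(self):
        return len(self.unique())


class DatetimeColumn(UniqueOpsMixin):
    """Hypothetical datetime column that gains unique() for free."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype="datetime64[ns]")


col = DatetimeColumn(["2019-06-24", "2019-06-24", "2019-06-25"])
```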

EDIT: This isn't directly related to this issue, but gets at the error Roy posted

@kkraus14 You're right that as_index=False solves the issue, but this actually came from a dask_cudf use case, and dask_cudf's groupby doesn't support as_index=False

Yep, just confirms the issue is isolated to MultiIndex code, likely in what @beckernick posted above and should be a relatively straightforward fix.

@thomcom is this fixed with #2221?

No, sorry. I didn't know this was a P0. It might be a fairly quick fix; I'll look into it today.

So, it is fixed, but the resulting dataframe can't be printed via to_pandas because of a different bug. I'm currently deep in to_pandas type work, so I'll add a test for this example and determine why it can't be converted for display.

Yup

In [6]: df.groupby(['timestamp', 'a']).agg({'score': ['mean', 'count']})
Out[6]:
             score
              mean count
timestamp  a
1970-01-01 0  25.0     4
           1  65.0     4

I haven't decided how to get the fix in - that is, alone or as part of the fix for #406 and #489.

Great, resolving.
