**Is your feature request related to a problem? Please describe.**
cuDF should support datetime columns in multi-column groupby aggregations
**Describe the solution you'd like**
Suppose we have a dataset that looks like:

```
                 timestamp  a  b  c  score
0  2019-06-24T14:05:54.894  0  0  0     10
1  2019-06-25T23:41:58.494  0  0  1     20
2  2019-06-24T06:17:30.909  0  1  0     30
3  2019-06-27T21:00:31.250  0  1  1     40
4  2019-06-23T14:02:41.869  1  0  0     50
5  2019-06-27T16:10:31.940  1  0  1     60
6  2019-06-24T16:33:20.057  1  1  0     70
7  2019-06-26T19:57:00.787  1  1  1     80
```
We want to perform a multi-column groupby on the timestamp and another column, then aggregate over the score. This should look like:

```python
data.groupby(['timestamp', 'a']).agg({'score': ['mean', 'count']}).compute().to_pandas()
```
But this gives:

```
AttributeError: 'DatetimeColumn' object has no attribute 'unique'
```
This is a common use case that is supported with non-datetime columns, e.g.

```python
data.groupby(['b', 'a']).agg({'score': ['mean', 'count']}).compute().to_pandas()
```
gives:

```
     (score, mean)  (score, count)
b a
0 0           15.0               2
  1           55.0               2
1 0           35.0               2
  1           75.0               2
```
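For reference, the expected behavior of the non-datetime case can be reproduced with plain pandas (whose API cuDF mirrors); this is a minimal sketch reconstructing the example data above, not cuDF itself:

```python
import pandas as pd

# Reconstruction of the example data (keys and scores from the table above)
df = pd.DataFrame({
    'a':     [0, 0, 0, 0, 1, 1, 1, 1],
    'b':     [0, 0, 1, 1, 0, 0, 1, 1],
    'score': [10, 20, 30, 40, 50, 60, 70, 80],
})

# Multi-column groupby with multiple aggregations over 'score'
out = df.groupby(['b', 'a']).agg({'score': ['mean', 'count']})
print(out)
```

The result has a two-level row index `(b, a)` and a two-level column index `(score, mean)` / `(score, count)`, matching the output shown above.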
**Describe alternatives you've considered**
A current workaround is to convert the timestamps to int64, or only use part of the timestamp, such as
```python
data['timestamp'] = data['timestamp'].dt.day
```
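Both workarounds can be sketched with pandas (the cuDF syntax is the same); column names like `ts_int` and `day` here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2019-06-24T14:05:54.894',
                                 '2019-06-25T23:41:58.494']),
    'a':     [0, 0],
    'score': [10, 20],
})

# Workaround 1: cast the datetime key to int64 (nanoseconds since epoch)
df['ts_int'] = df['timestamp'].astype('int64')
out = df.groupby(['ts_int', 'a']).agg({'score': ['mean', 'count']})

# Workaround 2: group on only part of the timestamp, e.g. the day
df['day'] = df['timestamp'].dt.day
out2 = df.groupby(['day', 'a']).agg({'score': 'mean'})
```

The downside is that the original timestamps are lost from the result index (or reduced to a single component), which is why native datetime key support is preferable.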
This looks to be purely in the Python layer. There's nothing in the C++ layer that prevents using a `date32` or `date64` as a key.
@RFinkelberg Can you check if passing `as_index=False` to the groupby call resolves the issue? I believe this is isolated to the `MultiIndex` code and should hopefully be a relatively easy fix.
@RFinkelberg based on a quick look, I actually think the implementation of `unique_segments`/`unique`/`unique_count` in `NumericalColumn` would work on datetime columns if copied verbatim to the `DatetimeColumn` class. It looks like all of our column types use the same `unique_segments` and `unique_count` functions, and there is also a lot of repetition between `unique`, `unique_count`, and `value_counts`.
Rather than duplicating this code a third time for `DatetimeColumn`, we should consider factoring it out.
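One way that factoring-out could look is a shared base implementation that the concrete column classes inherit. This is a hypothetical sketch only; the class and method names mirror those mentioned above but are not cuDF's actual internals:

```python
# Hypothetical sketch: hoist the duplicated unique logic into a base class
# so NumericalColumn and DatetimeColumn don't each carry their own copy.
class ColumnBase:
    def __init__(self, values):
        self.values = list(values)

    def unique(self):
        # Return distinct values, preserving first-seen order
        seen, out = set(), []
        for v in self.values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out

    def unique_count(self):
        return len(self.unique())


class NumericalColumn(ColumnBase):
    pass  # numeric-specific methods would live here


class DatetimeColumn(ColumnBase):
    pass  # inherits unique/unique_count instead of re-implementing them
```

With this shape, adding datetime support to groupby keys would not require copying the `unique` machinery a third time.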
EDIT: This isn't directly related to this issue, but gets at the error Roy posted
@kkraus14 You're right that `as_index=False` solves the issue, but this actually came from a dask_cudf use case, and dask_cudf's groupby doesn't support `as_index=False`.
Yep, that just confirms the issue is isolated to the `MultiIndex` code, likely in what @beckernick posted above, and should be a relatively straightforward fix.
@thomcom is this fixed with #2221?
No, sorry. I didn't know this was a P0; it might be a fairly quick fix. I'll look into it today.
So, it is fixed, but the resulting dataframe can't be printed via `to_pandas` due to a different bug. I'm currently deep in `to_pandas` type work; I'll add a test for this example and determine why it can't be converted for display.
Yup
```python
In [6]: df.groupby(['timestamp', 'a']).agg({'score': ['mean', 'count']})
Out[6]:
              score
               mean count
timestamp  a
1970-01-01 0   25.0     4
           1   65.0     4
```
I haven't decided how to get the fix in yet, that is, on its own or as part of the fix for #406 and #489.
Great, resolving.