Currently when you run a groupby, any nulls in the key column are dropped from the output:
```python
import cudf

df = cudf.DataFrame()
df['id'] = [0, 1, 1, None, None, 3, 3]
df['val'] = [0, 1, 1, 2, 2, 3, 3]
df.groupby('id').val.sum()
```

```
0    0
1    2
3    6
Name: val, dtype: int64
```
The workaround today is to fill the nulls with a sentinel value before grouping, then restore them afterwards:

```python
df['id'] = df['id'].fillna(-1)                   # encode nulls as a sentinel
res = df.groupby('id').val.sum().reset_index()
res['id'] = res['id'].replace(-1, None)          # restore nulls in the result
res
```
| id   | val |
| ---- | --- |
| null | 4   |
| 0    | 0   |
| 1    | 2   |
| 3    | 6   |
EDIT - Changed the suggested parameter from `keep_nulls` to `dropna` to match the pandas API, per Keith's comment.
The C++ implementation already supports this behavior via the `ignore_null_keys` option:
https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/groupby.hpp#L74
It's just a matter of exposing this in the Python API.
Looks like the Pandas community has had discussions surrounding this and the consensus was to use a dropna parameter: https://github.com/pandas-dev/pandas/pull/21669
I'd argue we should do the same.
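For reference, the `dropna` parameter from the linked pandas PR discussion did land in pandas (1.1+), so the proposed cudf behavior can be sketched against it. This is a demonstration with pandas, not cudf, since the cudf argument doesn't exist yet:

```python
import pandas as pd

# Demonstrating the proposed `dropna` semantics with pandas >= 1.1,
# where the parameter from the linked PR discussion was added.
df = pd.DataFrame({
    "id": [0, 1, 1, None, None, 3, 3],
    "val": [0, 1, 1, 2, 2, 3, 3],
})

# Default (dropna=True): null keys are dropped, matching cudf's current behavior.
default = df.groupby("id").val.sum()

# dropna=False: null keys form their own group; it sums to 4 (2 + 2) here.
kept = df.groupby("id", dropna=False).val.sum()
print(kept)
```

Exposing the same `dropna=` keyword on `cudf.DataFrame.groupby` and routing it to the C++ `ignore_null_keys` flag would keep the two libraries API-compatible.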