Cudf: [FEA] Add optional support for SQL style groupby null in key column handling

Created on 19 Aug 2019 · 2 Comments · Source: rapidsai/cudf

Currently when you run a groupby, any nulls in the key column are dropped from the output:

import cudf

df = cudf.DataFrame()

df['id'] = [0, 1, 1, None, None, 3, 3]
df['val'] = [0, 1, 1, 2, 2, 3, 3]
df.groupby('id').val.sum()
0    0
1    2
3    6
Name: val, dtype: int64



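In SQL, null keys form their own group. To get that result today, one workaround is to fill the nulls with a sentinel value before grouping and restore them afterwards: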



df['id'] = df['id'].fillna(-1)
res = df.groupby('id').val.sum().reset_index()
res['id'] = res['id'].replace(-1, None)
res



This gives the SQL-style result, with the null keys kept as a group:



id | val
-- | --
null | 4
0 | 0
1 | 2
3 | 6

EDIT - Changed suggested param from keep_nulls to dropna to match Pandas's API per Keith's comment.
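
For illustration only, here is a minimal pure-Python sketch of the two behaviors a `dropna` flag would switch between. The `groupby_sum` helper is hypothetical and is not part of cuDF or Pandas; `None` stands in for a null key.

```python
def groupby_sum(keys, vals, dropna=True):
    """Sum vals per key; hypothetical helper, None plays the role of a null key."""
    totals = {}
    for key, val in zip(keys, vals):
        if key is None and dropna:
            # Current behavior: rows whose key is null are skipped.
            continue
        # With dropna=False, all null keys fall into a single None group (SQL style).
        totals[key] = totals.get(key, 0) + val
    return totals


ids = [0, 1, 1, None, None, 3, 3]
vals = [0, 1, 1, 2, 2, 3, 3]

dropped = groupby_sum(ids, vals)             # null keys dropped
kept = groupby_sum(ids, vals, dropna=False)  # null keys kept as one group

print(dropped)  # {0: 0, 1: 2, 3: 6}
print(kept)     # {0: 0, 1: 2, None: 4, 3: 6}
```

With `dropna=False` the two null rows sum to 4, matching the `null | 4` row in the table above.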

cuDF (Python) feature request

All 2 comments

The C++ implementation already supports this behavior with the ignore_null_keys option:
https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/groupby.hpp#L74

It's just a matter of exposing this in the Python API.

Looks like the Pandas community has had discussions surrounding this and the consensus was to use a dropna parameter: https://github.com/pandas-dev/pandas/pull/21669

I'd argue we should do the same.
