Currently when you run a groupby, any nulls in the key column are dropped from the output:
```python
import cudf

df = cudf.DataFrame()
df['id'] = [0, 1, 1, None, None, 3, 3]
df['val'] = [0, 1, 1, 2, 2, 3, 3]
df.groupby('id').val.sum()
```

```
0    0
1    2
3    6
Name: val, dtype: int64
```
The workaround today is to fill the nulls with a sentinel value before grouping, then restore them afterwards:

```python
df['id'] = df['id'].fillna(-1)                   # encode nulls as a sentinel
res = df.groupby('id').val.sum().reset_index()
res['id'] = res['id'].replace(-1, None)          # restore nulls in the result
res
```
| id   | val |
| ---- | --- |
| null | 4   |
| 0    | 0   |
| 1    | 2   |
| 3    | 6   |
EDIT - Changed the suggested parameter from `keep_nulls` to `dropna` to match the pandas API, per Keith's comment.
The C++ implementation already supports this behavior via the `ignore_null_keys` option:
https://github.com/rapidsai/cudf/blob/branch-0.10/cpp/include/cudf/groupby.hpp#L74
It's just a matter of exposing this in the Python API.
Looks like the Pandas community has had discussions surrounding this and the consensus was to use a dropna parameter: https://github.com/pandas-dev/pandas/pull/21669
I'd argue we should do the same.
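For reference, the `dropna` parameter from the linked pandas PR discussion did land in pandas (1.1+), so the proposed cudf behavior can be sketched against it. This is a demonstration with pandas, not cudf, since the cudf argument doesn't exist yet:

```python
import pandas as pd

# Demonstrating the proposed `dropna` semantics with pandas >= 1.1,
# where the parameter from the linked PR discussion was added.
df = pd.DataFrame({
    "id": [0, 1, 1, None, None, 3, 3],
    "val": [0, 1, 1, 2, 2, 3, 3],
})

# Default (dropna=True): null keys are dropped, matching cudf's current behavior.
default = df.groupby("id").val.sum()

# dropna=False: null keys form their own group; it sums to 4 (2 + 2) here.
kept = df.groupby("id", dropna=False).val.sum()
print(kept)
```

Exposing the same `dropna=` keyword on `cudf.DataFrame.groupby` and routing it to the C++ `ignore_null_keys` flag would keep the two libraries API-compatible.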