Cudf: [BUG] Dtype inconsistency b/w dask and cudf categorical dtypes

Created on 7 Feb 2020 · 4 Comments · Source: rapidsai/cudf

Describe the bug
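The dtype of the categorical codes is inconsistent between dask_cudf and cudf: the lazy dask_cudf column reports int8, while the same column after compute() reports int32.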

Steps/Code to reproduce bug

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame({'col_1':['a','b','c']}),npartitions=2)
df['col_1'] = df['col_1'].astype("category")
print(df['col_1'].cat.codes.dtype)
print(df['col_1'].compute().cat.codes.dtype)

int8
int32

Expected behavior

Both should report the same dtype.

Labels: bug, cuDF (Python), dask

VibhuJawa

All 4 comments

@VibhuJawa does this break things or is it just increased memory consumption that's an issue for you? I imagine Dask's codebase for typecasting to categorical is more complicated because it has potentially distributed dictionaries and so it may just use int32 to be safe, whereas cudf currently tries to use the smallest integer type possible.

kkraus14 on 7 Feb 2020
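To make the "smallest integer type possible" idea concrete, here is a minimal sketch of how a codes dtype could be chosen from the number of categories. It uses NumPy only; the helper name `smallest_signed_dtype` is hypothetical and is not taken from cudf or Dask, and the choice of signed types (so that a -1 code can mark a missing value, as pandas does) is an assumption.

```python
import numpy as np


def smallest_signed_dtype(n_categories: int) -> np.dtype:
    """Return the narrowest signed integer dtype that can hold the codes
    0 .. n_categories - 1 (hypothetical helper, not a cudf or Dask API)."""
    # The largest code that must be representable; an empty category set
    # still needs a valid dtype, so fall back to 0.
    max_code = max(n_categories - 1, 0)
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        if max_code <= np.iinfo(candidate).max:
            return np.dtype(candidate)
    raise ValueError("too many categories for a 64-bit signed integer")


# Three categories, as in the reproducer, fit comfortably in int8.
three_category_dtype = smallest_signed_dtype(3)
```

Under this scheme int8 covers up to 128 categories, so codes stored as int32 take four times the memory that is strictly needed.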

> @VibhuJawa does this break things or is it just increased memory consumption that's an issue for you?

The increased memory consumption is the issue. In my workflow I have just 8 categories, yet the underlying dtype appears to be int32.

> I imagine Dask's codebase for typecasting to categorical is more complicated because it has potentially distributed dictionaries and so it may just use int32 to be safe, whereas cudf currently tries to use the smallest integer type possible.

Isn't it the other way around here?

import cudf

df = cudf.DataFrame({'col_1':['a','b','c']})
print(df['col_1'].astype('category').cat.codes.dtype)

int32

I would ideally like it to be int8 in the underlying structure.

Current workaround:

I am currently using the helper below to cast safely; the dataframe is small enough (before a complex merge) that this is not costly.

https://github.com/rapidsai/cudf/blob/1ead2d55562b9cd1b80568b86b8b51ab616f3cb8/python/cudf/cudf/utils/dtypes.py#L187-L197

VibhuJawa on 7 Feb 2020
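For readers who cannot follow the link, the kind of safe downcast described in the workaround can be sketched as follows. This is an illustration on a plain NumPy array, not the linked cudf helper (whose contents are not reproduced here); the function name `downcast_codes` and the sample codes are assumptions.

```python
import numpy as np


def downcast_codes(codes: np.ndarray) -> np.ndarray:
    """Cast integer codes to the narrowest signed dtype that holds every
    value (hypothetical helper, not the cudf utility linked above)."""
    if codes.size == 0:
        return codes.astype(np.int8)
    lo, hi = int(codes.min()), int(codes.max())
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        # Check both bounds so that negative codes (e.g. -1) stay valid.
        if info.min <= lo and hi <= info.max:
            return codes.astype(candidate)
    raise ValueError("codes do not fit in a 64-bit signed integer")


# Eight categories (codes 0..7) stored as int32, as in the workflow above.
codes32 = np.array([0, 3, 7, 1, 7, 2], dtype=np.int32)
codes8 = downcast_codes(codes32)
```

The values are unchanged by the cast; only the storage width shrinks, here from 4 bytes to 1 byte per code.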

```
...
df = cudf.DataFrame({'col_1':['a','b','c']})
print(df['col_1'].astype('category').cat.codes.dtype)
# int32
```

> I would ideally like it to be int8 in the underlying structure.

This particular part should have been addressed by #5084. Both the two lines quoted above and the snippet below now print int8:

df = dask_cudf.from_cudf(cudf.DataFrame({'col_1':['a','b','c']}),npartitions=2)
df['col_1'] = df['col_1'].astype("category")
print(df['col_1'].compute().cat.codes.dtype)