Describe the bug
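The dtype of the categorical codes is inconsistent between dask_cudf and cudf: the dask_cudf series reports int8 before compute, while the computed cudf result reports int32.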
Steps/Code to reproduce bug
import dask_cudf
import cudf
df = dask_cudf.from_cudf(cudf.DataFrame({'col_1':['a','b','c']}),npartitions=2)
df['col_1'] = df['col_1'].astype("category")
print(df['col_1'].cat.codes.dtype)
print(df['col_1'].compute().cat.codes.dtype)
int8
int32
Expected behavior
Both should report the same dtype.
@VibhuJawa does this break things or is it just increased memory consumption that's an issue for you? I imagine Dask's codebase for typecasting to categorical is more complicated because it has potentially distributed dictionaries and so it may just use int32 to be safe, whereas cudf currently tries to use the smallest integer type possible.
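As a rough illustration of what "the smallest integer type possible" means for category codes, here is a minimal sketch in plain Python. The helper `smallest_codes_dtype` is hypothetical and is not part of cudf or Dask; it assumes codes are signed and that -1 is reserved for missing values, as pandas does.

```python
# Hypothetical helper, not a cudf or Dask API: pick the narrowest signed
# integer type that can hold codes 0..n_categories-1 (with -1 for missing).
SIGNED_INT_MAX = {
    "int8": 2**7 - 1,
    "int16": 2**15 - 1,
    "int32": 2**31 - 1,
    "int64": 2**63 - 1,
}


def smallest_codes_dtype(n_categories: int) -> str:
    """Return the name of the narrowest signed dtype for the given category count."""
    if n_categories < 0:
        raise ValueError("n_categories must be non-negative")
    largest_code = n_categories - 1  # codes start at 0
    # Dicts preserve insertion order, so candidates are tried narrowest first.
    for name, max_value in SIGNED_INT_MAX.items():
        if largest_code <= max_value:
            return name
    raise OverflowError("too many categories for a 64-bit code")


print(smallest_codes_dtype(3))    # three categories, as in the reproducer
print(smallest_codes_dtype(129))  # one more than int8 can address
```

Under these assumptions, any column with up to 128 categories fits in int8, so a distributed implementation that cannot know the final category count up front may reasonably fall back to a wider type.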
@VibhuJawa does this break things or is it just increased memory consumption that's an issue for you?
The increased memory consumption is the issue. In my workflow there are just 8 categories, yet the underlying dtype appears to be int32.
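To put the memory concern in numbers, here is a small back-of-the-envelope sketch. The row count is a made-up figure for illustration, not one taken from the workflow described here, and the calculation covers only the codes array, not the categories or any null mask.

```python
import struct

# Standard (platform-independent) sizes of the two code widths in question.
INT8_BYTES = struct.calcsize("<b")   # 1 byte per code
INT32_BYTES = struct.calcsize("<i")  # 4 bytes per code


def codes_bytes(n_rows: int, itemsize: int) -> int:
    """Bytes needed for the codes array alone."""
    return n_rows * itemsize


n_rows = 100_000_000  # hypothetical row count, chosen only for illustration
int8_total = codes_bytes(n_rows, INT8_BYTES)
int32_total = codes_bytes(n_rows, INT32_BYTES)

print(f"int8 codes:  {int8_total / 1e6:.0f} MB")
print(f"int32 codes: {int32_total / 1e6:.0f} MB")
print(f"ratio: {int32_total // int8_total}x")
```

With only 8 categories every code fits in a single byte, so int32 codes take four times the space for the same column.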
I imagine Dask's codebase for typecasting to categorical is more complicated because it has potentially distributed dictionaries and so it may just use int32 to be safe, whereas cudf currently tries to use the smallest integer type possible.
Isn't it the other way around here?
import cudf
df = cudf.DataFrame({'col_1':['a','b','c']})
print(df['col_1'].astype('category').cat.codes.dtype)
int32
I would ideally like it to be int8 in the underlying structure.
Current workaround:
I am currently using the code below to cast safely; the dataframe is small enough (before a complex merge) that this is not costly.
```
...
df = cudf.DataFrame({'col_1':['a','b','c']})
print(df['col_1'].astype('category').cat.codes.dtype)
```
int32
I would ideally like it to be int8 in the underlying structure.
This particular part should have been addressed by #5084. Both the two lines above and the snippet below now print int8:
df = dask_cudf.from_cudf(cudf.DataFrame({'col_1':['a','b','c']}),npartitions=2)
df['col_1'] = df['col_1'].astype("category")
print(df['col_1'].compute().cat.codes.dtype)
Thanks @philtrade! Going to close this as resolved by #5084.