Describe the bug
Dask seems to be losing the index name after a groupby on string columns. This causes bugs downstream.
Steps/Code to reproduce bug
import dask_cudf
import cudf
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)
Output
None
1
Expected behavior
The expected behavior matches that of non-string columns:
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':[1,2]*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)
Output
1
1
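For reference, the analogous pure-pandas groupby keeps the grouping key as a named index for string keys too, which is the behavior being asked for here (a minimal pandas-only sketch, no dask/cudf involved):

```python
import pandas as pd

# pandas preserves the grouping column as a named index,
# for string keys just as for numeric ones
df = pd.DataFrame({"1": ["s1", "s2"] * 4, "2": [0, 1] * 4})
gdf = df.groupby("1").agg({"2": "count"})
print(gdf.index.name)  # "1"
```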
Environment overview (please complete the following information)
cudf 0.11.0a191119 py37_3035 rapidsai-nightly
dask-cudf 0.11.0a191119 py37_3035 rapidsai-nightly
libcudf 0.11.0a191119 cuda10.1_2997 rapidsai-nightly
I didn't get a chance to look for a fix yet, but I do believe I tracked the problem to cudf.concat:
In [1]: import cudf
In [2]: df1 = cudf.DataFrame({"a": [1, 2]}, index=["s1", "s2"])
In [3]: df2 = cudf.DataFrame({"a": [3, 4]}, index=["s3", "s4"])
In [4]: df1.index.name = "index"
In [5]: df2.index.name = "index"
In [6]: cudf.concat([df1, df2])
Out[6]:
a
s1 1
s2 2
s3 3
s4 4
In [7]: pd.concat([df1.to_pandas(), df2.to_pandas()])
Out[7]:
a
index
s1 1
s2 2
s3 3
s4 4
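The pandas side of the comparison above can be run as a standalone script; pd.concat carries the shared index name through, which is what cudf.concat appears to be dropping:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]}, index=["s1", "s2"])
df2 = pd.DataFrame({"a": [3, 4]}, index=["s3", "s4"])
df1.index.name = "index"
df2.index.name = "index"

# pandas keeps the common index name on the concatenated result
out = pd.concat([df1, df2])
print(out.index.name)  # "index"
```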
This appears not to be solved yet for the distributed scheduler, which may be related to the behavior in #3443:
import dask_cudf
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)
None
1
import dask_cudf
import cudf
df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)
1
1
@rjzamora since this only occurs if there's multiple partitions, think it could be serialization / deserialization related?
That is also my suspicion - I will start digging into this soon
@rjzamora I've been poking at this a bit; it looks like 0-sized DataFrame indexes aren't handled properly in serialization --> deserialization, where they end up as unnamed RangeIndexes.
i.e.
serialized = gdf._meta.serialize()
deserialized = cudf.DataFrame.deserialize(*serialized)
type(deserialized.index)
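For contrast, a pandas pickle round-trip of a zero-row frame (much like a dask meta object) keeps the empty index's type and name, which is exactly what the cudf serialize/deserialize path is losing here. A small pandas illustration, not cudf's actual serialization code:

```python
import pickle

import pandas as pd

# Zero-row slice of a frame with a named string index,
# analogous to a dask_cudf _meta object
df = pd.DataFrame({"a": [1]}, index=pd.Index(["s1"], name="index"))
empty = df.iloc[:0]

# pandas' pickle round-trip preserves both index type and name
roundtrip = pickle.loads(pickle.dumps(empty))
print(type(roundtrip.index).__name__, roundtrip.index.name)
```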
Well, diving in deeper, it seems the issue is actually in how we reconstruct a DataFrame: if an index is empty, pandas and cuDF both override it with a RangeIndex when assigning columns. We likely just need to reach inside the object more for deserialization, blegh.
Good find - I just started looking into this also. Perhaps we can just do something like this:
def deserialize(header, frames):
    # Reconstruct the index
    index_frames = frames[: header["index_frame_count"]]
    idx_typ = pickle.loads(header["index"]["type"])
    index = idx_typ.deserialize(header["index"], index_frames)
    index_name = index.name or str(uuid.uuid4())

    # Reconstruct the columns, carrying the index along as an extra column
    column_frames = frames[header["index_frame_count"] :]
    column_names = list(pickle.loads(header["column_names"])) + [index_name]
    columns = column.deserialize_columns(header["columns"], column_frames) + [
        column.as_column(index)
    ]

    result = cudf.DataFrame(dict(zip(column_names, columns))).set_index(index_name)
    result.index.name = index.name
    return result
Note that the index type seems to be preserved if you treat it as a column and then use set_index before returning.
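The same set_index trick can be sketched in plain pandas: materialize the index as an ordinary column, then restore it with set_index, and the name and dtype survive even for a zero-length frame. The "__index__" column name below is a hypothetical placeholder, not cudf's internal naming:

```python
import pandas as pd

# Rebuild an empty frame: carry the index along as an ordinary column,
# then restore it with set_index so its dtype is kept
idx = pd.Index([], dtype="object", name="1")
cols = {"2": pd.Series([], dtype="int64"), "__index__": pd.Series(idx)}
result = pd.DataFrame(cols).set_index("__index__")
result.index.name = idx.name  # restore the original name

print(result.index.name, result.index.dtype)  # "1" object
```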
@rjzamora That will hide the symptom here, but there's still the general problem of constructing a DataFrame from columns that carry their own indexes alongside an explicitly defined index, e.g.:
pd.DataFrame(
    {
        'a': [],
        'b': pd.Series([], dtype='str'),
        'c': pd.Series([], index=pd.Index([], dtype='str')),
    },
    index=pd.Index([], dtype='float64'),
)
You'd expect a DataFrame with an empty float64 index, but we currently create the DataFrame with the index first and then assign the columns, which ends up with an empty string index.
Ah, I misunderstood - I thought that pandas and cuDF were behaving in a consistent way here. I agree that we should fix the underlying problem.
@rjzamora Agreed, but it's going to be rough and ugly in the current state. I'd much rather wait for the refactor @shwina is currently working on before we try to tackle this properly; if we can work around it for now somehow, that would be more ideal.
if we can work around it for now somehow, that would be more ideal
Is something like d53b5b7 a reasonable short-term fix for the serialization symptom? Or, are you suggesting a short-term workaround in the DataFrame constructor?
Appreciate you both taking such a deep look into this
Yes, that looks like a sufficient workaround to me for the short term.
@rjzamora @kkraus14 , are we comfortable moving forward with the proposed short term workaround?
It looks like the test in https://github.com/rjzamora/cudf/commit/d53b5b7e1a45406159f439fe573367bcc4ef8188 passes in current cudf as is
Looks like this is passing tests. Nevermind
@shwina beat me to it :). Closing the issue
@beckernick - I apologize for letting this slip without submitting a PR for the workaround, but I did assume this bug would go away after the refactor by @shwina.
Edit: Great! Glad this is resolved :)