Cudf: [BUG] Dask-Cudf losing index name in group-by on string column

Created on 19 Nov 2019 · 19 comments · Source: rapidsai/cudf

Describe the bug

Dask seems to be losing the index name after a groupby on string columns. This causes bugs downstream.

Steps/Code to reproduce bug

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

Expected behavior

The expected behavior matches that of non-string columns:

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':[1,2]*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1
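For reference, this name-preserving behavior can be checked against plain pandas (used here as a stand-in, since cudf mirrors its API): a groupby result's index is named after the group key, whether the key column is string or numeric.

```python
import pandas as pd

# pandas names the result index after the group key, even for string keys;
# this is the invariant the dask_cudf string-column case is violating.
df = pd.DataFrame({'1': ['s1', 's2'] * 4, '2': [0, 1] * 4})
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.index.name)  # -> 1
```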

Environment overview (please complete the following information)

  • Method of cuDF install: conda
cudf                      0.11.0a191119         py37_3035    rapidsai-nightly
dask-cudf                 0.11.0a191119         py37_3035    rapidsai-nightly
libcudf                   0.11.0a191119     cuda10.1_2997    rapidsai-nightly
Labels: ? - Needs Triage, bug

All 19 comments

I didn't get a chance to look for a fix yet, but I do believe I tracked the problem to cudf.concat:

In [1]: import cudf                                                                                                                                                               

In [2]: df1 = cudf.DataFrame({"a": [1, 2]}, index=["s1", "s2"])                                                                                                                                                                  

In [3]: df2 = cudf.DataFrame({"a": [3, 4]}, index=["s3", "s4"])                                                                                                                                                                  

In [4]: df1.index.name = "index"                                                                                                                                                                                                 

In [5]: df2.index.name = "index"                                                                                                                                                                                                 

In [6]: cudf.concat([df1, df2])                                                                                                                                                                                                 
Out[6]: 
    a
s1  1
s2  2
s3  3
s4  4

In [7]: pd.concat([df1.to_pandas() , df2.to_pandas() ])                                                                                                                                                                                                 
Out[7]: 
       a
index   
s1     1
s2     2
s3     3
s4     4

This appears not to be solved yet for the distributed scheduler, which may be related to the behavior in #3443:

import dask_cudf
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

With the default scheduler, the same snippet now behaves as expected:

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1

@rjzamora since this only occurs if there's multiple partitions, think it could be serialization / deserialization related?

> @rjzamora since this only occurs if there's multiple partitions, think it could be serialization / deserialization related?

That is also my suspicion - I will start digging into this soon

@rjzamora I've been poking at this a bit; it looks like 0-sized DataFrame indexes aren't handled properly in serialization → deserialization, where they end up as unnamed RangeIndexes.

i.e.

# gdf is the dask_cudf groupby result from the reproducer; its _meta is a 0-row DataFrame
serialized = gdf._meta.serialize()
deserialized = cudf.DataFrame.deserialize(*serialized)
type(deserialized.index)  # an unnamed RangeIndex; the original name and type are lost
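As a point of comparison, a plain pandas + pickle round-trip (a hypothetical stand-in for cudf's serialize/deserialize pair, not the actual cudf code path) preserves the name and type of a zero-row frame's index:

```python
import pickle

import pandas as pd

# A 0-row frame like dask's _meta, with a named, object-dtype index
meta = pd.DataFrame(
    {"2": pd.Series([], dtype="int64")},
    index=pd.Index([], dtype="object", name="1"),
)

# Round-trip through pickle; the index name and type survive
restored = pickle.loads(pickle.dumps(meta))
print(restored.index.name)  # -> 1
```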

Well, diving in deeper, it seems the issue is actually in how we reconstruct a DataFrame: if an index is empty, pandas and cuDF both override it with a RangeIndex when assigning columns. We likely just need to reach inside the object more during deserialization, blegh.

> Well, diving in deeper, it seems the issue is actually in how we reconstruct a DataFrame: if an index is empty, pandas and cuDF both override it with a RangeIndex when assigning columns. We likely just need to reach inside the object more during deserialization, blegh.

Good find - I just started looking into this also. Perhaps we can just do something like this:

import pickle
import uuid

import cudf
from cudf.core import column  # cudf's internal column module


def deserialize(header, frames):

    # Reconstruct the index
    index_frames = frames[: header["index_frame_count"]]

    idx_typ = pickle.loads(header["index"]["type"])
    index = idx_typ.deserialize(header["index"], index_frames)
    # Fall back to a throwaway unique name if the index is unnamed
    index_name = index.name or str(uuid.uuid4())

    # Reconstruct the columns
    column_frames = frames[header["index_frame_count"] :]

    column_names = list(pickle.loads(header["column_names"])) + [index_name]
    columns = column.deserialize_columns(header["columns"], column_frames) + [
        column.as_column(index)
    ]

    # Treat the index as a regular column, then restore it with set_index
    result = cudf.DataFrame(dict(zip(column_names, columns))).set_index(index_name)
    result.index.name = index.name
    return result

Note that the index type seems to be preserved if you treat it as a column and then use set_index before returning.
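The column-then-set_index trick can be sketched in plain pandas (cudf follows the same API here; this is an illustration, not the cudf code itself):

```python
import pandas as pd

# Start with a frame whose index has a name and a string dtype
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["s1", "s2"], name="key"))

# Demote the index to a regular column, then promote it back with set_index;
# the name and dtype come through intact.
restored = df.reset_index().set_index("key")

print(restored.index.name)  # -> key
```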

@rjzamora That will hide the symptom here, but there's still the general problem of constructing a DataFrame from columns that have a defined index. I.e.:

pd.DataFrame(
    {
        'a': [],
        'b': pd.Series([], dtype='str'),
        'c': pd.Series([], index=pd.Index([], dtype='str'))
    },
    index=pd.Index([], dtype='float64')
)

You'd expect a DataFrame with an empty float64 index, but we currently create the DataFrame with the index first and then assign the columns, which ends up with an empty string index.
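A quick pandas check of that expectation (a hypothetical adaptation of the snippet above, substituting dtype="object" for the string columns): when an explicit index= is passed, pandas keeps it rather than adopting a column's index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [],
        "b": pd.Series([], dtype="object"),
        "c": pd.Series([], index=pd.Index([], dtype="object")),
    },
    index=pd.Index([], dtype="float64"),
)
print(df.index.dtype)  # -> float64: the explicit index wins in pandas
```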

Ah, I misunderstood - I thought that pandas and cuDF were behaving in a consistent way here. I agree that we should fix the underlying problem.

@rjzamora Agreed, but it's going to be rough and ugly in the current state. I'd much rather wait for the refactor @shwina is currently working on before we try to tackle this properly; if we can work around it for now somehow, that would be more ideal.

> where if we can workaround it for now somehow that would be more ideal

Is something like d53b5b7 a reasonable short-term fix for the serialization symptom? Or, are you suggesting a short-term workaround in the DataFrame constructor?

Appreciate you both taking such a deep look into this

> where if we can workaround it for now somehow that would be more ideal
>
> Is something like d53b5b7 a reasonable short-term fix for the serialization symptom? Or, are you suggesting a short-term workaround in the DataFrame constructor?

Yes, that looks like a sufficient workaround to me for the short term.

@rjzamora @kkraus14 , are we comfortable moving forward with the proposed short term workaround?

It looks like the test in https://github.com/rjzamora/cudf/commit/d53b5b7e1a45406159f439fe573367bcc4ef8188 passes in current cudf as is

Looks like this is passing tests. Never mind!

@shwina beat me to it :). Closing the issue

@beckernick - I apologize for letting this slip without submitting a PR for the workaround, but I did assume this bug would go away after the refactor by @shwina.

Edit: Great! Glad this is resolved :)
