Cudf: [BUG] Dask-Cudf losing index name in group-by on string column

Created on 19 Nov 2019 · 19 comments · Source: rapidsai/cudf

Describe the bug

Dask seems to be losing the index name after a groupby on string columns. This causes bugs downstream.

Steps/Code to reproduce bug

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':['s1','s2']*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

Expected behavior

The expected behavior matches that of non-string columns:

df = dask_cudf.from_cudf(cudf.DataFrame(data = {'1':[1,2]*4,'2':[0,1]*4}),npartitions=2)
gdf = df.groupby('1').agg({'2':'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1
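For reference, this name-preserving behavior can be checked against plain pandas (used here as a stand-in, since cudf mirrors its API): a groupby result's index is named after the group key, whether the key column is string or numeric.

```python
import pandas as pd

# pandas names the result index after the group key, even for string keys;
# this is the invariant the dask_cudf string-column case is violating.
df = pd.DataFrame({'1': ['s1', 's2'] * 4, '2': [0, 1] * 4})
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.index.name)  # -> 1
```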

Environment overview (please complete the following information)

  • Method of cuDF install: conda
cudf                      0.11.0a191119         py37_3035    rapidsai-nightly
dask-cudf                 0.11.0a191119         py37_3035    rapidsai-nightly
libcudf                   0.11.0a191119     cuda10.1_2997    rapidsai-nightly
Labels: ? - Needs Triage, bug

All 19 comments

I didn't get a chance to look for a fix yet, but I do believe I tracked the problem to cudf.concat:

In [1]: import cudf                                                                                                                                                               

In [2]: df1 = cudf.DataFrame({"a": [1, 2]}, index=["s1", "s2"])                                                                                                                                                                  

In [3]: df2 = cudf.DataFrame({"a": [3, 4]}, index=["s3", "s4"])                                                                                                                                                                  

In [4]: df1.index.name = "index"                                                                                                                                                                                                 

In [5]: df2.index.name = "index"                                                                                                                                                                                                 

In [6]: cudf.concat([df1, df2])                                                                                                                                                                                                 
Out[6]: 
    a
s1  1
s2  2
s3  3
s4  4

In [7]: pd.concat([df1.to_pandas() , df2.to_pandas() ])                                                                                                                                                                                                 
Out[7]: 
       a
index   
s1     1
s2     2
s3     3
s4     4

This appears not to be solved yet for the distributed scheduler, which may be related to the behavior in #3443:

import dask_cudf
import cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)
df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

None
1

With the default scheduler, the same snippet now behaves as expected:

import dask_cudf
import cudf

df = dask_cudf.from_cudf(cudf.DataFrame(data={'1': ['s1', 's2'] * 4, '2': [0, 1] * 4}), npartitions=2)
gdf = df.groupby('1').agg({'2': 'count'})
print(gdf.compute().index.name)
print(gdf.index.name)

Output

1
1

@rjzamora since this only occurs if there's multiple partitions, think it could be serialization / deserialization related?

> @rjzamora since this only occurs if there's multiple partitions, think it could be serialization / deserialization related?

That is also my suspicion - I will start digging into this soon

@rjzamora I've been poking at this a bit; it looks like 0-sized DataFrame indexes aren't handled properly in serialization → deserialization, where they end up as unnamed RangeIndexes.

i.e.

# gdf is the dask_cudf groupby result from the reproducer; its _meta is a 0-row DataFrame
serialized = gdf._meta.serialize()
deserialized = cudf.DataFrame.deserialize(*serialized)
type(deserialized.index)  # an unnamed RangeIndex; the original name and type are lost
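As a point of comparison, a plain pandas + pickle round-trip (a hypothetical stand-in for cudf's serialize/deserialize pair, not the actual cudf code path) preserves the name and type of a zero-row frame's index:

```python
import pickle

import pandas as pd

# A 0-row frame like dask's _meta, with a named, object-dtype index
meta = pd.DataFrame(
    {"2": pd.Series([], dtype="int64")},
    index=pd.Index([], dtype="object", name="1"),
)

# Round-trip through pickle; the index name and type survive
restored = pickle.loads(pickle.dumps(meta))
print(restored.index.name)  # -> 1
```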

Well, diving in deeper, it seems the issue is actually in how we reconstruct a DataFrame: if an index is empty, pandas and cuDF both override it with a RangeIndex when assigning columns. We likely just need to reach inside the object more during deserialization, blegh.

> Well, diving in deeper, it seems the issue is actually in how we reconstruct a DataFrame: if an index is empty, pandas and cuDF both override it with a RangeIndex when assigning columns. We likely just need to reach inside the object more during deserialization, blegh.

Good find - I just started looking into this also. Perhaps we can just do something like this:

import pickle
import uuid

import cudf
from cudf.core import column  # cudf's internal column module


def deserialize(header, frames):

    # Reconstruct the index
    index_frames = frames[: header["index_frame_count"]]

    idx_typ = pickle.loads(header["index"]["type"])
    index = idx_typ.deserialize(header["index"], index_frames)
    # Fall back to a throwaway unique name if the index is unnamed
    index_name = index.name or str(uuid.uuid4())

    # Reconstruct the columns
    column_frames = frames[header["index_frame_count"] :]

    column_names = list(pickle.loads(header["column_names"])) + [index_name]
    columns = column.deserialize_columns(header["columns"], column_frames) + [
        column.as_column(index)
    ]

    # Treat the index as a regular column, then restore it with set_index
    result = cudf.DataFrame(dict(zip(column_names, columns))).set_index(index_name)
    result.index.name = index.name
    return result

Note that the index type seems to be preserved if you treat it as a column and then use set_index before returning.
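The column-then-set_index trick can be sketched in plain pandas (cudf follows the same API here; this is an illustration, not the cudf code itself):

```python
import pandas as pd

# Start with a frame whose index has a name and a string dtype
df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["s1", "s2"], name="key"))

# Demote the index to a regular column, then promote it back with set_index;
# the name and dtype come through intact.
restored = df.reset_index().set_index("key")

print(restored.index.name)  # -> key
```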

@rjzamora That will hide the symptom here, but there's still the general problem of constructing a DataFrame from columns that have a defined index. I.e.:

pd.DataFrame(
    {
        'a': [],
        'b': pd.Series([], dtype='str'),
        'c': pd.Series([], index=pd.Index([], dtype='str'))
    },
    index=pd.Index([], dtype='float64')
)

You'd expect a DataFrame with an empty float64 index, but we currently create the DataFrame with the index first and then assign the columns, which ends up with an empty string index.
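A quick pandas check of that expectation (a hypothetical adaptation of the snippet above, substituting dtype="object" for the string columns): when an explicit index= is passed, pandas keeps it rather than adopting a column's index.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [],
        "b": pd.Series([], dtype="object"),
        "c": pd.Series([], index=pd.Index([], dtype="object")),
    },
    index=pd.Index([], dtype="float64"),
)
print(df.index.dtype)  # -> float64: the explicit index wins in pandas
```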

Ah, I misunderstood - I thought that pandas and cuDF were behaving in a consistent way here. I agree that we should fix the underlying problem.

@rjzamora Agreed, but it's going to be rough and ugly in the current state. I'd much rather wait for the refactor @shwina is currently working on before we try to tackle this properly; if we can work around it for now somehow, that would be more ideal.

> where if we can workaround it for now somehow that would be more ideal

Is something like d53b5b7 a reasonable short-term fix for the serialization symptom? Or, are you suggesting a short-term workaround in the DataFrame constructor?

Appreciate you both taking such a deep look into this

> where if we can workaround it for now somehow that would be more ideal
>
> Is something like d53b5b7 a reasonable short-term fix for the serialization symptom? Or, are you suggesting a short-term workaround in the DataFrame constructor?

Yes, that looks like a sufficient workaround to me for the short term.

@rjzamora @kkraus14 , are we comfortable moving forward with the proposed short term workaround?

It looks like the test in https://github.com/rjzamora/cudf/commit/d53b5b7e1a45406159f439fe573367bcc4ef8188 passes in current cudf as is

Looks like this is passing tests. Never mind!

@shwina beat me to it :). Closing the issue

@beckernick - I apologize for letting this slip without submitting a PR for the workaround, but I did assume this bug would go away after the refactor by @shwina.

Edit: Great! Glad this is resolved :)
