Cudf: [BUG] cudf.DataFrame().apply() function raises 'cannot concatenate data' error

Created on 12 Nov 2020 · 4 Comments · Source: rapidsai/cudf

Describe the bug
cudf.DataFrame().groupby().apply() raises a ValueError instead of producing a result when the applied function returns a float. See the reproducer code below for details.

Steps/Code to reproduce bug

import cudf
import numpy as np

issue_df = cudf.DataFrame({'A':[1,1,2,2,3,3,4,4,5,5],
                          'B':[0.01,np.nan,0.03,0.04,np.nan,0.06,0.07,0.08,0.09,1.0]})
print(issue_df)

   A     B
0  1  0.01
1  1  <NA>
2  2  0.03
3  2  0.04
4  3  <NA>
5  3  0.06
6  4  0.07
7  4  0.08
8  5  0.09
9  5   1.0

def map_fun(x):
    x = x[~x['B'].isna()]
    ticker = x.shape[0]
    num = 10
    full = ticker / num
    print(full)
    return full

result = issue_df.groupby('A').apply(lambda x: map_fun(x))

0.1
0.2
0.1
0.2
0.2
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-146-f9f92077b053> in <module>
      7     return full
      8 
----> 9 result = issue_df.groupby('A').apply(lambda x: map_fun(x))

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/groupby/groupby.py in apply(self, function)
    362             grouped_values[s:e] for s, e in zip(offsets[:-1], offsets[1:])
    363         ]
--> 364         result = cudf.concat([function(chk) for chk in chunks])
    365         if self._sort:
    366             result = result.sort_index()

/opt/conda/envs/rapids/lib/python3.7/site-packages/cudf/core/reshape.py in concat(objs, axis, ignore_index, sort)
    199             typs.add(cudf.Series)
    200         else:
--> 201             raise ValueError(f"cannot concatenate object of type {type(o)}")
    202 
    203     allowed_typs = {cudf.Series, cudf.DataFrame}

ValueError: cannot concatenate object of type <class 'float'>
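
For comparison, the same kind of failure can be sketched with plain pandas, whose concat also accepts only Series and DataFrame objects. This is only an illustration of why a list of per-group floats cannot be concatenated; it does not reproduce the cuDF internals, and the `per_group_results` list is a hypothetical stand-in for what `map_fun` returns for each group.

```python
import pandas as pd

# Hypothetical stand-in for the per-group return values of map_fun:
# plain Python floats rather than Series or DataFrame objects.
per_group_results = [0.1, 0.2, 0.1, 0.2, 0.2]

try:
    pd.concat(per_group_results)
    error_message = None
except (TypeError, ValueError) as exc:
    # pandas rejects non-Series/DataFrame inputs, much like the cuDF traceback above.
    error_message = str(exc)

print(error_message)

# Wrapping each scalar in a one-element Series makes the inputs concatenable.
wrapped = pd.concat(
    [pd.Series([value]) for value in per_group_results],
    ignore_index=True,
)
print(wrapped.tolist())
```

Wrapping each float before concatenating is one way around the failure, which suggests that an applied function returning a Series or DataFrame would avoid the error path shown in the traceback.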

Expected behavior
It should return the new dataframe. pandas produces the correct result for the same code.
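
A minimal sketch of the pandas behavior referred to here, using the same data and the same function as the reproducer (with the print call dropped and the constant 10 inlined). It assumes a reasonably recent pandas; some versions may emit a deprecation warning about grouping columns in apply.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   'B': [0.01, np.nan, 0.03, 0.04, np.nan, 0.06, 0.07, 0.08, 0.09, 1.0]})

def map_fun(x):
    # Keep only rows where B is not null, then scale the row count by 10.
    x = x[~x['B'].isna()]
    return x.shape[0] / 10

# pandas collects the scalar returned for each group into a Series indexed by A.
result = df.groupby('A').apply(map_fun)
print(result)
```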

Environment details
Environment location: Docker
Method of cuDF install: Docker
docker run command used:
sudo docker run --gpus all -it -v /data/project/demo/:/rapids/notebooks/demo/ -p 8889:8888 nvcr.io/nvidia/rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04

Please let me know if this is enough to reproduce the issue, thanks!

bug cuDF (Python)

All 4 comments

@dominicshanshan Thanks for the reproducer. I have identified the root cause, will issue a fix.

@galipremsagar I figured out a trick to get it working in the meantime: have the applied function return a one-row DataFrame instead of a float.

def map_cal(x):
    df = cudf.DataFrame(columns=['full_coverage'])
    x = x[~x['signal'].isna()]
    ticker = x.shape[0]
    num = 4000
    full = ticker / num
    coverage = df.append({'full_coverage':full}, ignore_index=True)
    return coverage

df_cal = cudf.DataFrame(signal_data_gpu.groupby('tradeDate').apply(lambda x: map_cal(x))).reset_index(drop=True)
print(df_cal)

tradeDate = cudf.DataFrame(signal_data_gpu.tradeDate.unique(), columns=['tradeDate']).reset_index(drop=True)
print(tradeDate)

coverage_data = df_cal.join(tradeDate, sort=True)
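
Another possible route, sketched here with pandas on the reproducer data from the issue, avoids apply altogether: a groupby count of the non-null values divided by the constant gives the same per-group ratio. This is only a sketch. It is an assumption that swapping the pandas import for cudf gives the same result, and the `full_coverage` column name is borrowed from the workaround above.

```python
import numpy as np
import pandas as pd  # assumption: cudf could be substituted here

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                   'B': [0.01, np.nan, 0.03, 0.04, np.nan, 0.06, 0.07, 0.08, 0.09, 1.0]})

num = 10

# count() skips null values, so it matches filtering with isna() and taking shape[0].
coverage = (
    (df.groupby('A')['B'].count() / num)
    .rename('full_coverage')
    .reset_index()
)
print(coverage)
```

Because the group key stays attached to the result, this variant would not need the separate join against the unique keys.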

really helped, thanks a lot

Bug fixed in the nightly version, thank you three thousand!
