Cudf: [BUG] cudf concat with two groupby'ed cudfs does not yield correct result

Created on 7 Feb 2020 · 6 Comments · Source: rapidsai/cudf

Describe the bug

Concatenating two cuDF DataFrames along axis=1, each obtained from a groupby operation, does not give the correct result: only one of the two v3 columns survives.

Steps/Code to reproduce bug

import cudf
import pandas as pd  # used below for the pandas comparison

cdf = cudf.DataFrame({'id4': 4*list(range(6)),
                      'id5': 4*list(reversed(range(6))),
                      'v3': 6*list(range(4))})
print(cdf)
cdf_std = cdf.groupby(['id4', 'id5'])[['v3']].std()
cdf_std
        v3
id4 id5 
0   5   1.154701
1   4   1.154701
2   3   1.154701
3   2   1.154701
4   1   1.154701
5   0   1.154701



cdf_med = cdf.groupby(['id4', 'id5'])[['v3']].quantile(q=0.5)

cdf_med

        v3
id4 id5 
0   5   1.0
1   4   2.0
2   3   1.0
3   2   2.0
4   1   1.0
5   0   2.0



result = cudf.concat([cdf_med, cdf_std], axis=1)

result

             v3
id4 id5 
0   5   1.154701
1   4   1.154701
2   3   1.154701
3   2   1.154701
4   1   1.154701
5   0   1.154701



pd.concat([cdf_med.to_pandas(), cdf_std.to_pandas()], axis=1)

          v3          v3
id4 id5     
0   5   1.0 1.154701
1   4   2.0 1.154701
2   3   1.0 1.154701
3   2   2.0 1.154701
4   1   1.0 1.154701
5   0   2.0 1.154701



 docker run --gpus all -it -v ${PWD}:/rapids -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/rapidsai:0.12-cuda10.1-runtime-ubuntu16.04-py3.6
Labels: bug, cuDF (Python)

All 6 comments

I believe this might stem from the fact that cudf does not support multiple columns with the same column name. A quick workaround could be renaming the v3 column on one of the dataframes:

cdf_med = cdf_med.rename(columns={"v3": "v3_median"})
result = cudf.concat([cdf_med, cdf_std], axis=1)
print(result)
         v3_median        v3
id4 id5                     
0   5          1.0  1.154701
1   4          2.0  1.154701
2   3          1.0  1.154701
3   2          2.0  1.154701
4   1          1.0  1.154701
5   0          2.0  1.154701
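
The rename workaround can be generalized to any number of frames. Here is a minimal sketch, assuming a hypothetical helper `unique_column_renames` (not part of cuDF) that takes the column names of each frame plus one suffix per frame, and returns one rename mapping per frame covering only the names that collide:

```python
from collections import Counter


def unique_column_renames(columns_per_frame, suffixes):
    """Hypothetical helper (not a cuDF API): build one rename mapping per frame.

    Only names that occur in more than one frame are renamed, by appending
    that frame's suffix; all other names are left untouched.
    """
    if len(columns_per_frame) != len(suffixes):
        raise ValueError("need exactly one suffix per frame")
    # Count in how many frames each name appears; set() ignores repeats
    # within a single frame.
    counts = Counter(name for cols in columns_per_frame for name in set(cols))
    return [
        {name: f"{name}{suffix}" for name in cols if counts[name] > 1}
        for cols, suffix in zip(columns_per_frame, suffixes)
    ]


# Column names taken from the example above: each frame has a single 'v3' column.
mappings = unique_column_renames([["v3"], ["v3"]], ["_median", "_std"])
print(mappings)  # [{'v3': 'v3_median'}, {'v3': 'v3_std'}]
```

Each mapping could then be passed to rename(columns=...) on the corresponding frame before calling cudf.concat(..., axis=1).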

@ayushdg thank you. We will test that.

@kkraus14 Is this (multiple columns of same name) something we will fix? Or should this be closed?

I believe this will be fixed with #4344

@kkraus14 @harrism This issue isn't fixed.
The issue is actually that we are not able to have multiple columns with the same name.

@galipremsagar could we raise an error then instead of silently corrupting the data?
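
One way to get that behavior on the caller side is a small guard that runs before the concat. This is a minimal sketch, assuming a hypothetical function `check_unique_columns` (not a cuDF API) that only relies on each frame exposing an iterable `.columns` attribute. `FakeFrame` is a stand-in used here so the example runs without a GPU:

```python
from collections import Counter


def check_unique_columns(frames):
    """Hypothetical guard (not a cuDF API): raise if a column name repeats."""
    counts = Counter(name for frame in frames for name in frame.columns)
    duplicates = sorted(str(name) for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(
            "duplicate column names across frames: " + ", ".join(duplicates)
        )


class FakeFrame:
    """Stand-in for a DataFrame; only the .columns attribute is modeled."""

    def __init__(self, columns):
        self.columns = list(columns)


try:
    # Mirrors the report: both groupby results carry a column named 'v3'.
    check_unique_columns([FakeFrame(["v3"]), FakeFrame(["v3"])])
except ValueError as err:
    print(err)  # duplicate column names across frames: v3
```

Calling the guard on the real frames just before cudf.concat([cdf_med, cdf_std], axis=1) would fail loudly instead of returning a frame with a column missing.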
