Describe the bug
Concatenating two cuDF DataFrames produced by two groupby operations along axis=1 returns an incorrect result: both inputs have a column named v3, and only one of them appears in the output.
Steps/Code to reproduce bug
import cudf
cdf = cudf.DataFrame({'id4': 4*list(range(6)),
'id5': 4*list(reversed(range(6))), 'v3': 6*list(range(4))})
print(cdf)
cdf_std = cdf.groupby(['id4', 'id5'])[['v3']].std()
cdf_std
               v3
id4 id5
0   5    1.154701
1   4    1.154701
2   3    1.154701
3   2    1.154701
4   1    1.154701
5   0    1.154701
cdf_med = cdf.groupby(['id4', 'id5'])[['v3']].quantile(q=0.5)
cdf_med
          v3
id4 id5
0   5    1.0
1   4    2.0
2   3    1.0
3   2    2.0
4   1    1.0
5   0    2.0
result = cudf.concat([cdf_med, cdf_std], axis=1)
result
               v3
id4 id5
0   5    1.154701
1   4    1.154701
2   3    1.154701
3   2    1.154701
4   1    1.154701
5   0    1.154701
import pandas as pd
pd.concat([cdf_med.to_pandas(), cdf_std.to_pandas()], axis=1)
          v3        v3
id4 id5
0   5    1.0  1.154701
1   4    2.0  1.154701
2   3    1.0  1.154701
3   2    2.0  1.154701
4   1    1.0  1.154701
5   0    2.0  1.154701
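The expected values in the pandas output can be cross-checked without a GPU. The following is a minimal sketch using only the Python standard library; it assumes the sample standard deviation (ddof=1) and that the 0.5 quantile matches the median for these groups, which is what the numbers above suggest. The variable names are illustrative and not part of the original report.

```python
import statistics

# Rebuild the three columns of the example frame as plain lists.
id4 = 4 * list(range(6))
id5 = 4 * list(reversed(range(6)))
v3 = 6 * list(range(4))

# Group the v3 values by their (id4, id5) key pair.
groups = {}
for key, value in zip(zip(id4, id5), v3):
    groups.setdefault(key, []).append(value)

# Median and sample standard deviation per group, rounded like the printed output.
expected = {
    key: (statistics.median(vals), round(statistics.stdev(vals), 6))
    for key, vals in sorted(groups.items())
}

for key, (med, std) in expected.items():
    print(key, med, std)
```

Each group should show both a median and a standard deviation, so a correct axis=1 concat needs to carry two value columns per key.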
docker run --gpus all -it -v ${PWD}:/rapids -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/rapidsai:0.12-cuda10.1-runtime-ubuntu16.04-py3.6
I believe this might stem from the fact that cudf does not support multiple columns with the same column name. A quick workaround could be renaming the v3 column on one of the dataframes:
cdf_med = cdf_med.rename(columns={"v3": "v3_median"})
result=cudf.concat([cdf_med, cdf_std], axis=1)
print(result)
         v3_median        v3
id4 id5
0   5          1.0  1.154701
1   4          2.0  1.154701
2   3          1.0  1.154701
3   2          2.0  1.154701
4   1          1.0  1.154701
5   0          2.0  1.154701
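If more than one column clashes, the rename workaround can be generalized. Below is a rough sketch of a hypothetical helper, `suffix_renames` (not part of cudf or pandas), that builds one rename mapping per frame from plain lists of column names. It assumes the suffixed names do not themselves collide with existing columns.

```python
from collections import Counter


def suffix_renames(column_lists, suffixes):
    """Return one rename mapping per frame for names that occur more than once."""
    # Count how often each column name appears across all frames.
    counts = Counter(name for cols in column_lists for name in cols)
    mappings = []
    for cols, suffix in zip(column_lists, suffixes):
        # Only names that clash get a suffix; unique names are left alone.
        mappings.append(
            {name: f"{name}{suffix}" for name in cols if counts[name] > 1}
        )
    return mappings


med_map, std_map = suffix_renames([["v3"], ["v3"]], ["_median", "_std"])
print(med_map, std_map)
```

Each mapping could then be passed as `columns=` to `rename` on the matching frame before calling `concat`.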
@ayushdg thank you. We will test that.
@kkraus14 Is this (multiple columns of same name) something we will fix? Or should this be closed?
I believe this will be fixed with #4344
@kkraus14 @harrism This issue isn't fixed.
The issue is actually that we are not able to have multiple columns with the same name.
@galipremsagar could we raise an error then instead of silently corrupting the data?
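Until the library itself raises, a user-side guard is one way to avoid the silent data loss. The sketch below shows a hypothetical `check_unique_columns` helper (an assumption for illustration, not a cudf API) that works on any objects exposing a `.columns` iterable and raises `ValueError` when a name appears more than once.

```python
from collections import Counter


def check_unique_columns(frames):
    """Raise ValueError if any column name occurs more than once across frames."""
    # Tally every column name across all frames that would be concatenated.
    counts = Counter(name for frame in frames for name in frame.columns)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(
            f"duplicate column names would collide in concat(axis=1): {duplicates}"
        )
```

Calling `check_unique_columns([cdf_med, cdf_std])` before `cudf.concat([cdf_med, cdf_std], axis=1)` would turn the silent overwrite in this report into an explicit error.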