Cudf: [BUG] Series/DataFrame equality checks are incorrect with StringColumns with different dictionaries

Created on 15 May 2019 · 4 Comments · Source: rapidsai/cudf

Describe the bug
Equality checks on string columns appear to always return True, even when the strings differ. Something may be going wrong in the string column's unordered_compare.

import cudf

gdf = cudf.DataFrame({'a': ['hello'], 'b': ['goodbye']})
not_equal_gdf = cudf.DataFrame({'a': ['you are'], 'b': ['welcome']})

print(gdf)
print(not_equal_gdf)
print(gdf.equals(not_equal_gdf))
print(gdf == not_equal_gdf)

print(gdf.a.equals(not_equal_gdf.a))
print(gdf.b.equals(not_equal_gdf.b))

Output:

       a        b
0  hello  goodbye
         a        b
0  you are  welcome
True
      a     b
0  True  True
True
True

cc @brhodes10

bug libcudf

All 4 comments

@felipeblazing This looks like an issue in how we're handling the string dictionary synchronization, no?
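
To illustrate the kind of failure the question above has in mind, here is a minimal pure-Python sketch. It assumes each string column is dictionary-encoded against its own sorted dictionary and that equality is computed on the integer codes. The helpers `dictionary_encode`, `naive_equal`, and `synced_equal` are hypothetical names for illustration only; this is not cudf's actual implementation, and the thread does not confirm that this is the root cause.

```python
def dictionary_encode(values):
    """Encode strings as integer codes into a sorted, per-column dictionary."""
    dictionary = sorted(set(values))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def naive_equal(left, right):
    # Hypothetical buggy comparison: compares raw codes and ignores that
    # each column was encoded against a different dictionary.
    _, left_codes = dictionary_encode(left)
    _, right_codes = dictionary_encode(right)
    return [l == r for l, r in zip(left_codes, right_codes)]


def synced_equal(left, right):
    # Hypothetical fix: encode both columns against one shared dictionary,
    # so that equal codes imply equal strings.
    shared = {s: i for i, s in enumerate(sorted(set(left) | set(right)))}
    return [shared[l] == shared[r] for l, r in zip(left, right)]


print(naive_equal(['hello'], ['you are']))     # [True]  (both encode to code 0)
print(synced_equal(['hello'], ['you are']))    # [False]
print(naive_equal(['b', 'c'], ['a', 'b']))     # [True, True]
print(synced_equal(['b', 'c'], ['a', 'b']))    # [False, False]
```

Under these assumptions, any two columns with the same code layout compare equal regardless of their contents, which would match the always-True results reported in the issue.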

More minimal reproducer:

print(cudf.Series(['a']) == cudf.Series(['b']))

EDIT: This does not work as expected.

print(cudf.Series(['b', 'c']) == cudf.Series(['a', 'b']))

@devavret If you need to discuss what's going on under the hood here for the StringColumns, just let me know. There are a couple of layers of translation.
