Describe the bug
Equality checks on string columns appear to always return True, even when the values differ. Something may be going wrong in the StringColumn unordered_compare.
import cudf
gdf = cudf.DataFrame({'a': ['hello'], 'b': ['goodbye']})
not_equal_gdf = cudf.DataFrame({'a': ['you are'], 'b': ['welcome']})
print(gdf)
print(not_equal_gdf)
print(gdf.equals(not_equal_gdf))
print(gdf == not_equal_gdf)
print(gdf.a.equals(not_equal_gdf.a))
print(gdf.b.equals(not_equal_gdf.b))
       a        b
0  hello  goodbye
         a        b
0  you are  welcome
True
      a     b
0  True  True
True
True
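For reference, here is the same comparison run through pandas. This is only a sketch of the behavior I would expect, on the assumption that cuDF is meant to match pandas semantics here. The `df` / `not_equal_df` names are just the pandas counterparts of the frames above.

```python
import pandas as pd

# Same data as the cuDF reproducer, built as pandas DataFrames.
df = pd.DataFrame({'a': ['hello'], 'b': ['goodbye']})
not_equal_df = pd.DataFrame({'a': ['you are'], 'b': ['welcome']})

# Whole-frame, elementwise, and per-column comparisons.
frames_equal = df.equals(not_equal_df)
elementwise = df == not_equal_df
col_a_equal = df.a.equals(not_equal_df.a)
col_b_equal = df.b.equals(not_equal_df.b)

print(frames_equal)
print(elementwise)
print(col_a_equal)
print(col_b_equal)
```

With pandas, every one of these comparisons reports the values as not equal, which is what I would expect from cuDF as well.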
cc @brhodes10
@felipeblazing This looks like an issue in how we're handling the string dictionary synchronization, no?
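To illustrate why unsynchronized dictionaries would produce exactly these symptoms, here is a minimal pure-Python sketch. It is a hypothetical model, not the actual cuDF or nvstrings code: `dictionary_encode`, `compare_codes_only`, and `compare_with_shared_dictionary` are made-up names, and it assumes each column stores a sorted dictionary of unique strings plus integer codes into it.

```python
def dictionary_encode(values):
    """Hypothetical encoding: sorted unique strings plus one code per row."""
    dictionary = sorted(set(values))
    index = {s: i for i, s in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def compare_codes_only(left, right):
    """Suspected failure mode: compare codes drawn from two unrelated dictionaries."""
    _, left_codes = dictionary_encode(left)
    _, right_codes = dictionary_encode(right)
    return [l == r for l, r in zip(left_codes, right_codes)]


def compare_with_shared_dictionary(left, right):
    """Encode both sides against one merged dictionary before comparing codes."""
    shared = sorted(set(left) | set(right))
    index = {s: i for i, s in enumerate(shared)}
    return [index[l] == index[r] for l, r in zip(left, right)]


# Each single-row column encodes its only string as code 0, so the codes match.
print(compare_codes_only(['hello'], ['you are']))
print(compare_with_shared_dictionary(['hello'], ['you are']))

# Both columns encode to codes [0, 1] even though no row is actually equal.
print(compare_codes_only(['b', 'c'], ['a', 'b']))
print(compare_with_shared_dictionary(['b', 'c'], ['a', 'b']))
```

Under this model, comparing raw codes reports every row as equal, while comparing against a merged dictionary reports every row as different. If the real code path skips the synchronization step, that would be consistent with the output in the report.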
More minimal reproducer:
print(cudf.Series(['a']) == cudf.Series(['b']))
EDIT: This does not work as expected.
print(cudf.Series(['b', 'c']) == cudf.Series(['a', 'b']))
@devavret If you need to discuss what's going on under the hood here for the StringColumns, just let me know. There are a couple of layers of translation.