Describe the bug
Checking the equality of two series with different index types results in a ValueError at this line
Steps/Code to reproduce bug
from cudf import Series
from cudf.core.index import Int64Index
s = Series([1, 2, 3])
i = Int64Index(s)
t = Series(s, index=i)
s == t # ValueError
Expected behavior
An equality test between series should not raise an error regardless of index
Environment overview
docker pull rapidsai/rapidsai-dev-nightly:0.16-cuda10.2-devel-centos7-py3.8
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/rapidsai-dev-nightly:0.16-cuda10.2-devel-centos7-py3.8
Additional context
Discovered this error while trying to use cuML's TargetEncoder. Can work around by slicing the dataset into any two parts (which gives them RangeIndex indexes) and applying the encoder separately on each before recombining.
This is consistent with pandas I believe, as the equality promise requires aligned indexes. @wphicks in your example, what would you expect to see as the output of that operation? Do you just want to know if the Series are "equal" or specifically which values are in disagreement? If the former, you could use s.equals(t)
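To make the difference between the two checks concrete, here is a minimal sketch using pandas on the CPU (an assumption made so it runs without a GPU; the thread indicates cuDF is meant to match this behavior). The series are built directly with different labels rather than with the exact constructor calls from the report.

```python
import pandas as pd

s = pd.Series([1, 2, 3])                             # default RangeIndex 0..2
t = pd.Series([1, 2, 3], index=pd.Index([1, 2, 3]))  # explicit integer labels 1..3

# Whole-object check: returns a single bool and does not raise,
# even though the labels differ.
whole_equal = s.equals(t)

# Element-wise check: raises because the two series are not identically labeled.
try:
    s == t
    elementwise_raised = False
except ValueError:
    elementwise_raised = True

print(whole_equal, elementwise_raised)
```

So `equals` answers "are these the same series?", while `==` answers "which elements differ?" and needs matching labels to do so.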
I think the issue @wphicks is pointing out is that we aren't handling the situation where one side has a RangeIndex and the other side has something like an Int64Index for example.
cc @galipremsagar
Do you just want to know if the Series are "equal" or specifically which values are in disagreement? If the former, you could use
s.equals(t)
+1
The above code raises an error to match pandas behavior, which applies whenever s.index.equals(t.index) returns False.
For example:
>>> from cudf import Series
>>> from cudf.core.index import Int64Index
>>>
>>> s = Series([1, 2, 3])
>>> i = Int64Index(s)
>>> t = Series(s, index=i)
>>>
>>> s == t
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/core/series.py", line 1492, in __eq__
    return self._binaryop(other, "eq")
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/core/series.py", line 1077, in _binaryop
    "Can only compare identically-labeled "
ValueError: Can only compare identically-labeled Series objects
>>> pt = t.to_pandas()
>>> ps = s.to_pandas()
>>> pt.index
Int64Index([1, 2, 3], dtype='int64')
>>> ps.index
RangeIndex(start=0, stop=3, step=1)
>>> pt == ps
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/pandas/core/ops/common.py", line 65, in new_method
    return method(self, other)
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 365, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
>>> pt.index.equals(ps.index)
False
>>> ps.index.equals(pt.index)
False
>>> s.index.equals(t.index)
False
>>> t.index.equals(s.index)
False
Since t.index (Int64Index) and s.index (RangeIndex) are of different class types, an equality check between them does not return True.
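For callers that want an element-wise result even when the labels differ, one option is to check the indexes first and fall back to a positional comparison. The sketch below uses pandas so it runs without a GPU; `compare_elementwise` is a hypothetical helper written for illustration, not part of pandas, cuDF, or cuML, and it assumes both series have the same length.

```python
import pandas as pd

def compare_elementwise(a, b):
    """Hypothetical helper: element-wise equality that tolerates differing labels.

    Assumes a and b have the same length.
    """
    if a.index.equals(b.index):
        # Labels match, so the ordinary comparison is safe.
        return a == b
    # Labels differ: drop them and compare by position instead.
    return a.reset_index(drop=True) == b.reset_index(drop=True)

s = pd.Series([1, 2, 3])                             # RangeIndex 0..2
t = pd.Series([1, 2, 3], index=pd.Index([1, 2, 3]))  # labels 1..3

result = compare_elementwise(s, t)
print(result.tolist())
```

Whether positional comparison is the right fallback depends on what the labels mean in the calling code; if they carry meaning, aligning or reindexing explicitly may be the better choice.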
Ah interesting! Looks like cuML is just using the equality check inappropriately, then. I'll open an issue over there.
@wphicks This change was introduced very recently (https://github.com/rapidsai/cudf/issues/6499), so I would suggest updating the affected code in cuML accordingly.