Cudf: [BUG] Checking equality of Series with different index types results in ValueError

Created on 23 Oct 2020 · 6 Comments · Source: rapidsai/cudf

Describe the bug
Checking the equality of two Series with different index types raises a ValueError at this line.

Steps/Code to reproduce bug

from cudf import Series
from cudf.core.index import Int64Index

s = Series([1, 2, 3])
i = Int64Index(s)
t = Series(s, index=i)

s == t  # ValueError

Expected behavior
An equality test between Series should not raise an error, regardless of index type.

Environment overview

Environment location: Docker, but also reproduced on bare metal with conda install
Method of cuDF install: Docker
- docker pull rapidsai/rapidsai-dev-nightly:0.16-cuda10.2-devel-centos7-py3.8
- docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/rapidsai-dev-nightly:0.16-cuda10.2-devel-centos7-py3.8

Additional context
Discovered this error while trying to use cuML's TargetEncoder. Can work around by slicing the dataset into any two parts (which gives them RangeIndex indexes) and applying the encoder separately on each before recombining.

bug cuDF (Python)

wphicks

All 6 comments

I believe this is consistent with pandas, as the equality promise requires aligned indexes. @wphicks, in your example, what would you expect to see as the output of that operation? Do you just want to know whether the Series are "equal", or specifically which values are in disagreement? If the former, you could use s.equals(t).

beckernick on 23 Oct 2020
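
To illustrate the distinction drawn in the comment above between s.equals(t) and s == t, here is a minimal sketch. It uses pandas rather than cuDF so that it runs without a GPU; that cuDF behaves the same way is an assumption based on this thread, not something verified here. The variable names same_object and elementwise_error are illustrative only.

```python
import pandas as pd

s = pd.Series([1, 2, 3])                    # default RangeIndex: labels 0, 1, 2
t = pd.Series([1, 2, 3], index=[1, 2, 3])   # integer index: labels 1, 2, 3

# Whole-object check: returns a single boolean instead of raising,
# and is False here because the index labels differ.
same_object = s.equals(t)

# Element-wise check: pandas refuses to compare Series whose labels differ.
try:
    s == t
    elementwise_error = None
except ValueError as exc:
    elementwise_error = str(exc)

print(same_object)
print(elementwise_error)
```

The first form answers "are these the same Series?", while the second is meant to report which positions disagree and therefore needs matching labels.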

I think the issue @wphicks is pointing out is that we aren't handling the situation where one side has a RangeIndex and the other side has, for example, an Int64Index.

kkraus14 on 23 Oct 2020

cc @galipremsagar

kkraus14 on 23 Oct 2020

Do you just want to know if the Series are "equal" or specifically which values are in disagreement? If the former, you could use s.equals(t)

The comparison s == t raises an error to match pandas behavior, which does the same whenever s.index.equals(t.index) returns False.

For example:

>>> from cudf import Series
>>> from cudf.core.index import Int64Index
>>> 
>>> s = Series([1, 2, 3])
>>> i = Int64Index(s)
>>> t = Series(s, index=i)
>>> 
>>> s == t
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/core/series.py", line 1492, in __eq__
    return self._binaryop(other, "eq")
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/cudf/core/series.py", line 1077, in _binaryop
    "Can only compare identically-labeled "
ValueError: Can only compare identically-labeled Series objects
>>> pt = t.to_pandas()
>>> ps = s.to_pandas()
>>> pt.index
Int64Index([1, 2, 3], dtype='int64')
>>> ps.index
RangeIndex(start=0, stop=3, step=1)
>>> pt == ps
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/pandas/core/ops/common.py", line 65, in new_method
    return method(self, other)
  File "/nvme/0/pgali/envs/cudfdev/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 365, in wrapper
    raise ValueError("Can only compare identically-labeled Series objects")
ValueError: Can only compare identically-labeled Series objects
>>> pt.index.equals(ps.index)
False
>>> ps.index.equals(pt.index)
False
>>> s.index.equals(t.index)
False
>>> t.index.equals(s.index)
False

Since t.index (Int64Index) and s.index (RangeIndex) are of different class types, an equality check between them does not return True.

galipremsagar on 23 Oct 2020
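
For callers who only care about values, one option is to drop the labels before comparing. The sketch below again uses pandas as a stand-in for cuDF (an assumption, so that it runs without a GPU), and compare_by_position is a hypothetical helper, not part of either library. It also shows that, in pandas at least, Index.equals looks at the labels rather than the index class; in the transcript above the two indexes also hold different labels (0, 1, 2 versus 1, 2, 3).

```python
import pandas as pd

s = pd.Series([1, 2, 3])                    # RangeIndex labels 0, 1, 2
t = pd.Series([1, 2, 3], index=[1, 2, 3])   # integer labels 1, 2, 3


def compare_by_position(left, right):
    """Hypothetical helper: element-wise equality that ignores index labels."""
    if len(left) != len(right):
        raise ValueError("Series must have the same length")
    # reset_index(drop=True) gives both sides a fresh 0..n-1 RangeIndex.
    return left.reset_index(drop=True) == right.reset_index(drop=True)


positional = compare_by_position(s, t)

# In pandas, a RangeIndex and an integer Index with the same labels are equal.
same_labels = pd.RangeIndex(3).equals(pd.Index([0, 1, 2]))

print(positional.all())
print(same_labels)
```

Whether ignoring labels is appropriate depends on the caller; when the labels carry meaning, aligning the two Series explicitly is the safer choice.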

Ah interesting! Looks like cuML is just using the equality check inappropriately, then. I'll open an issue over there.

wphicks on 23 Oct 2020

@wphicks This change was introduced very recently (https://github.com/rapidsai/cudf/issues/6499), so I would suggest updating the usages in cuML accordingly.

galipremsagar on 23 Oct 2020
