Cudf: [BUG] Series.__getitem__ should also fetch based on index of values

Created on 31 May 2020  路  8Comments  路  Source: rapidsai/cudf

Describe the bug
Currently Series.__getitem__ is only fetching by self.index values rather than values at an index in the series.

Steps/Code to reproduce bug

  warnings.warn(errors.NumbaWarning(msg))
>>> s = cudf.Series([10, 111], index=['a', 'b'])
>>> ps = s.to_pandas()
>>> s[0]
Traceback (most recent call last):
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 111, in _loc_to_iloc
    arg, closest=False
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3956, in find_first_value
    return self._find_first_and_last(value)[0]
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3951, in _find_first_and_last
    first = column.as_column(found_indices).find_first_value(1)
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 341, in find_first_value
    raise ValueError("value not found")
ValueError: value not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 94, in __getitem__
    arg = self._loc_to_iloc(arg)
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 115, in _loc_to_iloc
    raise IndexError("label scalar is out of bound")
IndexError: label scalar is out of bound

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/series.py", line 757, in __getitem__
    return self.loc[arg]
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 96, in __getitem__
    raise IndexError("Failed to convert index to appropirate row")
IndexError: Failed to convert index to appropirate row
>>> ps[0]
10

Expected behavior
We should match pandas behavior where __getitem__ seems to be performing both iloc and loc.

>>> ps.iloc[0]
10
>>> ps.loc['a']
10
>>> ps[0]
10
>>> ps['a']
10

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source(branch-0.15)
bug cuDF (Python)

Most helpful comment

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.

All 8 comments

My opinion is that we should just recommend the use of iloc and loc here. They are more explicit and leave no room for ambiguity.

So this was a minimal repro to the actual broader issue.

The actual issue is when we hit itertools.zip_longest: https://github.com/rapidsai/cudf/blob/branch-0.15/python/cudf/cudf/core/dataframe.py#L241

It internally uses __iter__ + __next__ or __getitem__.
If we want to leave the __getitem__ behavior as is, we will have to implement __iter__ & __next__ methods to correctly return values for all the APIs relying on using an iterator.

Gotcha. If implementing __iter__ and __next__ for Series yields the correct behaviour here, let's do that. Otherwise, we could just specialize for Series in DataFrame.__init__.

Adding __iter__ and __next__ for Series fixed the behavior. 612393a78e1b3ec3118a18936fa22569b0a36a2a

On second thought, we should definitely just special case __init__ here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.

We also probably don't want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

On second thought, we should definitely just special case __init__ here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.

+1, Agreed. We'll also need index handling to be happening so a special case for Series is required.

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

I'm super okay with not implementing this as this might as-well introduce hard to find perf issues in dask if there is any iterator construction happening there.

@kkraus14 any thoughts on this?

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.

Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.

Sounds good to me 馃憤

Was this page helpful?
0 / 5 - 0 ratings