Cudf: [BUG] Series.getitem should also fetch based on index of values

Created on 31 May 2020 · 8Comments · Source: rapidsai/cudf

Describe the bug
Currently Series.__getitem__ is only fetching by self.index values rather than values at an index in the series.

Steps/Code to reproduce bug

  warnings.warn(errors.NumbaWarning(msg))
>>> s = cudf.Series([10, 111], index=['a', 'b'])
>>> ps = s.to_pandas()
>>> s[0]
Traceback (most recent call last):
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 111, in _loc_to_iloc
    arg, closest=False
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3956, in find_first_value
    return self._find_first_and_last(value)[0]
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3951, in _find_first_and_last
    first = column.as_column(found_indices).find_first_value(1)
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 341, in find_first_value
    raise ValueError("value not found")
ValueError: value not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 94, in __getitem__
    arg = self._loc_to_iloc(arg)
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 115, in _loc_to_iloc
    raise IndexError("label scalar is out of bound")
IndexError: label scalar is out of bound

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/series.py", line 757, in __getitem__
    return self.loc[arg]
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 96, in __getitem__
    raise IndexError("Failed to convert index to appropirate row")
IndexError: Failed to convert index to appropirate row
>>> ps[0]
10

Expected behavior
We should match pandas behavior where __getitem__ seems to be performing both iloc and loc.

>>> ps.iloc[0]
10
>>> ps.loc['a']
10
>>> ps[0]
10
>>> ps['a']
10

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: from source(branch-0.15)

bug cuDF (Python)

Source

galipremsagar

Most helpful comment

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.

kkraus14 on 1 Jun 2020

👍2

All 8 comments

My opinion is that we should just recommend the use of iloc and loc here. They are more explicit and leave no room for ambiguity.

shwina on 31 May 2020

So this was a minimal repro to the actual broader issue.

The actual issue is when we hit itertools.zip_longest: https://github.com/rapidsai/cudf/blob/branch-0.15/python/cudf/cudf/core/dataframe.py#L241

It internally uses __iter__ + __next__ or __getitem__.
If we want to leave the __getitem__ behavior as is, we will have to implement __iter__ & __next__ methods to correctly return values for all the APIs relying on using an iterator.

galipremsagar on 31 May 2020

Gotcha. If implementing __iter__ and __next__ for Series yields the correct behaviour here, let's do that. Otherwise, we could just specialize for Series in DataFrame.__init__.

shwina on 31 May 2020

Adding __iter__ and __next__ for Series fixed the behavior. 612393a78e1b3ec3118a18936fa22569b0a36a2a

galipremsagar on 31 May 2020

On second thought, we should definitely just special case __init__ here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.

We also probably don't want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

shwina on 1 Jun 2020

On second thought, we should definitely just special case __init__ here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.

+1, Agreed. We'll also need index handling to be happening so a special case for Series is required.

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

I'm super okay with not implementing this as this might as-well introduce hard to find perf issues in dask if there is any iterator construction happening there.

@kkraus14 any thoughts on this?

galipremsagar on 1 Jun 2020

We also probably _don't_ want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.

kkraus14 on 1 Jun 2020

👍2

Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.

Sounds good to me 👍

galipremsagar on 1 Jun 2020

Was this page helpful?

0 / 5 - 0 ratings

Related issues

[BUG] mean() fails on groupby

AjayThorve · 3Comments

[BUG] to_orc fails if one of the columns is a string column

ayushdg · 3Comments

automatically infer dtypes when using read_csv

ericmjl · 3Comments

[BUG] "NaT" string literal needs to be recognized as `null` in to_timestamps method

galipremsagar · 3Comments

[FEA] Update Python implementation of fillna to use libcudf function

kkraus14 · 3Comments

Cudf: [BUG] Series.__getitem__ should also fetch based on index of values

Most helpful comment

All 8 comments

Related issues

Cudf: [BUG] Series.getitem should also fetch based on index of values