Describe the bug
Currently Series.__getitem__ is only fetching by self.index values rather than values at an index in the series.
Steps/Code to reproduce bug
warnings.warn(errors.NumbaWarning(msg))
>>> s = cudf.Series([10, 111], index=['a', 'b'])
>>> ps = s.to_pandas()
>>> s[0]
Traceback (most recent call last):
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 111, in _loc_to_iloc
arg, closest=False
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3956, in find_first_value
return self._find_first_and_last(value)[0]
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 3951, in _find_first_and_last
first = column.as_column(found_indices).find_first_value(1)
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/numerical.py", line 341, in find_first_value
raise ValueError("value not found")
ValueError: value not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 94, in __getitem__
arg = self._loc_to_iloc(arg)
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 115, in _loc_to_iloc
raise IndexError("label scalar is out of bound")
IndexError: label scalar is out of bound
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/series.py", line 757, in __getitem__
return self.loc[arg]
File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/indexing.py", line 96, in __getitem__
raise IndexError("Failed to convert index to appropirate row")
IndexError: Failed to convert index to appropirate row
>>> ps[0]
10
Expected behavior
We should match pandas behavior where __getitem__ seems to be performing both iloc and loc.
>>> ps.iloc[0]
10
>>> ps.loc['a']
10
>>> ps[0]
10
>>> ps['a']
10
Environment overview (please complete the following information)
branch-0.15)My opinion is that we should just recommend the use of iloc and loc here. They are more explicit and leave no room for ambiguity.
So this was a minimal repro to the actual broader issue.
The actual issue is when we hit itertools.zip_longest: https://github.com/rapidsai/cudf/blob/branch-0.15/python/cudf/cudf/core/dataframe.py#L241
It internally uses __iter__ + __next__ or __getitem__.
If we want to leave the __getitem__ behavior as is, we will have to implement __iter__ & __next__ methods to correctly return values for all the APIs relying on using an iterator.
Gotcha. If implementing __iter__ and __next__ for Series yields the correct behaviour here, let's do that. Otherwise, we could just specialize for Series in DataFrame.__init__.
Adding __iter__ and __next__ for Series fixed the behavior. 612393a78e1b3ec3118a18936fa22569b0a36a2a
On second thought, we should definitely just special case __init__ here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.
We also probably don't want to implement __iter__ and __next__ for Series. If someone wants to iterate over the values of a cudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.
On second thought, we should definitely just special case
__init__here for Series. Constructing the DataFrame by iterating over the Series will be abysmally slow because it will be tons of DtoH copies.
+1, Agreed. We'll also need index handling to be happening so a special case for Series is required.
We also probably _don't_ want to implement
__iter__and__next__forSeries. If someone wants to iterate over the values of acudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.
I'm super okay with not implementing this as this might as-well introduce hard to find perf issues in dask if there is any iterator construction happening there.
@kkraus14 any thoughts on this?
We also probably _don't_ want to implement
__iter__and__next__forSeries. If someone wants to iterate over the values of acudf.Series, they're much better off copying to a host object (Pandas/PyArrow) and iterating over that.
Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.
Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.
Sounds good to me 馃憤
Most helpful comment
Strong +1 to this. I'd even argue we should make those explicitly throw in a clear way that indicate that you shouldn't use iteration and if you absolutely need iteration, convert to a PyArrow / Pandas / Numpy object first.