Cudf: [BUG] tail method sometimes fails

Created on 7 Aug 2019 · 10 Comments · Source: rapidsai/cudf

After running a query, I get a data frame `ans`. The head method works fine on it, but the tail method fails. This happens rarely and depends strongly on the data. Using 0.8.0+0.g8fa7bd3.dirty.

>>> ans = x.groupby(['id1'],as_index=False).agg({'v1':'sum'}).reset_index(drop=True)
>>> print(ans.head(3), flush=True)
     id1        v1
0  id001  15006850
1  id002  14994166
>>> print(ans.tail(3), flush=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 491, in __str__
    return self.to_string(nrows=nrows, ncols=ncols)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 480, in to_string
    cols[h] = self[h].values_to_string(nrows=nrows)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in values_to_string
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in <listcomp>
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 302, in __getitem__
    return self._column.element_indexing(arg)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 412, in element_indexing
    val = self.data[index]  # this can raise IndexError
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/buffer.py", line 149, in __getitem__
    return item.view(self.dtype)
AttributeError: 'NoneType' object has no attribute 'view'
>>> ans.dtypes
id1    object
v1      int64
dtype: object

I can provide a reproducible example, but it will not be minimal... The one provided in https://github.com/rapidsai/cudf/issues/2494#issue-478107791 might work after changing K=2.

bug cuDF (Python)

jangorecki

All 10 comments

@jangorecki this should be fixed in the latest nightlies. This was due to nulls being improperly handled as Python None objects as opposed to numpy scalars.

kkraus14 on 16 Aug 2019
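
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not cuDF's actual code: `FakeScalar`, `get_item`, and `storage` are hypothetical stand-ins, with `FakeScalar` mimicking only the `.view()` method of a NumPy scalar. It shows why a null that surfaces as Python `None` produces the `AttributeError` seen in the first traceback.

```python
class FakeScalar:
    """Hypothetical stand-in for a NumPy scalar; mimics only .view()."""

    def __init__(self, value):
        self.value = value

    def view(self, dtype):
        # A real NumPy scalar reinterprets its bytes; here we simply convert.
        return dtype(self.value)


def get_item(storage, index, dtype):
    # Same shape as `return item.view(self.dtype)` in the traceback:
    # it assumes every element has a .view() method.
    item = storage[index]
    return item.view(dtype)


# None plays the role of a null that was not wrapped in a scalar type.
storage = [FakeScalar(1), None]

print(get_item(storage, 0, int))

try:
    get_item(storage, 1, int)
except AttributeError as exc:
    print("AttributeError:", exc)
```

The second lookup fails because `None` has no `view` attribute, which matches the error message in the report.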

@kkraus14
I don't think the issue is fixed. In 0.8.0 it was also raising a segfault.
After upgrading to 0.9.0 I have not hit a segfault so far, but printing the tail still raises an exception.
https://github.com/h2oai/db-benchmark/issues/102

Traceback (most recent call last):
  File "./cudf/groupby-cudf.py", line 56, in <module>
    print(ans.tail(3), flush=True)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 553, in __str__
    return self.to_string()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 550, in to_string
    return self.__repr__()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 591, in __repr__
    output = self.get_renderable_dataframe()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 582, in get_renderable_dataframe
    output._cols[col].astype("str").str.fillna("null")
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 1383, in astype
    return self._copy_construct(data=self._column.astype(dtype, **kwargs))
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/columnops.py", line 137, in astype
    return self.as_string_column(dtype, **kwargs)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/numerical.py", line 129, in as_string_column
    np.dtype(dev_array.dtype)
KeyError: dtype('O')

jangorecki on 25 Aug 2019

@jangorecki On 0.9, does it always fail, or does it depend on the generated data?

rgsl888prabhu on 19 Sep 2019

@rgsl888prabhu It depends on the data. Among 4 different cases of the cardinality factor ("K"), the issue manifests in only one. You can generate the exact data that causes the problem by following the initial instructions.

jangorecki on 20 Sep 2019

@jangorecki I tried to reproduce this using 0.9, but I wasn't able to. If you have the .csv file that reproduces it, please share it. Meanwhile, I will try to figure out the issue and reproduce it on my end.

rgsl888prabhu on 20 Sep 2019

@rgsl888prabhu I have the csv, but it is 45 GB in size.
The csv was generated from a script, so it makes sense to run the script to produce the same csv rather than share a 45 GB file.

jangorecki on 20 Sep 2019

Do you remember the random seed that you set? I don't see it in the script.

rgsl888prabhu on 20 Sep 2019

There is a random seed set in the script:

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e9 2 0 0

jangorecki on 20 Sep 2019

Thank you @jangorecki, I am able to reproduce the scenario.

rgsl888prabhu on 20 Sep 2019

Simplified code to reproduce

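import cudf

# np.object was only an alias for the builtin object (removed in NumPy 1.24), so use object directly.
id1 = cudf.Series(['a', 'b'], dtype=object)
v1 = cudf.Series([1, 2])
s = cudf.DataFrame()
s['id1'] = id1
s['v1'] = v1
# tail(3) on a two-row frame; printing the result is the step that failed.
print(s.tail(3))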

rgsl888prabhu on 23 Sep 2019
