Cudf: [BUG] tail method sometimes fails

Created on 7 Aug 2019 · 10 Comments · Source: rapidsai/cudf

After running a query, I get a frame, ans. The head method works fine on it, but the tail method fails. This happens rarely and depends strongly on the data. Using 0.8.0+0.g8fa7bd3.dirty.

>>> ans = x.groupby(['id1'],as_index=False).agg({'v1':'sum'}).reset_index(drop=True)
>>> print(ans.head(3), flush=True)
     id1        v1
0  id001  15006850
1  id002  14994166
>>> print(ans.tail(3), flush=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 491, in __str__
    return self.to_string(nrows=nrows, ncols=ncols)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 480, in to_string
    cols[h] = self[h].values_to_string(nrows=nrows)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in values_to_string
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in <listcomp>
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 302, in __getitem__
    return self._column.element_indexing(arg)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 412, in element_indexing
    val = self.data[index]  # this can raise IndexError
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/buffer.py", line 149, in __getitem__
    return item.view(self.dtype)
AttributeError: 'NoneType' object has no attribute 'view'
>>> ans.dtypes
id1    object
v1      int64
dtype: object

I can provide a reproducible example, but it will not be minimal. The one provided in https://github.com/rapidsai/cudf/issues/2494#issue-478107791 might work after setting K=2.

bug cuDF (Python)

All 10 comments

@jangorecki this should be fixed in the latest nightlies. This was due to nulls being improperly handled as Python None objects as opposed to numpy scalars.
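
The difference described here can be illustrated with a minimal sketch using NumPy only. The helper name reinterpret is hypothetical and merely mimics the item.view(self.dtype) call from the traceback above; it is not cuDF's actual code.

```python
import numpy as np


def reinterpret(item, dtype):
    """Hypothetical stand-in for the failing call: item.view(dtype)."""
    return item.view(dtype)


# A NumPy scalar has a .view method, so the call succeeds.
ok = reinterpret(np.int64(7), np.int64)

# A plain Python None has no .view method, so the same call raises
# AttributeError, matching the last line of the traceback.
try:
    reinterpret(None, np.int64)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

If an element lookup returns None for a null entry instead of a NumPy scalar, any later .view call on it fails in this way.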

@kkraus14
I don't think the issue is fixed. In 0.8.0 it also raised a segfault.
After upgrading to 0.9.0 I have not seen a segfault so far, but printing the tail still raises an exception.
https://github.com/h2oai/db-benchmark/issues/102

Traceback (most recent call last):
  File "./cudf/groupby-cudf.py", line 56, in <module>
    print(ans.tail(3), flush=True)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 553, in __str__
    return self.to_string()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 550, in to_string
    return self.__repr__()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 591, in __repr__
    output = self.get_renderable_dataframe()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 582, in get_renderable_dataframe
    output._cols[col].astype("str").str.fillna("null")
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 1383, in astype
    return self._copy_construct(data=self._column.astype(dtype, **kwargs))
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/columnops.py", line 137, in astype
    return self.as_string_column(dtype, **kwargs)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/numerical.py", line 129, in as_string_column
    np.dtype(dev_array.dtype)
KeyError: dtype('O')

@jangorecki Does it always fail on 0.9, or does it depend on the generated data?

@rgsl888prabhu It depends on the data. Among 4 different cases of the cardinality factor ("K"), the issue manifests in only one. You can generate the exact data that causes the problem by following the initial instructions.

@jangorecki I tried to reproduce using 0.9, but I wasn't able to do so. If you have that .csv file through which you can reproduce, please share it. Meanwhile, I will try to figure out the issue and reproduce it from my end.

@rgsl888prabhu I have the CSV, but it is 45 GB in size.
The CSV was generated by a script, so it makes sense to run the script to produce the same CSV rather than sharing a 45 GB file.

Do you remember the random seed that you set? I don't see it in the script.

There is a random seed set in the script:

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e9 2 0 0

Thank you @jangorecki, I am able to reproduce the scenario.

Simplified code to reproduce

import cudf
import numpy as np
id1 = cudf.Series(['a', 'b'], dtype=np.object)
v1 = cudf.Series([1,2])
s = cudf.DataFrame()
s['id1'] = id1
s['v1'] = v1
print(s.tail(3))
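
For comparison, here is the same two-row frame built in pandas, whose API cuDF mirrors. This is only a sketch of the expected result when tail is asked for more rows than exist. It does not describe how cuDF fixed the problem.

```python
import pandas as pd

# Same data as the reproducer above, built with pandas instead of cuDF.
df = pd.DataFrame({"id1": ["a", "b"], "v1": [1, 2]})

# Asking for more rows than the frame holds returns all available rows.
last = df.tail(3)
print(last)
```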