After running a query I get the frame ans. Calling head on it works fine, but printing the result of tail fails. This happens rarely and depends strongly on the data. Using cudf 0.8.0+0.g8fa7bd3.dirty.
>>> ans = x.groupby(['id1'],as_index=False).agg({'v1':'sum'}).reset_index(drop=True)
>>> print(ans.head(3), flush=True)
id1 v1
0 id001 15006850
1 id002 14994166
>>> print(ans.tail(3), flush=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 491, in __str__
return self.to_string(nrows=nrows, ncols=ncols)
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 480, in to_string
cols[h] = self[h].values_to_string(nrows=nrows)
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in values_to_string
out = [str(v) for v in values]
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in <listcomp>
out = [str(v) for v in values]
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 302, in __getitem__
return self._column.element_indexing(arg)
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 412, in element_indexing
val = self.data[index] # this can raise IndexError
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/buffer.py", line 149, in __getitem__
return item.view(self.dtype)
AttributeError: 'NoneType' object has no attribute 'view'
>>> ans.dtypes
id1 object
v1 int64
dtype: object
I can provide a reproducible example, but it will not be minimal. The one provided in https://github.com/rapidsai/cudf/issues/2494#issue-478107791 might work after setting K=2.
@jangorecki this should be fixed in the latest nightlies. This was due to nulls being improperly handled as Python None objects as opposed to numpy scalars.
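To illustrate the explanation above, here is a minimal sketch in plain NumPy (not cuDF code). The helper name view_as is hypothetical; it only imitates the `item.view(self.dtype)` call from the first traceback to show why a numpy scalar works there while a Python None does not.

```python
import numpy as np


def view_as(item, dtype):
    """Hypothetical stand-in for the failing `item.view(self.dtype)` call."""
    return item.view(dtype)


# A numpy scalar has a .view method, so reinterpreting it succeeds.
scalar = np.int64(15006850)
ok = view_as(scalar, np.int64)

# A null surfaced as a plain Python None has no .view method.
try:
    view_as(None, np.int64)
    none_error = None
except AttributeError as exc:
    none_error = str(exc)

print(ok)
print(none_error)
```

The second print shows the same message as the last line of the first traceback.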
@kkraus14
I don't think the issue is fixed. In 0.8.0 it was also raising a segfault.
After upgrading to 0.9.0 I have not seen a segfault so far, but printing the tail still raises an exception.
https://github.com/h2oai/db-benchmark/issues/102
Traceback (most recent call last):
File "./cudf/groupby-cudf.py", line 56, in <module>
print(ans.tail(3), flush=True)
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 553, in __str__
return self.to_string()
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 550, in to_string
return self.__repr__()
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 591, in __repr__
output = self.get_renderable_dataframe()
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 582, in get_renderable_dataframe
output._cols[col].astype("str").str.fillna("null")
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 1383, in astype
return self._copy_construct(data=self._column.astype(dtype, **kwargs))
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/columnops.py", line 137, in astype
return self.as_string_column(dtype, **kwargs)
File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/numerical.py", line 129, in as_string_column
np.dtype(dev_array.dtype)
KeyError: dtype('O')
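For readers unfamiliar with this error shape, the sketch below shows one way a KeyError: dtype('O') can arise: a lookup table keyed by NumPy dtype that has no entry for the object dtype. This is only an assumption about the general pattern. The table and function names are hypothetical and are not taken from cuDF's source.

```python
import numpy as np

# Hypothetical lookup table keyed by NumPy dtype (illustrative only,
# not cuDF's actual implementation).
TO_STRING_CONVERTERS = {
    np.dtype("int64"): lambda values: [str(v) for v in values],
}


def as_strings(values):
    """Convert values to strings via a per-dtype converter."""
    arr = np.asarray(values)
    # Raises KeyError when the dtype has no registered converter.
    converter = TO_STRING_CONVERTERS[np.dtype(arr.dtype)]
    return converter(arr)


# A numeric column has an entry in the table, so conversion works.
numeric = as_strings(np.array([1, 2], dtype="int64"))

# An object-dtype column (for example, strings) has no entry.
try:
    as_strings(np.array(["a", "b"], dtype=object))
    lookup_error = None
except KeyError as exc:
    lookup_error = repr(exc)

print(numeric)
print(lookup_error)
```

If something like this is what happens inside as_string_column, the object-dtype id1 column would be reaching a code path meant for numeric columns only.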
@jangorecki Does it always fail on 0.9, or does it depend on the generated data?
@rgsl888prabhu It depends on the data: among 4 different cases of the cardinality factor ("K"), the issue manifests in only one. You can generate the exact data that causes the problem by following the initial instructions.
@jangorecki I tried to reproduce using 0.9, but I wasn't able to do so. If you have the .csv file with which you can reproduce it, please share it. Meanwhile, I will try to figure out the issue and reproduce it on my end.
@rgsl888prabhu I have the csv, but it is 45 GB in size.
The csv was generated by a script, so it makes sense to run the script to produce the same csv rather than share a 45 GB file.
Do you remember the random seed that you set? I don't see it in the script.
There is a random seed set in the script:
wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e9 2 0 0
Thank you @jangorecki, I am able to reproduce the scenario.
Simplified code to reproduce:
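import cudf

# String column with object dtype (the builtin object replaces the
# deprecated np.object alias, which newer NumPy releases no longer provide)
id1 = cudf.Series(['a', 'b'], dtype=object)
v1 = cudf.Series([1, 2])

s = cudf.DataFrame()
s['id1'] = id1
s['v1'] = v1

# Request more rows than the frame holds; printing the result raises
print(s.tail(3))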
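For comparison, the result one would expect can be sketched with pandas, assuming cuDF is meant to mirror pandas semantics for tail (an assumption, not something confirmed in this thread). Asking for more rows than the frame holds simply returns all rows, and printing works.

```python
import pandas as pd

# Same two-row frame as in the reproducer, built with pandas instead of cudf.
df = pd.DataFrame({
    "id1": pd.Series(["a", "b"], dtype=object),
    "v1": [1, 2],
})

# tail(3) on a two-row frame returns both rows rather than failing.
tail = df.tail(3)
text = str(tail)
print(text)
```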