Describe the bug
I am porting the following code from pandas to cudf:
_Pandas:_
file_open = file_df[file_df['activity'] == 914788725] # 'File Open'
file_copy = file_df[file_df['activity'] == 779177968] # 'File Copy'
file_write = file_df[file_df['activity'] == -1294924958] # 'File Write'
file_delete = file_df[file_df['activity'] == 1520560660] # 'File Delete'
_cuDF:_
file_open = file_df.query("activity == 914788725") # 'File Open'
file_copy = file_df.query("activity == 779177968") # 'File Copy'
file_write = file_df.query("activity == -1294924958") # 'File Write'
file_delete = file_df.query("activity == 1520560660") # 'File Delete'
(Not sure if there is a better alternative to port that code from Pandas to cuDF)
The column named _activity_ contains some null values, and the following error is displayed:
AssertionError Traceback (most recent call last)
in
----> 1 file_open = file_df.query("activity == 914788725") # 'File Open'

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/dataframe.py in query(self, expr)
1644 newdf = DataFrame()
1645 for col in self.columns:
-> 1646 newseries = self[col][selected]
1647 newdf[col] = newseries
1648 result = newdf

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/series.py in __getitem__(self, arg)
217 elif arg.dtype in [np.bool, np.bool_]:
218 selvals, selinds = columnops.column_select_by_boolmask(
--> 219 self._column, arg)
220 index = self.index.take(selinds.to_gpu_array())
221 else:

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
107 """
108 from .numerical import NumericalColumn
--> 109 assert column.null_count == 0 # We don't properly handle the boolmask yet
110 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
111 indices = cudautils.arange(len(boolmask))

AssertionError:
Steps/Code to reproduce bug
Run this code:
test_df = cudf.DataFrame({'activity': [914788725, 779177968, -1294924958, None, 1520560660]})
file_open = test_df.query("activity == 914788725").shape[0]
Expected behavior
A clear and concise description of what you expected to happen.
Environment details (please complete the following information):
DGX-1, 0.6 branch
We see the same error if there are no nulls in the list.
Hi @thomcom ,
You are right. I have replaced the nulls with the sentinel value 1, and the error persists.
file_df["activity"] = file_df["activity"].fillna(1)
Regards,
Miguel
Hi @kkraus14,
I think this issue should be labeled as ’bug’ instead of ’feature request’.
It is also failing with columns without nulls.
Thanks!
Miguel
@thomcom @miguelangel The problem exists when any column in the dataframe has nulls. The workaround of fillna on all columns before query works for me.
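To illustrate why filling only the activity column does not help, here is a toy model in plain Python. It is not cuDF code; it only mimics what the traceback shows, namely that query applies the mask to every column and each selection asserts a null count of zero. The names toy_select_by_boolmask, toy_query and fill_all, and the extra "size" column, are made up for this sketch.

```python
def toy_select_by_boolmask(column, mask):
    # Mimics the assertion seen in the traceback: any null in the
    # column being filtered aborts the selection.
    assert all(v is not None for v in column), "column has nulls"
    return [v for v, keep in zip(column, mask) if keep]


def toy_query(frame, key, target):
    # Build the mask from one column, then apply it to EVERY column,
    # as the loop over self.columns in the traceback does.
    mask = [v == target for v in frame[key]]
    return {name: toy_select_by_boolmask(col, mask)
            for name, col in frame.items()}


def fill_all(frame, sentinel):
    # Workaround: replace nulls in all columns, not just the queried one.
    return {name: [sentinel if v is None else v for v in col]
            for name, col in frame.items()}


# 'activity' has already been filled, but the hypothetical 'size' column
# still contains a null.
frame = {"activity": [914788725, 779177968, 1], "size": [10, None, 30]}

try:
    toy_query(frame, "activity", 914788725)
    failed = False
except AssertionError:
    failed = True

result = toy_query(fill_all(frame, -1), "activity", 914788725)
```

In this model the first call fails because of the null in the other column, and the query only succeeds once every column has been filled.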
@shwina this should be dependent on the apply_boolean_mask supporting nulls which I believe you're already tackling.
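For reference, a minimal plain-Python sketch of what a null-aware boolean mask selection could look like, assuming a null in the mask means "do not select the row" (which matches the pandas result, where a comparison against a missing value is False). This is not the actual apply_boolean_mask implementation; toy_equals_scalar and toy_apply_boolean_mask are hypothetical names.

```python
def toy_equals_scalar(column, target):
    # Comparing a null with anything yields a null in the mask.
    return [None if v is None else v == target for v in column]


def toy_apply_boolean_mask(column, mask):
    # Keep a row only when its mask entry is exactly True;
    # False and null entries are both dropped.
    return [v for v, keep in zip(column, mask) if keep is True]


# Same data as the reproducer in the issue description.
activity = [914788725, 779177968, -1294924958, None, 1520560660]

mask = toy_equals_scalar(activity, 914788725)
selected = toy_apply_boolean_mask(activity, mask)
```

Under that assumption the reproducer would select exactly one row instead of raising.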
Fixed by #1956