Cudf: [FEA] Support nulls in `query`

Created on 14 Feb 2019 · 6 Comments · Source: rapidsai/cudf

Describe the bug
I am porting the following code from pandas to cudf:

_Pandas:_

file_open = file_df[file_df['activity'] == 914788725] # 'File Open'
file_copy = file_df[file_df['activity'] == 779177968] # 'File Copy'
file_write = file_df[file_df['activity'] == -1294924958] # 'File Write'
file_delete = file_df[file_df['activity'] == 1520560660] # 'File Delete'

_cuDF:_

file_open = file_df.query("activity == 914788725") # 'File Open'
file_copy = file_df.query("activity == 779177968") # 'File Copy'
file_write = file_df.query("activity == -1294924958") # 'File Write'
file_delete = file_df.query("activity == 1520560660") # 'File Delete'

(Not sure if there is a better alternative to port that code from Pandas to cuDF)

The column named _activity_ contains some null values, and the following error is displayed:
AssertionError Traceback (most recent call last)
in
----> 1 file_open = file_df.query("activity == 914788725") # 'File Open'

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/dataframe.py in query(self, expr)
1644 newdf = DataFrame()
1645 for col in self.columns:
-> 1646 newseries = self[col][selected]
1647 newdf[col] = newseries
1648 result = newdf

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/series.py in __getitem__(self, arg)
217 elif arg.dtype in [np.bool, np.bool_]:
218 selvals, selinds = columnops.column_select_by_boolmask(
--> 219 self._column, arg)
220 index = self.index.take(selinds.to_gpu_array())
221 else:

/conda/envs/rapids/lib/python3.6/site-packages/cudf-0.6.0.dev0+407.g4584c136.dirty-py3.6-linux-x86_64.egg/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
107 """
108 from .numerical import NumericalColumn
--> 109 assert column.null_count == 0 # We don't properly handle the boolmask yet
110 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
111 indices = cudautils.arange(len(boolmask))

AssertionError:

Steps/Code to reproduce bug
Run this code:

test_df = cudf.DataFrame({'activity': [914788725, 779177968, -1294924958, None, 1520560660]})

file_open = test_df.query("activity == 914788725").shape[0]

Expected behavior
A clear and concise description of what you expected to happen.
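For comparison, here is the same reproducer run through pandas. This is only a sketch of what the expected result presumably looks like, on the assumption that cuDF's `query` is meant to match pandas here: the null row fails the comparison and is dropped, with no exception raised.

```python
import pandas as pd

# Same data as the cuDF reproducer. In pandas the None becomes NaN,
# so the column is stored as float64.
test_df = pd.DataFrame(
    {"activity": [914788725, 779177968, -1294924958, None, 1520560660]}
)

# A comparison against a missing value evaluates to False,
# so the null row is excluded from the result.
file_open = test_df.query("activity == 914788725")
print(file_open.shape[0])
```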

Environment details (please complete the following information):
DGX-1, 0.6 branch

cuDF (Python) feature request

All 6 comments

We see the same error if there are no nulls in the list.

Hi @thomcom ,

You are right. I have replaced the nulls with the sentinel value 1, and the error persists.

file_df["activity"] = file_df["activity"].fillna(1)

Regards,
Miguel

Hi @kkraus14,

I think this issue should be labeled as 'bug' instead of 'feature request'.

It is also failing with columns without nulls.

Thanks!
Miguel

@thomcom @miguelangel The problem exists when any column in the dataframe has nulls. The workaround of fillna on all columns before query works for me.
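A minimal sketch of that workaround follows. It uses pandas so that it runs without a GPU; the assumption is that `fillna` with a scalar and `query` behave the same way on a cuDF DataFrame, as the comment above reports. The helper name `query_after_fillna` and the `user_id` column are hypothetical, added only for illustration.

```python
import pandas as pd  # stand-in for cudf in this sketch (assumption)


def query_after_fillna(df, expr, sentinel):
    """Fill nulls in every column with `sentinel`, then run `query`."""
    return df.fillna(sentinel).query(expr)


file_df = pd.DataFrame(
    {
        "activity": [914788725, 779177968, None, 1520560660],
        "user_id": [10, None, 30, 40],  # a second column that also has a null
    }
)

# Nulls in *all* columns are filled, not only in the queried column.
file_copy = query_after_fillna(file_df, "activity == 779177968", sentinel=1)
print(len(file_copy))
```

Note that the filled rows carry the sentinel in the result instead of a null, so the sentinel should be a value that cannot collide with anything being queried.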

@shwina this should be dependent on the apply_boolean_mask supporting nulls which I believe you're already tackling.
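To make "a boolean mask that supports nulls" concrete, here is a plain-Python sketch of the idea. It is not cuDF's implementation: the function name and the list-based validity mask are assumptions for illustration. The point is that the selection has to carry each row's validity flag along with its value instead of asserting that the column has no nulls.

```python
def apply_boolean_mask_with_nulls(values, valid, boolmask):
    """Keep rows where boolmask is True, preserving each row's validity flag."""
    out_values, out_valid = [], []
    for value, is_valid, keep in zip(values, valid, boolmask):
        if keep:
            out_values.append(value)
            out_valid.append(is_valid)
    return out_values, out_valid


# The column from the reproducer; the null slot holds a placeholder 0
# and is marked invalid in the validity mask.
values = [914788725, 779177968, -1294924958, 0, 1520560660]
valid = [True, True, True, False, True]

# Result of `activity == 914788725`: a null row compares as False.
boolmask = [is_valid and value == 914788725 for value, is_valid in zip(values, valid)]

selected_values, selected_valid = apply_boolean_mask_with_nulls(values, valid, boolmask)
print(selected_values, selected_valid)
```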

Fixed by #1956
