Describe the bug
When using apply_rows, if any column in a row contains None, the output for that row is null, even when that column is not used in the calculation.
Steps/Code to reproduce bug
Here's a minimal example:
import cudf
import numpy as np
import pandas as pd

gpu_df = cudf.from_pandas(
    pd.DataFrame({
        'id': [0,1,2,3,4,5,6,7]
        , 'non_missing': [0,9,3,1,6,4,1,3]
        , 'missing': [None, 23,1,4,None,3,4,4]
    })
)

def multiplyNonMissing(non_missing, multiplied):
    for i, nm in enumerate(non_missing):
        multiplied[i] = nm * 3

gpu_df.apply_rows(
    multiplyNonMissing
    , incols = ['non_missing']
    , outcols = {'multiplied': np.int64}
    , kwargs={}
)
The above code returns the following table:
id | non_missing | missing | multiplied
-- | -- | -- | --
0 | 0 | null | null
1 | 9 | 23.0 | 27
2 | 3 | 1.0 | 9
3 | 1 | 4.0 | 3
4 | 6 | null | null
5 | 4 | 3.0 | 12
6 | 1 | 4.0 | 3
7 | 3 | 4.0 | 9
Expected behavior
The above code should return the following table:
id | non_missing | missing | multiplied
-- | -- | -- | --
0 | 0 | null | 0
1 | 9 | 23.0 | 27
2 | 3 | 1.0 | 9
3 | 1 | 4.0 | 3
4 | 6 | null | 18
5 | 4 | 3.0 | 12
6 | 1 | 4.0 | 3
7 | 3 | 4.0 | 9
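The expected `multiplied` column can be cross-checked on the CPU without cudf. This is a minimal sketch in plain Python using the same input data as the reproducer; the name `expected_multiplied` is introduced here for illustration only.

```python
# Same input data as the reproducer above.
non_missing = [0, 9, 3, 1, 6, 4, 1, 3]
missing = [None, 23, 1, 4, None, 3, 4, 4]

# The kernel reads only `non_missing`, so the None entries in `missing`
# should have no effect on the result.
expected_multiplied = [nm * 3 for nm in non_missing]

print(expected_multiplied)  # [0, 27, 9, 3, 18, 12, 3, 9]
```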
Environment overview (please complete the following information)
docker pull rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
docker run -d --runtime=nvidia -it -v /home/tom/Documents/programming/:/rapids/notebooks/host_folder -p 8888:8888 -p 8787:8787 -p 8786:8786 25ad75cc
@cwharris Are we erroneously checking all of the columns instead of just the columns passed to the kernel?
I can reproduce this. I'll set up a unit test and see what's going on. Here's some relevant code. I'm not sure what's going on, but I'll find out.
def make_aggregate_nullmask(df, columns=None, op="and"):
    out_mask = None
    for k in columns or df.columns:
        if not df[k].has_null_mask:
            continue
        ...
Have been looking into some unrelated aspects of the apply_rows code and saw this @cwharris .
I believe the loop through all columns by default is the cause of this behavior.
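This:
https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L84-L88
Combined with:
https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L128-L131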
Results in the aggregate mask including all columns in the original dataframe if they have a nullmask. As a result, I suspect changing the function call to include columns=self.incols (or filtering the dataframe before passing it) may resolve this issue.
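The effect described above can be illustrated without a GPU. The following is a simplified plain-Python stand-in, not cudf's implementation: `aggregate_validity` and the `validity` dictionary are hypothetical names, and each column is modeled as a list of booleans (True means the value is present) or None when the column has no null mask.

```python
# Illustrative sketch only; this is NOT cudf's code. It models the
# "loop through all columns by default" behavior with plain lists.

def aggregate_validity(validity_by_column, columns=None):
    """AND together the validity masks of the selected columns.

    Falls back to every column when `columns` is not given, mirroring
    the `columns or df.columns` default discussed above.
    """
    out = None
    for name in columns or validity_by_column:
        mask = validity_by_column[name]
        if mask is None:  # column has no null mask, nothing to combine
            continue
        if out is None:
            out = list(mask)
        else:
            out = [a and b for a, b in zip(out, mask)]
    return out

validity = {
    "id": None,
    "non_missing": None,
    "missing": [False, True, True, True, False, True, True, True],
}

# Default: every column is scanned, so the nulls in 'missing' leak into
# the output mask and rows 0 and 4 come back as null.
all_columns_mask = aggregate_validity(validity)

# Restricted to the kernel's input columns: no mask is produced, so no
# output row is nulled.
incols_mask = aggregate_validity(validity, columns=["non_missing"])

print(all_columns_mask)
print(incols_mask)
```

Under these assumptions, passing only the input columns leaves rows 0 and 4 valid, which matches the expected table in the report.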
@beckernick yup! I wrote the code and forgot to use it. :sweat_smile:
Resolved by #2749.
import cudf
import numpy as np
import pandas as pd

gpu_df = cudf.from_pandas(
    pd.DataFrame({
        'id': [0,1,2,3,4,5,6,7]
        , 'non_missing': [0,9,3,1,6,4,1,3]
        , 'missing': [None, 23,1,4,None,3,4,4]
    })
)

def multiplyNonMissing(non_missing, multiplied):
    for i, nm in enumerate(non_missing):
        multiplied[i] = nm * 3

print(gpu_df.apply_rows(
    multiplyNonMissing
    , incols = ['non_missing']
    , outcols = {'multiplied': np.int64}
    , kwargs={}
))
   id  non_missing  missing  multiplied
0   0            0     null           0
1   1            9     23.0          27
2   2            3      1.0           9
3   3            1      4.0           3
4   4            6     null          18
5   5            4      3.0          12
6   6            1      4.0           3
7   7            3      4.0           9
Closing.
Most helpful comment
Have been looking into some unrelated aspects of the apply_rows code and saw this @cwharris .
I believe the loop through all columns by default is the cause of this behavior.
This:
https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L84-L88
Combined with:
https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L128-L131
Results in the aggregate mask including all columns in the original dataframe if they have a nullmask. As a result, I suspect changing the function call to include columns=self.incols (or filtering the dataframe before passing it) may resolve this issue.