Cudf: [BUG] apply_rows returns nulls when any column contains null

Created on 4 Sep 2019 · 5 Comments · Source: rapidsai/cudf

Describe the bug
When using apply_rows, if any column in a row contains None, the output for that row is null, even when the column holding the None is not used in the calculation.

Steps/Code to reproduce bug
Here's a minimal example:

import cudf
import numpy as np
import pandas as pd

gpu_df = cudf.from_pandas(
    pd.DataFrame({
          'id': [0,1,2,3,4,5,6,7]
        , 'non_missing': [0,9,3,1,6,4,1,3]
        , 'missing': [None, 23,1,4,None,3,4,4]
    })
)

def multiplyNonMissing(non_missing, multiplied):
    for i, nm in enumerate(non_missing):
        multiplied[i] = nm * 3

gpu_df.apply_rows(
    multiplyNonMissing
    , incols = ['non_missing']
    , outcols = {'multiplied': np.int64}
    , kwargs={}
)

The above code returns the following table:

id | non_missing | missing | multiplied
-- | -- | -- | --
0 | 0 | null | null
1 | 9 | 23.0 | 27
2 | 3 | 1.0 | 9
3 | 1 | 4.0 | 3
4 | 6 | null | null
5 | 4 | 3.0 | 12
6 | 1 | 4.0 | 3
7 | 3 | 4.0 | 9

Expected behavior
The above code should return the following table:

id | non_missing | missing | multiplied
-- | -- | -- | --
0 | 0 | null | 0
1 | 9 | 23.0 | 27
2 | 3 | 1.0 | 9
3 | 1 | 4.0 | 3
4 | 6 | null | 18
5 | 4 | 3.0 | 12
6 | 1 | 4.0 | 3
7 | 3 | 4.0 | 9
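
As a CPU cross-check of the expected table, here is a plain-Python sketch (no cuDF involved) that applies the same per-row arithmetic as multiplyNonMissing. The list names simply mirror the columns above and are only illustrative. The point is that `missing` never enters the computation, so its nulls should not affect `multiplied`.

```python
# Column data copied from the reproduction above.
non_missing = [0, 9, 3, 1, 6, 4, 1, 3]
missing = [None, 23, 1, 4, None, 3, 4, 4]  # deliberately unused below

# Same arithmetic as the kernel body: multiplied[i] = nm * 3
multiplied = [nm * 3 for nm in non_missing]

print(multiplied)
```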

Environment overview (please complete the following information)

  • Environment location: Docker image rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04, hash 25ad7525b9c1
  • Method of cuDF install: Docker
    docker pull rapidsai/rapidsai:cuda10.0-runtime-ubuntu18.04
    docker run -d --runtime=nvidia -it -v /home/tom/Documents/programming/:/rapids/notebooks/host_folder -p 8888:8888 -p 8787:8787 -p 8786:8786 25ad75
Labels: bug, cuDF (Python)

All 5 comments

cc @cwharris Are we erroneously checking all of the columns instead of just the columns passed to the kernel?

I can reproduce this. I'll set up a unit test and see what's going on. Here's some relevant code. I'm not sure what's going on, but I'll find out.

def make_aggregate_nullmask(df, columns=None, op="and"):
    out_mask = None
    for k in columns or df.columns:
        if not df[k].has_null_mask:
            continue
    ...

Have been looking into some unrelated aspects of the apply_rows code and saw this, @cwharris.

I believe the loop through all columns by default is the cause of this behavior.

This:

https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L84-L88

Combined with:

https://github.com/rapidsai/cudf/blob/8e9789586f4cd9a201a24efce103ad178b8d268a/python/cudf/cudf/utils/applyutils.py#L128-L131

Results in the aggregate mask including every column in the original dataframe that has a nullmask, not just the input columns. As a result, I suspect changing the function call to pass columns=self.incols (or filtering the dataframe before passing it) may resolve this issue.
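
To illustrate the mechanism described in this comment, here is a minimal pure-Python sketch. It is not cuDF's actual implementation: `aggregate_validity` and the dict-based `frame` are hypothetical stand-ins, each column's null mask is modeled as a list of booleans (True means valid), and combining masks with a logical AND is an assumption based on the `op="and"` default in the snippet quoted earlier in the thread.

```python
# Hypothetical model of the behavior; True = valid, False = null. Not cuDF code.
frame = {
    "id":          [0, 1, 2, 3, 4, 5, 6, 7],
    "non_missing": [0, 9, 3, 1, 6, 4, 1, 3],
    "missing":     [None, 23, 1, 4, None, 3, 4, 4],
}


def aggregate_validity(frame, columns=None):
    """AND together the validity of the chosen columns, row by row.

    Mirrors the `columns or df.columns` fallback: when `columns` is
    None (or empty), every column in the frame contributes.
    """
    names = columns or list(frame)
    n_rows = len(next(iter(frame.values())))
    valid = [True] * n_rows
    for name in names:
        for i, value in enumerate(frame[name]):
            valid[i] = valid[i] and value is not None
    return valid


# Default call: all columns contribute, so nulls in 'missing' reach the output mask.
mask_all = aggregate_validity(frame)

# Restricting to the kernel's input columns leaves every row valid.
mask_incols = aggregate_validity(frame, columns=["non_missing"])

print(mask_all)
print(mask_incols)
```

In this model, rows 0 and 4 are masked out only when the unused `missing` column is included, which matches the nulls seen in the reported output.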

@beckernick yup! I wrote the code and forgot to use it. :sweat_smile:

Resolved by #2749.

import cudf
import numpy as np
import pandas as pd

gpu_df = cudf.from_pandas(
    pd.DataFrame({
          'id': [0,1,2,3,4,5,6,7]
        , 'non_missing': [0,9,3,1,6,4,1,3]
        , 'missing': [None, 23,1,4,None,3,4,4]
    })
)

def multiplyNonMissing(non_missing, multiplied):
    for i, nm in enumerate(non_missing):
        multiplied[i] = nm * 3

print(gpu_df.apply_rows(
    multiplyNonMissing
    , incols = ['non_missing']
    , outcols = {'multiplied': np.int64}
    , kwargs={}
))
   id  non_missing missing  multiplied
0   0            0    null           0
1   1            9    23.0          27
2   2            3     1.0           9
3   3            1     4.0           3
4   4            6    null          18
5   5            4     3.0          12
6   6            1     4.0           3
7   7            3     4.0           9

Closing.
