When filtering a dataframe that contains null values, cuDF fails:
cdf[cdf.passenger_count > 2]
The above raises an AssertionError if cdf contains null values:
In [1]: import cudf
In [2]: !cat head-nyctaxi.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3
1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25
1,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,1,5.5,0,0.5,1,0,0.3,7.3
1,2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,1,N,236,75,1,6,0,0.5,1.7,0,0.3,8.5
2,2017-01-01 00:00:00,2017-01-01 00:00:00,1,.02,2,N,249,234,2,52,0,0.5,0,0,0.3,52.8
1,2017-01-01 00:00:02,2017-01-01 00:03:50,1,.50,1,N,48,48,2,4,0.5,0.5,0,0,0.3,5.3
2,2017-01-01 00:00:02,2017-01-01 00:39:22,4,7.75,1,N,186,36,1,22,0.5,0.5,4.66,0,0.3,27.96
1,2017-01-01 00:00:03,2017-01-01 00:06:58,1,.80,1,N,162,161,1,6,0.5,0.5,1.45,0,0.3,8.75
In [3]: cdf = cudf.read_csv('head-nyctaxi.csv')
In [4]: cdf.head().to_pandas()
Out[4]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 NaN NaT NaT -1 NaN -1 -1 -1 -1 -1 NaN NaN NaN NaN -1 NaN NaN
1 1.0 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.3 1 2313200 263 161 1 12.5 0.0 0.5 2.00 0 0.3 15.30
2 1.0 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.9 1 2313200 186 234 1 5.0 0.0 0.5 1.45 0 0.3 7.25
3 1.0 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.1 1 2313200 164 161 1 5.5 0.0 0.5 1.00 0 0.3 7.30
4 1.0 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.1 1 2313200 236 75 1 6.0 0.0 0.5 1.70 0 0.3 8.50
In [5]: cdf[cdf.passenger_count > 2]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-5-cb090dcb433a> in <module>
----> 1 cdf[cdf.passenger_count > 2]
~/GitRepos/cudf/python/cudf/dataframe/dataframe.py in __getitem__(self, arg)
241 index = self.index.take(selinds.to_gpu_array())
242 for col in self._cols:
--> 243 df[col] = Series(self._cols[col][arg], index=index)
244 df.set_index(index)
245 else:
~/GitRepos/cudf/python/cudf/dataframe/series.py in __getitem__(self, arg)
217 elif arg.dtype in [np.bool, np.bool_]:
218 selvals, selinds = columnops.column_select_by_boolmask(
--> 219 self._column, arg)
220 index = self.index.take(selinds.to_gpu_array())
221 else:
~/GitRepos/cudf/python/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
107 """
108 from .numerical import NumericalColumn
--> 109 assert column.null_count == 0 # We don't properly handle the boolmask yet
110 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
111 indices = cudautils.arange(len(boolmask))
AssertionError:
I briefly chatted with @kkraus14 about this and Null/NA values are special-cased. Note: Pandas does not have a concept of null values and instead uses np.nan.
@kkraus14 also suggested I look at what DBs do when filtering nulls; below is an example.
TL;DR: when filtering on a column that contains nulls, the null rows are excluded from the result of any binary operation:
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
conda install postgresql
initdb -D ~/tmp-postgres-data
postgres -D ~/tmp-postgres-data
createuser --superuser postgres
CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25) ,
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
VALUES
(1, 'James Dean', 32, 'Detroit', 200.0),
(2, 'James McGhan', 33, 'Los Angeles', 300.0),
(3, 'Empty Salary', 33, 'New York', NULL);
postgres=# select * from CUSTOMERS;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
3 | Empty Salary | 33 | New York |
(3 rows)
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
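The SQL behavior above can be mimicked in plain Python with an explicit null check (a sketch for illustration only, not cudf code): the comparison never evaluates to TRUE for a NULL salary, so those rows are simply excluded.

```python
# Sketch: reproduce the SQL null-filtering semantics shown above.
# A NULL salary is modeled with Python's None.
salaries = [200.0, 300.0, None]

# In SQL, NULL > 1.0 evaluates to NULL, and WHERE keeps only TRUE rows,
# so the NULL row is excluded from the result.
selected = [i for i, s in enumerate(salaries) if s is not None and s > 1.0]
print(selected)  # [0, 1] -- the NULL salary row is dropped
```

This matches the postgres output: three rows in the table, two rows returned by the filtered query.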
@quasiben I'm also curious what the behavior would be on a float column in the database with both a NaN and a NULL with the query as something like:
select (salary > 1.0) as test_col from CUSTOMERS
cc @jrhemstad for libcudf visibility
@kkraus14 does this relate to the bitmask functions you want libcudf to implement?
@jrhemstad this would be specifically related to binary / unary ops with regards to the bitmask.
postgres=# INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
postgres-# VALUES
postgres-# (1, 'James Dean', 32, 'Detroit', 200.0),
postgres-# (2, 'James McGhan', 33, 'Los Angeles', 300.0),
postgres-# (3, 'Empty Salary', 33, 'New York', NULL),
postgres-# (4, 'Nan Salary', 33, 'Santa Clara', 0.0);
INSERT 0 4
postgres=# update customers set salary = 'NaN' where ID=4;
UPDATE 1
postgres=# select * from customers
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
3 | Empty Salary | 33 | New York |
4 | Nan Salary | 33 | Santa Clara | NaN
(4 rows)
postgres=# select * from CUSTOMERS where salary > 1.0;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
4 | Nan Salary | 33 | Santa Clara | NaN
(3 rows)
Note: In most implementations of the "not-a-number" concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
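Worth noting: Python and NumPy floats follow IEEE 754 here, not the PostgreSQL rule, so every comparison involving NaN is false and a NaN row would be dropped by `salary > 1.0` rather than kept as in the postgres output above. A quick check:

```python
# IEEE 754 semantics: NaN compares false against everything, including itself.
nan = float('nan')
print(nan > 1.0)   # False
print(nan == nan)  # False

# PostgreSQL instead treats NaN values as equal to each other and greater
# than all non-NaN values, which is why row 4 survived the query above.
```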
I also ran into this and came up with this reproducer before I found that the issue had already been raised. Pasting it here anyway in case it's useful for reproduction or a test.
import cudf
df = cudf.DataFrame({'x': [1, 2, None, 4], 'y': [1, 2, 3, 4]})
df[df.x > 2]
@mrocklin what would your expected output here be? Would null > 2 return false? Would null < 2 return false? Would null > 2 return null? If selecting by boolean mask I assume treat null as False?
I don't know. I would defer to whatever databases do. It looks like @quasiben may have some ideas above.
This is related to #902 which will allow us to move the boolean masking functionality to the libcudf side.
It's starting to look like "better null support" might be a theme for 0.7
If cudf-py is trying to be a drop-in replacement for pandas, then I would say it should do the pandas thing:
In [4]: import pandas as pd
...: df = pd.DataFrame({'x': [1, 2, None, 4], 'y': [1, 2, 3, 4]})
...: df[df.x > 2]
...:
Out[4]:
x y
3 4.0 4
In [5]: df.x.sum()
Out[5]: 7.0
I think that we can deviate from Pandas from time to time. Null handling is something that Pandas perhaps sees as a wart, and they seem to be moving towards nullable columns in 0.24. I suggest that we defer to SQL semantics here if possible.
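Concretely, deferring to SQL would mean three-valued logic for comparisons. A sketch using Python's None for NULL (a hypothetical helper, not a proposed cudf API) answers the earlier questions: null > 2 yields null, not False, and selection then treats null like False:

```python
def sql_gt(a, b):
    """SQL semantics: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a > b

print(sql_gt(None, 2))  # None -- NULL > 2 is NULL, not False
print(sql_gt(3, 2))     # True

# WHERE keeps only rows where the predicate is TRUE, so NULL behaves
# like False at selection time even though it is not equal to False.
rows = [1, 2, None, 4]
print([r for r in rows if sql_gt(r, 2) is True])  # [4]
```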
This is related to #902 which will allow us to move the boolean masking functionality to the libcudf side.
Note that you can _already_ do this, #902 simply is an optimization (though likely will rename the API...)
apply_boolean_mask will not make it into 0.7, so moving this to 0.8.
apply_boolean_mask was merged into 0.8; reassigning to @shwina since he's been poking at the relevant Python / Cython bits.