When filtering a dataframe that contains null values, cuDF fails:
cdf[cdf.passenger_count > 2]
The above raises an AssertionError if cdf contains null values:
In [1]: import cudf
In [2]: !cat head-nyctaxi.csv
VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
1,2017-01-09 11:13:28,2017-01-09 11:25:45,1,3.30,1,N,263,161,1,12.5,0,0.5,2,0,0.3,15.3
1,2017-01-09 11:32:27,2017-01-09 11:36:01,1,.90,1,N,186,234,1,5,0,0.5,1.45,0,0.3,7.25
1,2017-01-09 11:38:20,2017-01-09 11:42:05,1,1.10,1,N,164,161,1,5.5,0,0.5,1,0,0.3,7.3
1,2017-01-09 11:52:13,2017-01-09 11:57:36,1,1.10,1,N,236,75,1,6,0,0.5,1.7,0,0.3,8.5
2,2017-01-01 00:00:00,2017-01-01 00:00:00,1,.02,2,N,249,234,2,52,0,0.5,0,0,0.3,52.8
1,2017-01-01 00:00:02,2017-01-01 00:03:50,1,.50,1,N,48,48,2,4,0.5,0.5,0,0,0.3,5.3
2,2017-01-01 00:00:02,2017-01-01 00:39:22,4,7.75,1,N,186,36,1,22,0.5,0.5,4.66,0,0.3,27.96
1,2017-01-01 00:00:03,2017-01-01 00:06:58,1,.80,1,N,162,161,1,6,0.5,0.5,1.45,0,0.3,8.75
In [3]: cdf = cudf.read_csv('head-nyctaxi.csv')
In [4]: cdf.head().to_pandas()
Out[4]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 NaN NaT NaT -1 NaN -1 -1 -1 -1 -1 NaN NaN NaN NaN -1 NaN NaN
1 1.0 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.3 1 2313200 263 161 1 12.5 0.0 0.5 2.00 0 0.3 15.30
2 1.0 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.9 1 2313200 186 234 1 5.0 0.0 0.5 1.45 0 0.3 7.25
3 1.0 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.1 1 2313200 164 161 1 5.5 0.0 0.5 1.00 0 0.3 7.30
4 1.0 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.1 1 2313200 236 75 1 6.0 0.0 0.5 1.70 0 0.3 8.50
In [5]: cdf[cdf.passenger_count > 2]
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-5-cb090dcb433a> in <module>
----> 1 cdf[cdf.passenger_count > 2]
~/GitRepos/cudf/python/cudf/dataframe/dataframe.py in __getitem__(self, arg)
241 index = self.index.take(selinds.to_gpu_array())
242 for col in self._cols:
--> 243 df[col] = Series(self._cols[col][arg], index=index)
244 df.set_index(index)
245 else:
~/GitRepos/cudf/python/cudf/dataframe/series.py in __getitem__(self, arg)
217 elif arg.dtype in [np.bool, np.bool_]:
218 selvals, selinds = columnops.column_select_by_boolmask(
--> 219 self._column, arg)
220 index = self.index.take(selinds.to_gpu_array())
221 else:
~/GitRepos/cudf/python/cudf/dataframe/columnops.py in column_select_by_boolmask(column, boolmask)
107 """
108 from .numerical import NumericalColumn
--> 109 assert column.null_count == 0 # We don't properly handle the boolmask yet
110 boolbits = cudautils.compact_mask_bytes(boolmask.to_gpu_array())
111 indices = cudautils.arange(len(boolmask))
AssertionError:
I briefly chatted with @kkraus14 about this and Null/NA values are special-cased. Note: Pandas does not have a concept of null values and instead uses np.nan.
@kkraus14 also suggested I look at what DBs do when filtering nulls; below is an example.
TL;DR: when filtering on a column that contains nulls, the null rows are excluded from the result of any binary operation:
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
conda install postgresql
initdb -D ~/tmp-postgres-data
postgres -D ~/tmp-postgres-data
createuser --superuser postgres
CREATE TABLE CUSTOMERS(
ID INT NOT NULL,
NAME VARCHAR (20) NOT NULL,
AGE INT NOT NULL,
ADDRESS CHAR (25) ,
SALARY DECIMAL (18, 2),
PRIMARY KEY (ID)
);
INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
VALUES
(1, 'James Dean', 32, 'Detroit', 200.0),
(2, 'James McGhan', 33, 'Los Angeles', 300.0),
(3, 'Empty Salary', 33, 'New York', NULL);
postgres=# select * from CUSTOMERS;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
3 | Empty Salary | 33 | New York |
(3 rows)
postgres=# select * from CUSTOMERS where salary > 1.0
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
(2 rows)
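The SQL behavior above can be mimicked in plain Python with an explicit null check (a sketch for illustration only, not cudf code): the comparison never evaluates to TRUE for a NULL salary, so those rows are simply excluded.

```python
# Sketch: reproduce the SQL null-filtering semantics shown above.
# A NULL salary is modeled with Python's None.
salaries = [200.0, 300.0, None]

# In SQL, NULL > 1.0 evaluates to NULL, and WHERE keeps only TRUE rows,
# so the NULL row is excluded from the result.
selected = [i for i, s in enumerate(salaries) if s is not None and s > 1.0]
print(selected)  # [0, 1] -- the NULL salary row is dropped
```

This matches the postgres output: three rows in the table, two rows returned by the filtered query.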
@quasiben I'm also curious what the behavior would be on a float column in the database with both a NaN and a NULL with the query as something like:
select (salary > 1.0) as test_col from CUSTOMERS
cc @jrhemstad for libcudf visibility
@kkraus14 does this relate to the bitmask functions you want libcudf to implement?
@jrhemstad this would be specifically related to binary / unary ops with regards to the bitmask.
postgres=# INSERT INTO CUSTOMERS (ID, NAME, AGE, ADDRESS, SALARY)
postgres-# VALUES
postgres-# (1, 'James Dean', 32, 'Detroit', 200.0),
postgres-# (2, 'James McGhan', 33, 'Los Angeles', 300.0),
postgres-# (3, 'Empty Salary', 33, 'New York', NULL),
postgres-# (4, 'Nan Salary', 33, 'Santa Clara', 0.0);
INSERT 0 4
postgres=# update customers set salary = 'NaN' where ID=4;
UPDATE 1
postgres=# select * from customers
postgres-# ;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
3 | Empty Salary | 33 | New York |
4 | Nan Salary | 33 | Santa Clara | NaN
(4 rows)
postgres=# select * from CUSTOMERS where salary > 1.0;
id | name | age | address | salary
----+--------------+-----+---------------------------+--------
1 | James Dean | 32 | Detroit | 200.00
2 | James McGhan | 33 | Los Angeles | 300.00
4 | Nan Salary | 33 | Santa Clara | NaN
(3 rows)
Note: In most implementations of the "not-a-number" concept, NaN is not considered equal to any other numeric value (including NaN). In order to allow numeric values to be sorted and used in tree-based indexes, PostgreSQL treats NaN values as equal, and greater than all non-NaN values.
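Worth noting: Python and NumPy floats follow IEEE 754 here, not the PostgreSQL rule, so every comparison involving NaN is false and a NaN row would be dropped by `salary > 1.0` rather than kept as in the postgres output above. A quick check:

```python
# IEEE 754 semantics: NaN compares false against everything, including itself.
nan = float('nan')
print(nan > 1.0)   # False
print(nan == nan)  # False

# PostgreSQL instead treats NaN values as equal to each other and greater
# than all non-NaN values, which is why row 4 survived the query above.
```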
I also ran into this and came up with this reproducer before I found that the issue had already been raised. Pasting it here anyway in case it's useful for reproduction or a test.
import cudf
df = cudf.DataFrame({'x': [1, 2, None, 4], 'y': [1, 2, 3, 4]})
df[df.x > 2]
@mrocklin what would your expected output here be? Would null > 2 return false? Would null < 2 return false? Would null > 2 return null? If selecting by boolean mask I assume treat null as False?
I don't know. I would defer to whatever databases do. It looks like @quasiben may have some ideas above.
This is related to #902 which will allow us to move the boolean masking functionality to the libcudf side.
It's starting to look like "better null support" might be a theme for 0.7
If cudf-py is trying to be a drop-in replacement for pandas, then I would say it should do the pandas thing:
In [4]: import pandas as pd
...: df = pd.DataFrame({'x': [1, 2, None, 4], 'y': [1, 2, 3, 4]})
...: df[df.x > 2]
...:
Out[4]:
x y
3 4.0 4
In [5]: df.x.sum()
Out[5]: 7.0
I think that we can deviate from Pandas from time to time. Null handling is something that Pandas perhaps sees as a wart, and they seem to be moving towards nullable columns in 0.24. I suggest that we defer to SQL semantics here if possible.
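Concretely, deferring to SQL would mean three-valued logic for comparisons. A sketch using Python's None for NULL (a hypothetical helper, not a proposed cudf API) answers the earlier questions: null > 2 yields null, not False, and selection then treats null like False:

```python
def sql_gt(a, b):
    """SQL semantics: any comparison involving NULL yields NULL."""
    if a is None or b is None:
        return None
    return a > b

print(sql_gt(None, 2))  # None -- NULL > 2 is NULL, not False
print(sql_gt(3, 2))     # True

# WHERE keeps only rows where the predicate is TRUE, so NULL behaves
# like False at selection time even though it is not equal to False.
rows = [1, 2, None, 4]
print([r for r in rows if sql_gt(r, 2) is True])  # [4]
```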
This is related to #902 which will allow us to move the boolean masking functionality to the libcudf side.
Note that you can _already_ do this, #902 simply is an optimization (though likely will rename the API...)
apply_boolean_mask will not make it into 0.7, so moving this to 0.8.
apply_boolean_mask was merged into 0.8; reassigning to @shwina since he's been poking at the relevant Python / Cython bits.