Results differ from Pandas for the rfloordiv and rmod operations between booleans and integers. Can we match the Pandas behaviour?
In [37]: a = cudf.Series([True, False])
In [38]: b = cudf.Series([5, 5], dtype='int16')
In [39]: print(a // b)
0 0
1 0
dtype: int16
In [40]: print(b // a)
0 5
1 -1
dtype: int16
In [41]: a = pd.Series([True, False])
In [42]: b = pd.Series([5, 5], dtype='int16')
In [43]: print(a // b)
0 0
1 0
dtype: int16
In [44]: print(b // a)
0 5
1 0
dtype: int16
In [47]: a = cudf.Series([True, False])
In [48]: b = cudf.Series([5, 5], dtype='int32')
In [49]: print(a // b)
0 0
1 0
dtype: int32
In [50]: print(b // a)
0 5
1 2147483647
dtype: int32
In [51]: a = pd.Series([True, False])
In [52]: b = pd.Series([5, 5], dtype='int32')
In [53]: print(a // b)
0 0
1 0
dtype: int32
In [55]: print(b // a)
0 5
1 0
dtype: int32
CC: @devavret @jrhemstad
I think a C++ user would find division by zero resulting in zero to be surprising. Shouldn't the result be null / NaN? I notice that Python 3.6 throws an error if I try to divide by False:
>>> 5 // False
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
@shwina @kkraus14 can you respond to my comment above?
We have to support multiple language front ends, and also stay true to C++. I think we need to gather expectations and choose an appropriate path forward.
It looks like Pandas has semi-arbitrary behavior here depending on the data:
In [61]: pd.Series([5, 5], dtype='int32') // pd.Series([1, 0], dtype='int8')
Out[61]:
0 5.0
1 inf
dtype: float64
In [62]: pd.Series([5, 5], dtype='int32') // pd.Series([1, 1], dtype='int8')
Out[62]:
0 5
1 5
dtype: int32
If the right hand side is bool then it maintains the left hand side dtype and returns 0 instead of inf / nan. This feels like incorrect behavior in general here and I think we should just throw on divide by zero for now.
We can't throw on divide by zero. The division happens in GPU device code which can't throw. (GPUs don't have signaling NaNs / arithmetic exceptions because they would be very expensive / slow.)
An ugly alternative is to first do something like an any_of over the divisor elements to check for 0 and throw if it returns true.
Maybe that should be done in the bindings because clearly the required behavior for matching Pandas is different than other languages...
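To make the "check first, then throw" idea concrete, here is a minimal sketch of what such a bindings-level guard could look like, using plain Python lists in place of columns. The name `checked_floordiv` is hypothetical, not a cuDF API; the point is that the zero-scan (the equivalent of an `any_of` over the divisor) happens on the host before any device kernel runs, since the kernel itself cannot throw.

```python
def checked_floordiv(lhs, rhs):
    # Host-side pre-pass over the divisor: if any element is zero,
    # raise before launching the (non-throwing) device kernel.
    if any(d == 0 for d in rhs):
        raise ZeroDivisionError("integer division or modulo by zero")
    return [a // b for a, b in zip(lhs, rhs)]
```

This matches the Python scalar behavior (`5 // False` raises ZeroDivisionError) at the cost of an extra pass over the divisor column.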
What is the current behavior when an input on the rhs of the operation is 0? We don't necessarily have to match Pandas behavior here, especially if it doesn't make sense.
I believe you would get NaN.
How about setting the output value to null? Since there's not really a C++ parallel here, it's not technically inconsistent.
The problem with this is then we can't differentiate between when we had input nulls that should propagate through the binaryop.
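The ambiguity being described can be shown with a tiny sketch, using `None` to stand in for null. `div_null_on_zero` is a hypothetical helper, not cuDF code: once divide-by-zero also produces null, an output null no longer tells you whether the input was null or the divisor was zero.

```python
def div_null_on_zero(lhs, rhs):
    # Null inputs propagate, and divide-by-zero ALSO becomes null,
    # so the two causes collapse into the same output value.
    return [None if (a is None or b is None or b == 0) else a // b
            for a, b in zip(lhs, rhs)]
```

Both a null divisor and a zero divisor yield `None` here, which is exactly the information loss the comment above is pointing at.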
have we checked the behavior in pandas 1.0?
I'm 95% sure Pandas just defers down to numpy here so I don't think the behavior will have changed. Especially because there's no new nullable Float types yet.
> How about setting the output value to null? Since there's not really a C++ parallel here, it's not technically inconsistent.

> The problem with this is then we can't differentiate between when we had input nulls that should propagate through the binaryop.
Understood. Then how do you want to treat the result of x/0? I think SQL throws an exception. If an exception is ok, then we might want to do what @jrhemstad said and use any_of. That will make us consistent with SQL at least.
Binary ops should not be modified to do this. Detecting division by zero in every element could be a large overhead in cases where you know there are no zeros. Instead, I think you should _optionally_ check for NaNs after or for zeros before.
Agreed. The primitives are in place (or can be made) such that the higher level library can implement whatever safety checking they desire.
> That will make us consistent with SQL at least.
For reference, Spark SQL performs a pre-pass on the divisor and replaces zero with null so the result of dividing by zero is null instead of NaN.
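A rough sketch of that Spark-style approach, again with Python lists and `None` as null (`spark_style_div` is a hypothetical name): the pre-pass turns zero divisors into null, and the ordinary null-propagating division then yields null for x / 0 without the kernel ever seeing a zero.

```python
def spark_style_div(lhs, rhs):
    # Pre-pass: replace zero divisors with null.
    masked = [None if d == 0 else d for d in rhs]
    # Ordinary null-propagating division; x / 0 now comes out null.
    return [None if (a is None or b is None) else a / b
            for a, b in zip(lhs, masked)]
```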
This just reinforces the idea that libcudf binops shouldn't do anything different than they are currently doing because high level consumers have different requirements and so we should just provide the necessary primitives to satisfy those requirements.
And "False / 0 == 0" is just trouble. Removing libcudf label, adding cuDF Python label.
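If the pandas-matching logic does end up in the cuDF Python layer, one possible shape is a post-check on zero divisors that upcasts and emits inf / -inf / nan, the result pandas shows for the integer-divisor case in In [61] above. This is a hypothetical sketch over plain lists, not cuDF code, and it deliberately leaves out the bool-rhs case, which pandas handles differently (returning 0 while keeping the lhs dtype):

```python
import math

def pandas_style_floordiv(lhs, rhs):
    # Where the divisor is zero, upcast and emit inf / -inf / nan
    # (matching the pandas int // int result); otherwise floor-divide.
    out = []
    for a, b in zip(lhs, rhs):
        if b == 0:
            out.append(math.inf if a > 0 else -math.inf if a < 0 else math.nan)
        else:
            out.append(a // b)
    return out
```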