Results differ from Pandas for the rfloordiv and rmod operations between booleans and integers. Can we match the Pandas behaviour?
In [37]: a = cudf.Series([True, False])
In [38]: b = cudf.Series([5, 5], dtype='int16')
In [39]: print(a // b)
0 0
1 0
dtype: int16
In [40]: print(b // a)
0 5
1 -1
dtype: int16
In [41]: a = pd.Series([True, False])
In [42]: b = pd.Series([5, 5], dtype='int16')
In [43]: print(a // b)
0 0
1 0
dtype: int16
In [44]: print(b // a)
0 5
1 0
dtype: int16
In [47]: a = cudf.Series([True, False])
In [48]: b = cudf.Series([5, 5], dtype='int32')
In [49]: print(a // b)
0 0
1 0
dtype: int32
In [50]: print(b // a)
0 5
1 2147483647
dtype: int32
In [51]: a = pd.Series([True, False])
In [52]: b = pd.Series([5, 5], dtype='int32')
In [53]: print(a // b)
0 0
1 0
dtype: int32
In [55]: print(b // a)
0 5
1 0
dtype: int32
CC: @devavret @jrhemstad
I think a C++ user would find division by zero resulting in zero to be surprising. Shouldn't the result be null / NaN? I notice that Python 3.6 throws an error if I try to divide by False:
>>> 5 // False
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: integer division or modulo by zero
@shwina @kkraus14 can you respond to my comment above?
We have to support multiple language front ends, and also stay true to C++. I think we need to gather expectations and choose an appropriate path forward.
It looks like Pandas has semi-arbitrary behavior here depending on the data:
In [61]: pd.Series([5, 5], dtype='int32') // pd.Series([1, 0], dtype='int8')
Out[61]:
0 5.0
1 inf
dtype: float64
In [62]: pd.Series([5, 5], dtype='int32') // pd.Series([1, 1], dtype='int8')
Out[62]:
0 5
1 5
dtype: int32
If the right hand side is bool then it maintains the left hand side dtype and returns 0 instead of inf / nan. This feels like incorrect behavior in general here and I think we should just throw on divide by zero for now.
We can't throw on divide by zero. The division happens in GPU device code which can't throw. (GPUs don't have signaling NaNs / arithmetic exceptions because they would be very expensive / slow.)
An ugly alternative is to first do something like an any_of over the divisor elements to check for 0 and throw if it returns true.
Maybe that should be done in the bindings because clearly the required behavior for matching Pandas is different than other languages...
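To make the "check first, then throw" idea concrete, here is a minimal sketch of what such a bindings-level guard could look like, using plain Python lists in place of columns. The name `checked_floordiv` is hypothetical, not a cuDF API; the point is that the zero-scan (the equivalent of an `any_of` over the divisor) happens on the host before any device kernel runs, since the kernel itself cannot throw.

```python
def checked_floordiv(lhs, rhs):
    # Host-side pre-pass over the divisor: if any element is zero,
    # raise before launching the (non-throwing) device kernel.
    if any(d == 0 for d in rhs):
        raise ZeroDivisionError("integer division or modulo by zero")
    return [a // b for a, b in zip(lhs, rhs)]
```

This matches the Python scalar behavior (`5 // False` raises ZeroDivisionError) at the cost of an extra pass over the divisor column.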
What is the current behavior when an input on the rhs of the operation is 0? We don't necessarily have to match Pandas behavior here, especially if it doesn't make sense.
I believe you would get NaN.
How about setting the output value to null? Since there's not really a C++ parallel here, it's not technically inconsistent.
The problem with this is then we can't differentiate between when we had input nulls that should propagate through the binaryop.
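The ambiguity being described can be shown with a tiny sketch, using `None` to stand in for null. `div_null_on_zero` is a hypothetical helper, not cuDF code: once divide-by-zero also produces null, an output null no longer tells you whether the input was null or the divisor was zero.

```python
def div_null_on_zero(lhs, rhs):
    # Null inputs propagate, and divide-by-zero ALSO becomes null,
    # so the two causes collapse into the same output value.
    return [None if (a is None or b is None or b == 0) else a // b
            for a, b in zip(lhs, rhs)]
```

Both a null divisor and a zero divisor yield `None` here, which is exactly the information loss the comment above is pointing at.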
have we checked the behavior in pandas 1.0?
I'm 95% sure Pandas just defers down to numpy here so I don't think the behavior will have changed. Especially because there's no new nullable Float types yet.
> How about setting the output value to null? Since there's not really a C++ parallel here, it's not technically inconsistent.

> The problem with this is then we can't differentiate between when we had input nulls that should propagate through the binaryop.
Understood. Then how do you want to treat the result of x/0? I think SQL throws an exception. If an exception is ok, then we might want to do what @jrhemstad said and use any_of. That will make us consistent with SQL at least.
Binary ops should not be modified to do this. Detecting division by zero in every element could be a large overhead in cases where you know there are no zeros. Instead, I think you should _optionally_ check for NaNs after or for zeros before.
Agreed. The primitives are in place (or can be made) such that the higher level library can implement whatever safety checking they desire.
> That will make us consistent with SQL at least.
For reference, Spark SQL performs a pre-pass on the divisor and replaces zero with null so the result of dividing by zero is null instead of NaN.
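A rough sketch of that Spark-style approach, again with Python lists and `None` as null (`spark_style_div` is a hypothetical name): the pre-pass turns zero divisors into null, and the ordinary null-propagating division then yields null for x / 0 without the kernel ever seeing a zero.

```python
def spark_style_div(lhs, rhs):
    # Pre-pass: replace zero divisors with null.
    masked = [None if d == 0 else d for d in rhs]
    # Ordinary null-propagating division; x / 0 now comes out null.
    return [None if (a is None or b is None) else a / b
            for a, b in zip(lhs, masked)]
```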
This just reinforces the idea that libcudf binops shouldn't do anything different than they are currently doing because high level consumers have different requirements and so we should just provide the necessary primitives to satisfy those requirements.
And "False / 0 == 0" is just trouble. Removing libcudf label, adding cuDF Python label.
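If the pandas-matching logic does end up in the cuDF Python layer, one possible shape is a post-check on zero divisors that upcasts and emits inf / -inf / nan, the result pandas shows for the integer-divisor case in In [61] above. This is a hypothetical sketch over plain lists, not cuDF code, and it deliberately leaves out the bool-rhs case, which pandas handles differently (returning 0 while keeping the lhs dtype):

```python
import math

def pandas_style_floordiv(lhs, rhs):
    # Where the divisor is zero, upcast and emit inf / -inf / nan
    # (matching the pandas int // int result); otherwise floor-divide.
    out = []
    for a, b in zip(lhs, rhs):
        if b == 0:
            out.append(math.inf if a > 0 else -math.inf if a < 0 else math.nan)
        else:
            out.append(a // b)
    return out
```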