Description:
The percentile_weightnening kernel in cupy.percentile reads one element past the end of the input array when it is asked to calculate the 100th percentile. If a NaN happens to be stored at that device memory location, the reported 100th percentile is NaN rather than the largest value in the array. Because that element is multiplied by zero (weight_above), this issue _only_ manifests when the value past the end of the array is a NaN (or an infinity, since zero times infinity is also NaN).
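To make the arithmetic concrete, here is a minimal pure-Python sketch of linear-interpolation percentiles. It is a simplified model written for this report, not the kernel source: the function name interpolate_unguarded and the trailing slot in buf (standing in for whatever sits in device memory just past the array) are assumptions for illustration.

```python
import math


def interpolate_unguarded(buf, n, q):
    # Hypothetical scalar model, not CuPy's kernel source.
    # buf: n sorted values plus one trailing slot standing in for the
    # memory just past the end of the array.
    pos = q / 100.0 * (n - 1)   # fractional position in the sorted data
    below = math.floor(pos)
    weight_above = pos - below  # exactly 0.0 when q == 100
    # below + 1 equals n when q == 100, i.e. one element past the end
    return buf[below] * (1.0 - weight_above) + buf[below + 1] * weight_above


data = [0.0, 1.0, 2.0, 3.0, 4.0]
benign = data + [123.0]           # finite neighbour: the zero weight hides it
poisoned = data + [float("nan")]  # NaN neighbour: 0 * NaN is NaN

print([interpolate_unguarded(benign, 5, q) for q in (0, 50, 100)])
# [0.0, 2.0, 4.0]
print([interpolate_unguarded(poisoned, 5, q) for q in (0, 50, 100)])
# [0.0, 2.0, nan]
```

The finite neighbour never shows up in the result, which is why the out-of-bounds read usually goes unnoticed.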
Conditions
Code to reproduce (SEE COMMENT BELOW FOR BETTER REPRODUCER)
This is somewhat difficult to reproduce consistently without modifying CuPy code. One simply has to be unlucky about the value that happens to be at the memory address just past the end of the input array. If we add the following lines to order.py at line 241, just before the invocation of percentile_weightnening, however, we can force the issue:
# Allocate new array with spot for NaN at the end
ap_new = cupy.zeros((ap.shape[0] + 1,), dtype=ap.dtype)
# Put NaN at end of array
ap_new[-1] = cupy.nan
# Set rest of array to the original values
ap_new[:-1] = ap
# Set the input array to percentile_weightnening to the original values (without NaN)
ap = ap_new[:-1]
Note that we are still passing the same values into the kernel; we have merely ensured that the memory just past the end of the input array contains a NaN.
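The trick relies on basic slicing returning a view rather than a copy. As a rough CPU-side check, the same steps can be run with NumPy standing in for CuPy (an assumption made here so that no GPU is needed; the variable name view is also just for illustration):

```python
import numpy as np  # NumPy used as a CPU stand-in for CuPy

ap = np.array([0, 1, 2, 3, 4], dtype="float32")

# Same steps as the patch above
ap_new = np.zeros((ap.shape[0] + 1,), dtype=ap.dtype)
ap_new[-1] = np.nan
ap_new[:-1] = ap
view = ap_new[:-1]

# The slice is a view into ap_new's buffer; no copy is made
print(np.shares_memory(view, ap_new))
# The kernel would still see the same values as the original input
print(np.array_equal(view, ap))
# The element one position past the view's last element is the NaN
print(np.isnan(ap_new[view.shape[0]]))
```

All three checks print True, so the values passed on are unchanged while the neighbouring element is guaranteed to be NaN.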
With this modification in place, the following will consistently reproduce the issue:
import cupy
quantiles = cupy.array([0, 50, 100], dtype='float32')
arr = cupy.array([0, 1, 2, 3, 4], dtype='float32')
print(cupy.percentile(arr, quantiles))
# Expected output: [0, 2, 4]
# Actual output: [0, 2, nan]
Without modifying the CuPy code, we have to be unlucky to encounter this bug. I first discovered this issue due to an intermittent failure in a cuML unit test, as described here.
The memory checker output below shows the out-of-bounds read in percentile_weightnening:
========= Invalid __global__ read of size 4
========= at 0x00000190 in percentile_weightnening
========= by thread (5,0,0) in block (0,0,0)
========= Address 0x7faa0480abd0 is out of bounds
========= Device Frame:percentile_weightnening (percentile_weightnening : 0x190)
We can also confirm that the array index in this line can reach a value equal to the size of the array by adding a printf inside the kernel.
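One possible way to guard against an index equal to the array size is to clamp the upper index to the last valid element. The sketch below is a hypothetical pure-Python model of that idea (the name interpolate_clamped is invented for illustration) and is not necessarily the approach taken in the eventual fix:

```python
import math


def interpolate_clamped(buf, n, q):
    # Hypothetical guard, shown for illustration only.
    pos = q / 100.0 * (n - 1)   # fractional position in the sorted data
    below = math.floor(pos)
    weight_above = pos - below
    above = min(below + 1, n - 1)  # never index past the last valid element
    return buf[below] * (1.0 - weight_above) + buf[above] * weight_above


# Five sorted values followed by a NaN standing in for the memory
# just past the end of the array.
poisoned = [0.0, 1.0, 2.0, 3.0, 4.0, float("nan")]
print([interpolate_clamped(poisoned, 5, q) for q in (0, 50, 100)])
# [0.0, 2.0, 4.0]
```

Clamping does not change any in-range result, because the upper index is only adjusted in the case where its weight is zero.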
I've just managed to find a consistent reproducer that does not require any modification of cupy code. By flooding the available GPU memory with NaNs first, we can ensure that a NaN will be found immediately after the input array. I don't know if this will reproduce the issue 100% of the time, and you may need to adjust the size of the NaN array based on your available GPU memory, but the following has reproduced the issue every time I've run it on a Quadro RTX 8000:
import cupy
import gc
nans = cupy.full((int(1e10),), cupy.nan, dtype='float32')
del nans
gc.collect()
quantiles = cupy.array([0, 50, 100], dtype='float32')
arr = cupy.array([0, 1, 2, 3, 4], dtype='float32')
print(cupy.percentile(arr, quantiles))
# Expected output: [0, 2, 4]
# Actual output: [0, 2, nan]
I'm working on a fix for this and will post a PR tomorrow.
PR #4453 has the fix. I've marked it as WIP pending feedback on the associated unit test.