Describe the bug
read_orc() converts some null values to 0 when reading from HDFS, but not every null in a given column is changed. The data contains int and float columns.
Environment overview (please complete the following information)
docker pull & docker run commands used
@MikeChenfu is there any chance you could share a data sample, either publicly or privately, that reproduces this, or build a reproducer? Otherwise we don't have much to go on here.
Is it reading through the Arrow::RandomAccessFile interface in the C++ API?
@kkraus14 I have created sample data that reproduces it. It turns out read_orc itself is fine; the null is changed when I apply a filter.
>>> s = cudf.read_orc('test.orc', use_index=False)
>>> s[:5]
   col1     col2  col3        col4  col5
0     1  1614478  null  grandtruth     0
1     1  6738421  null  grandtruth     0
2     1  7232952  null  grandtruth     0
3     1  7972961  null  grandtruth     0
4     1  8443301  null  grandtruth     0
>>> s[(s.col1 == 1) & (s.col2 == 6738421)]
   col1     col2  col3        col4  col5
1     1  6738421   0.0  grandtruth     0
Is there any chance you can share the test.orc file here for us to reproduce?
test.orc.zip
Here is the sample data.
@rgsl888prabhu Could you take a look at this? Looks like a bug in boolean masking.
Adding another reproducer here:
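>>> import cudf
>>> import numpy as np
>>> n_elem = 81_920  # Fails: the null comes back as 0.0
>>> a = cudf.Series(np.ones(n_elem))
>>> a[0] = None
>>> a[a.isna()]
0    0.0
dtype: float64
>>> n_elem = 81_919  # Works: the null is preserved
>>> a = cudf.Series(np.ones(n_elem))
>>> a[0] = None
>>> a[a.isna()]
0    null
dtype: float64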
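For comparison, here is a minimal pure-Python sketch of the result the reproducer above should give at both sizes. It is not cuDF code and says nothing about where the bug lives; isna, boolean_mask and masked_nulls are hypothetical helper names used only for illustration, with None standing in for null.

```python
# Pure-Python reference for boolean masking with nulls (None).
# Not cuDF code: these helpers are hypothetical and only illustrate
# the expected semantics of a[a.isna()].

def isna(values):
    """Return a list of booleans marking which entries are null (None)."""
    return [v is None for v in values]


def boolean_mask(values, mask):
    """Keep values[i] wherever mask[i] is True; a kept None stays None."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    return [v for v, keep in zip(values, mask) if keep]


def masked_nulls(n_elem):
    """Mirror the reproducer: a column of ones with a null in row 0, filtered by isna."""
    a = [1.0] * n_elem
    a[0] = None
    return boolean_mask(a, isna(a))


if __name__ == "__main__":
    for n_elem in (81_919, 81_920):
        print(n_elem, masked_nulls(n_elem))
```

Under these reference semantics the filtered result is a single null for both 81_919 and 81_920 elements, so the 0.0 returned at 81_920 is the unexpected output.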