Cudf: [BUG] Boolean indexing on a series/dataframe replaces nulls with 0

Created on 16 Apr 2020 · 7 Comments · Source: rapidsai/cudf

Describe the bug

read_orc() converts some null values to 0 when reading from HDFS, but not every null value in a given column is changed. The affected columns have int and float data types.

Environment overview (please complete the following information)

- Environment location: Docker (containerized environment)
- Method of cuDF install: conda (rapids packages are installed using conda package manager)
- GPUs: Tesla V100 (32 GB memory)

bug libcudf

Source

MikeChenfu

All 7 comments

@MikeChenfu is there any chance you could share a data sample either publicly or privately that reproduces this or build a reproducer? Otherwise we don't have too much to go on here.

kkraus14 on 16 Apr 2020

Is it reading through the Arrow::RandomAccessFile interface in the C++ API?

OlivierNV on 17 Apr 2020

@kkraus14 I have created a sample data file that reproduces it. It turns out read_orc itself is fine; the null is changed to 0 when I apply a filter.

>>> import cudf
>>> s = cudf.read_orc('test.orc', use_index=False)
>>> s[:5]

   col1     col2  col3        col4  col5
0     1  1614478  null  grandtruth     0
1     1  6738421  null  grandtruth     0
2     1  7232952  null  grandtruth     0
3     1  7972961  null  grandtruth     0
4     1  8443301  null  grandtruth     0

>>> s[(s.col1 == 1) & (s.col2 == 6738421)]

   col1     col2  col3        col4  col5
1     1  6738421   0.0  grandtruth     0

MikeChenfu on 17 Apr 2020

Is there any chance you can share the test.orc file here for us to reproduce?

kkraus14 on 17 Apr 2020

test.orc.zip
Here is the sample data.

MikeChenfu on 17 Apr 2020

@rgsl888prabhu Could you take a look at this? Looks like a bug in boolean masking.

kkraus14 on 17 Apr 2020

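Adding another reproducer here:

>>> import cudf
>>> import numpy as np

>>> n_elem = 81_920  # Fails: the null comes back as 0.0
>>> a = cudf.Series(np.ones(n_elem))
>>> a[0] = None
>>> a[a.isna()]
0    0.0
dtype: float64

>>> n_elem = 81_919  # Works: the null is preserved
>>> a = cudf.Series(np.ones(n_elem))
>>> a[0] = None
>>> a[a.isna()]
0    null
dtype: float64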
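
For reference, the behavior the reproducer expects can be sketched on the CPU without cuDF. The following is a minimal illustrative model, not libcudf's implementation: `NullableColumn`, `apply_boolean_mask`, and `make_column` are hypothetical names invented for this sketch, and the column is assumed to be a list of values paired with a per-row validity flag. The point it shows is that a boolean filter has to gather the validity flags together with the values, so the selected row stays null regardless of column length.

```python
# Illustrative CPU-only model of null-preserving boolean indexing.
# NOT libcudf's implementation: NullableColumn, apply_boolean_mask and
# make_column are hypothetical names used only for this sketch.
from dataclasses import dataclass
from typing import List


@dataclass
class NullableColumn:
    values: List[float]  # payload; the slot under a null is unspecified
    valid: List[bool]    # True where the row holds a real value

    def isna(self) -> List[bool]:
        return [not v for v in self.valid]

    def null_count(self) -> int:
        return self.valid.count(False)


def apply_boolean_mask(col: NullableColumn, mask: List[bool]) -> NullableColumn:
    """Keep rows where mask is True, gathering validity along with values."""
    if len(mask) != len(col.values):
        raise ValueError("mask length must match column length")
    kept = [i for i, keep in enumerate(mask) if keep]
    return NullableColumn(
        values=[col.values[i] for i in kept],
        valid=[col.valid[i] for i in kept],
    )


def make_column(n_elem: int) -> NullableColumn:
    """Mirror the reproducer: n_elem ones with row 0 set to null."""
    col = NullableColumn(values=[1.0] * n_elem, valid=[True] * n_elem)
    col.valid[0] = False
    col.values[0] = 0.0  # placeholder payload under the null
    return col


for n_elem in (81_919, 81_920):
    col = make_column(n_elem)
    out = apply_boolean_mask(col, col.isna())
    # The single selected row must still be null at both sizes.
    print(n_elem, len(out.values), out.null_count())
```

In this model both sizes yield one selected row with a null count of 1. The equivalent check against cuDF would compare the null count of `a[a.isna()]` with 1 on either side of the 81_920 boundary.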