Cudf: [BUG] Boolean indexing on a series/dataframe replaces nulls with 0

Created on 16 Apr 2020 · 7 Comments · Source: rapidsai/cudf

Describe the bug

read_orc() converted some null values to 0 when reading from HDFS, but not every null in a given column was changed. The data contains int and float columns.

Environment overview

  • Environment location: Docker (containerized environment)
  • Method of cuDF install: conda (RAPIDS packages are installed using the conda package manager)
  • GPUs: Tesla V100 (32 GB memory)

bug libcudf


All 7 comments

@MikeChenfu is there any chance you could share a data sample either publicly or privately that reproduces this or build a reproducer? Otherwise we don't have too much to go on here.

Is it reading through the Arrow::RandomAccessFile interface in the C++ API?

@kkraus14 I have created a sample data file that reproduces it. It turns out read_orc itself is fine, but the null is changed to 0 when I apply a filter.

>>> import cudf
>>> s = cudf.read_orc('test.orc', use_index=False)
>>> s[:5]
   col1     col2  col3        col4  col5
0     1  1614478  null  grandtruth     0
1     1  6738421  null  grandtruth     0
2     1  7232952  null  grandtruth     0
3     1  7972961  null  grandtruth     0
4     1  8443301  null  grandtruth     0

>>> s[(s.col1 == 1) & (s.col2 == 6738421)]
   col1     col2  col3        col4  col5
1     1  6738421   0.0  grandtruth     0

Is there any chance you can share the test.orc file here for us to reproduce?

test.orc.zip
Here is the sample data.

@rgsl888prabhu Could you take a look at this? Looks like a bug in boolean masking.

Adding another reproducer here:

import cudf
import numpy as np

n_elem = 81_920  # Fails: the null comes back as 0.0
a = cudf.Series(np.ones(n_elem))
a[0] = None
a[a.isna()]
0    0.0
dtype: float64

n_elem = 81_919  # Works: the null is preserved
a = cudf.Series(np.ones(n_elem))
a[0] = None
a[a.isna()]
0    null
dtype: float64
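
For reference, the expected behaviour at both sizes is that the selected row stays null. Below is a minimal pure-Python model of that expectation. It assumes a column is stored as a list of values plus a separate validity list, and that a placeholder 0.0 sits under each null. Both are assumptions of this sketch, and the helper names make_column, is_null and apply_boolean_mask are hypothetical. This is not cuDF's implementation and says nothing about the root cause; it only shows that a boolean mask has to carry validity along with the values.

```python
from typing import List, Optional, Tuple

# Hypothetical model of a nullable column: (values, validity).
# validity[i] is False when row i is null.
Column = Tuple[List[float], List[bool]]


def make_column(items: List[Optional[float]]) -> Column:
    """Split a list containing None into values plus a validity list."""
    # Assumption of this sketch: 0.0 is stored as a placeholder under nulls.
    values = [0.0 if x is None else x for x in items]
    validity = [x is not None for x in items]
    return values, validity


def is_null(col: Column) -> List[bool]:
    """Return a boolean mask that is True where the column is null."""
    _, validity = col
    return [not ok for ok in validity]


def apply_boolean_mask(col: Column, mask: List[bool]) -> List[Optional[float]]:
    """Keep rows where mask is True, carrying validity with each value."""
    values, validity = col
    return [v if ok else None
            for v, ok, keep in zip(values, validity, mask) if keep]


# Same sizes as the reproducer above; the selected row should stay null.
for n_elem in (81_919, 81_920):
    col = make_column([None] + [1.0] * (n_elem - 1))
    print(n_elem, apply_boolean_mask(col, is_null(col)))
```

In this model the result is a single null row at both sizes, which is what `a[a.isna()]` would be expected to return regardless of `n_elem`.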