Cudf: [BUG] Exceptions loading Arrows with empty/na columns

Created on 10 Aug 2020 · 9 Comments · Source: rapidsai/cudf

Describe the bug
When an Arrow column is entirely null (NA) or entirely empty strings, conversion to cudf fails. The failure modes seem to differ depending on whether we convert the original in-memory table or one read back through a file reader. This complicates scenarios where end users provide files, because we have little control over cleaning the data before ingest, only after.

Steps/Code to reproduce bug

Selectively uncomment bad1, bad2, bad3

import cudf, pandas as pd, pyarrow as pa

file_path = './arr0.arrow'
df = pd.DataFrame({
    'ok1': ['a', 'b', 'c'],
    'ok2': ['x', '', ''],
    'ok3': ['x', None, None],
    #'bad1': [None, None, None],
    #'bad2': ['', '', '']
})
#df['bad3'] = None

arr0 = pa.Table.from_pandas(df, preserve_index=False)

writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
writer.write(arr0)
writer.close()

# Test pa reads file without exn
reader = pa.ipc.open_file(file_path)
arr1 = reader.read_all()
arr1.to_pandas()

# Test cudf reads orig buffer without exn
gdf1 = cudf.DataFrame.from_arrow(arr0)

# Test cudf reads file without exn
gdf2 = cudf.DataFrame.from_arrow(arr1)

Expected behavior

No exceptions to be thrown

Environment overview (please complete the following information)

cudf 0.14.1 via conda in docker (graphistry/graphistry-forge-base:latest -> source activate rapids)

Environment details

multiple envs (ubuntu 18 / azure v100, ...)

Additional context

Not yet tested on 0.15 / nightly

bug cuDF (Python)


lmeyerov

All 9 comments

More fun: when going indirectly through pandas (cudf.DataFrame.from_pandas(arr0.to_pandas()), and likewise for arr1), the Arrow buffers seem to load fine.

lmeyerov on 10 Aug 2020

Tested the reproduction code from https://github.com/rapidsai/cudf/issues/5898#issue-675972750 with cudf 0.15 on CUDA 11.0; it does not throw any exceptions.

In [2]: gdf2
Out[2]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>

In [3]: gdf1
Out[3]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>

Salonijain27 on 11 Aug 2020


@Salonijain27 looks like you didn't run the commented-out part, which is what causes the errors. Could you uncomment the bad1 / bad2 lines and report back here? cc @rgsl888prabhu as you've been poking around this area.

kkraus14 on 18 Aug 2020

@kkraus14 I tried it with the commented-out lines enabled and it works as expected; this is on 0.16.

>>> import cudf
>>> import cudf, pandas as pd, pyarrow as pa
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
...     'ok1': ['a', 'b', 'c'],
...     'ok2': ['x', '', ''],
...     'ok3': ['x', None, None],
...     'bad1': [None, None, None],
...     'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>

rgsl888prabhu on 18 Aug 2020


@rgsl888prabhu does that include the recent arrow changes that you've been working on plumbing into Cython / Python or not? I'm wondering if this is fixed as of 0.15 or not.

kkraus14 on 18 Aug 2020

This is without my Cython plumbing, so it was most likely already fixed in 0.15.

rgsl888prabhu on 18 Aug 2020


@Salonijain27 any chance you could give this a shot with the latest 0.15 nightlies and report back?

kkraus14 on 18 Aug 2020

In the 0.15 nightly:

>>> import cudf
>>> cudf.__version__
'0.15.0a+4742.gb639039fc'
>>> import cudf, pandas as pd, pyarrow as pa
>>> 
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
...     'ok1': ['a', 'b', 'c'],
...     'ok2': ['x', '', ''],
...     'ok3': ['x', None, None],
...     'bad1': [None, None, None],
...     'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>> 
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>> 
>>> # Test pa reads file without exn
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>> 
>>> # Test cudf reads orig buffer without exn
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)

>>> 
>>> # Test cudf reads file without exn
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>

rgsl888prabhu on 18 Aug 2020

Great, looks like this is fixed as of latest 0.15, so closing.

kkraus14 on 18 Aug 2020
