Cudf: [BUG] Exceptions loading Arrows with empty/na columns

Created on 10 Aug 2020 · 9 comments · Source: rapidsai/cudf

Describe the bug
When Arrow columns are all-null (or a mix of nulls and empty strings), cudf conversion fails. The failure modes appear to differ depending on whether we read from a direct buffer or through a file reader. This complicates scenarios where end users provide files, since we have little control over cleaning before ingest, only after.

Steps/Code to reproduce bug

Selectively uncomment `bad1`, `bad2`, and `bad3` to trigger the failures:

import cudf, pandas as pd, pyarrow as pa

file_path = './arr0.arrow'
df = pd.DataFrame({
    'ok1': ['a', 'b', 'c'],
    'ok2': ['x', '', ''],
    'ok3': ['x', None, None],
    #'bad1': [None, None, None],
    #'bad2': ['', '', '']
})
#df['bad3'] = None

arr0 = pa.Table.from_pandas(df, preserve_index=False)

writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
writer.write(arr0)
writer.close()

# Test pa reads file without exn
reader = pa.ipc.open_file(file_path)
arr1 = reader.read_all()
arr1.to_pandas()

# Test cudf reads orig buffer without exn
gdf1 = cudf.DataFrame.from_arrow(arr0)

# Test cudf reads file without exn
gdf2 = cudf.DataFrame.from_arrow(arr1)

Expected behavior

No exceptions to be thrown

Environment overview (please complete the following information)

cudf 0.14.1 via conda in docker (graphistry/graphistry-forge-base:latest -> source activate rapids)

Environment details

multiple envs (ubuntu 18 / azure v100, ...)

Additional context

Not yet tested on 0.15 / nightly

Labels: bug, cuDF (Python)

All 9 comments

More fun: by indirecting through pandas (`cudf.DataFrame.from_pandas(arr0.to_pandas())`, and likewise for `arr1`), the Arrow buffers seem to load fine.

Tested the reproducer provided in https://github.com/rapidsai/cudf/issues/5898#issue-675972750 with cudf 0.15 on CUDA 11.0, and the code does not throw any exceptions.

In [2]: gdf2
Out[2]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>

In [3]: gdf1
Out[3]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>

@Salonijain27 looks like you didn't run the commented-out part, which is what causes the errors. Could you uncomment the bad1 / bad2 lines and report back here? cc @rgsl888prabhu as you've been poking around this area.

@kkraus14 I tried with the commented-out part enabled and it works as expected; this is on 0.16.

>>> import cudf
>>> import cudf, pandas as pd, pyarrow as pa
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
...     'ok1': ['a', 'b', 'c'],
...     'ok2': ['x', '', ''],
...     'ok3': ['x', None, None],
...     'bad1': [None, None, None],
...     'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>

@rgsl888prabhu does that include the recent arrow changes that you've been working on plumbing into Cython / Python or not? I'm wondering if this is fixed as of 0.15 or not.

This is without my Cython plumbing, so most likely this was fixed in 0.15.

@Salonijain27 any chance you could give this a shot with the latest 0.15 nightlies and report back?

In 0.15 nightly,

>>> import cudf
>>> cudf.__version__
'0.15.0a+4742.gb639039fc'
>>> import cudf, pandas as pd, pyarrow as pa
>>> 
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
...     'ok1': ['a', 'b', 'c'],
...     'ok2': ['x', '', ''],
...     'ok3': ['x', None, None],
...     'bad1': [None, None, None],
...     'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>> 
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>> 
>>> # Test pa reads file without exn
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>> 
>>> # Test cudf reads orig buffer without exn
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)

>>> 
>>> # Test cudf reads file without exn
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>

Great, looks like this is fixed as of the latest 0.15 nightly, so closing.
