Describe the bug
When Arrow columns are entirely null or contain a mix of nulls, cudf conversion fails. The failure modes seem to differ depending on whether we read a buffer directly or go through a file reader. This complicates scenarios where end users provide files, since we have little control over cleaning the data before ingest, only after.
Steps/Code to reproduce bug
Selectively uncomment the bad1, bad2, and bad3 lines:
import cudf, pandas as pd, pyarrow as pa
file_path = './arr0.arrow'
df = pd.DataFrame({
    'ok1': ['a', 'b', 'c'],
    'ok2': ['x', '', ''],
    'ok3': ['x', None, None],
    #'bad1': [None, None, None],
    #'bad2': ['', '', '']
})
#df['bad3'] = None
arr0 = pa.Table.from_pandas(df, preserve_index=False)
writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
writer.write(arr0)
writer.close()
# Test pa reads file without exn
reader = pa.ipc.open_file(file_path)
arr1 = reader.read_all()
arr1.to_pandas()
# Test cudf reads orig buffer without exn
gdf1 = cudf.DataFrame.from_arrow(arr0)
# Test cudf reads file without exn
gdf2 = cudf.DataFrame.from_arrow(arr1)
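To narrow down which column trips the conversion, and whether arr0 and arr1 fail differently, one option is to convert each column in isolation. The following is only a minimal sketch: probe_columns is a hypothetical helper (not part of cudf or pyarrow), and it assumes a pyarrow version whose Table provides column_names and select.

```python
def probe_columns(table, convert):
    """Try `convert` on each column of `table` in isolation.

    Assumptions (not from the original report): `table` behaves like a
    pyarrow.Table exposing `column_names` and `select`, and `convert` is
    any callable taking a table, e.g. cudf.DataFrame.from_arrow.

    Returns a dict mapping column name to None on success, or to a
    short "ExceptionType: message" string on failure.
    """
    results = {}
    for name in table.column_names:
        try:
            # Build a one-column table so a failure can be attributed
            # to exactly this column.
            convert(table.select([name]))
            results[name] = None
        except Exception as exc:
            # Record the failure instead of aborting the whole probe.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results
```

With the repro above, this would be called as probe_columns(arr0, cudf.DataFrame.from_arrow) and again with arr1, and the two result dicts compared.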
Expected behavior
No exceptions should be thrown.
Environment overview
cudf 0.14.1 via conda in docker (graphistry/graphistry-forge-base:latest -> source activate rapids)
Environment details
Multiple environments (Ubuntu 18 / Azure V100, ...)
Additional context
Not yet tested on 0.15 / nightly.
More fun: when going indirectly through pandas (cudf.DataFrame.from_pandas(arr0.to_pandas()), and likewise for arr1), the Arrow buffers seem to load fine.
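The pandas detour can be wrapped as a fallback so the direct Arrow path is still used when it works. This is only a sketch of that idea: from_arrow_with_fallback is a hypothetical helper, the two converters are passed in rather than hard-coded, and catching a bare Exception is an assumption made because the exact failure types are not pinned down here.

```python
def from_arrow_with_fallback(table, direct, via_pandas):
    """Convert an Arrow table, falling back to a pandas round trip.

    Assumptions (not from the original report): `table` has a
    `to_pandas()` method like pyarrow.Table, `direct` converts an Arrow
    table (e.g. cudf.DataFrame.from_arrow), and `via_pandas` converts a
    pandas DataFrame (e.g. cudf.DataFrame.from_pandas).
    """
    try:
        # Preferred path: hand the Arrow table over directly.
        return direct(table)
    except Exception:
        # Workaround path: materialize in pandas first, then convert.
        return via_pandas(table.to_pandas())
```

For the repro above, the call would look like from_arrow_with_fallback(arr1, cudf.DataFrame.from_arrow, cudf.DataFrame.from_pandas). The round trip copies the data through host memory, so it is a stopgap rather than a fix.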
I tested the reproduction code from https://github.com/rapidsai/cudf/issues/5898#issue-675972750 with cudf 0.15 on CUDA 11.0, and it does not throw any exceptions.
In [2]: gdf2
Out[2]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>
In [3]: gdf1
Out[3]:
  ok1 ok2   ok3
0   a   x     x
1   b      <NA>
2   c      <NA>
@Salonijain27 looks like you didn't run the commented-out part, which is what causes the errors. Could you uncomment the bad1 / bad2 lines and report back here? cc @rgsl888prabhu as you've been poking around this area.
@kkraus14 I tried with the commented-out lines enabled and it works as expected; this is on 0.16.
>>> import cudf
>>> import cudf, pandas as pd, pyarrow as pa
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
... 'ok1': ['a', 'b', 'c'],
... 'ok2': ['x', '', ''],
... 'ok3': ['x', None, None],
... 'bad1': [None, None, None],
... 'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
@rgsl888prabhu does that include the recent Arrow changes that you've been plumbing into Cython / Python? I'm wondering whether this is fixed as of 0.15.
This is without my Cython plumbing, so most likely this is already fixed in 0.15.
@Salonijain27 any chance you could give this a shot with the latest 0.15 nightlies and report back?
In the 0.15 nightly:
>>> import cudf
>>> cudf.__version__
'0.15.0a+4742.gb639039fc'
>>> import cudf, pandas as pd, pyarrow as pa
>>>
>>> file_path = './arr0.arrow'
>>> df = pd.DataFrame({
... 'ok1': ['a', 'b', 'c'],
... 'ok2': ['x', '', ''],
... 'ok3': ['x', None, None],
... 'bad1': [None, None, None],
... 'bad2': ['', '', '']
... })
>>> df['bad3'] = None
>>> arr0 = pa.Table.from_pandas(df, preserve_index=False)
>>>
>>> writer = pa.RecordBatchFileWriter(file_path, arr0.schema)
>>> writer.write(arr0)
>>> writer.close()
>>>
>>> # Test pa reads file without exn
>>> reader = pa.ipc.open_file(file_path)
>>> arr1 = reader.read_all()
>>> arr1.to_pandas()
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  None       None
1   b      None  None       None
2   c      None  None       None
>>>
>>> # Test cudf reads orig buffer without exn
>>> gdf1 = cudf.DataFrame.from_arrow(arr0)
>>>
>>> # Test cudf reads file without exn
>>> gdf2 = cudf.DataFrame.from_arrow(arr1)
>>> gdf1
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
>>> gdf2
  ok1 ok2   ok3  bad1 bad2  bad3
0   a   x     x  <NA>       <NA>
1   b      <NA>  <NA>       <NA>
2   c      <NA>  <NA>       <NA>
Great, looks like this is fixed as of latest 0.15, so closing.