Cudf: [BUG] Read ORC with cudf Fails

Created on 11 Jul 2019 · 4 Comments · Source: rapidsai/cudf

Reading an ORC file fails with cudf but succeeds with pyarrow.

>>> df = cudf.read_orc('data/000000_0')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/io/orc.py", line 45, in read_orc
    filepath_or_buffer, columns, stripe, skip_rows, num_rows, use_index
  File "cudf/bindings/orc.pyx", line 25, in cudf.bindings.orc.cpp_read_orc
  File "cudf/bindings/orc.pyx", line 86, in cudf.bindings.orc.cpp_read_orc
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/column.py", line 130, in from_mem_views
    return columnops.build_column(data_buf, data_mem.dtype, mask=mask)
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/columnops.py", line 244, in build_column
    data=buffer, dtype=dtype, mask=mask, name=name
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/numerical.py", line 42, in __init__
    super(NumericalColumn, self).__init__(**kwargs)
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/columnops.py", line 36, in __init__
    super(TypedColumnBase, self).__init__(**kwargs)
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/column.py", line 158, in __init__
    self._update_null_count(null_count)
  File "/conda/lib/python3.7/site-packages/cudf-0.9.0a0+1073.g73e0849-py3.7-linux-x86_64.egg/cudf/dataframe/column.py", line 177, in _update_null_count
    nnz = count_nonzero_mask(self._mask.mem, size=len(self))
  File "cudf/bindings/cudf_cpp.pyx", line 489, in cudf.bindings.cudf_cpp.count_nonzero_mask
  File "cudf/bindings/cudf_cpp.pyx", line 499, in cudf.bindings.cudf_cpp.count_nonzero_mask
RuntimeError: CUDA error encountered at: /cudf/cpp/src/bitmask/legacy/bitmask_ops.cu:147: 9 cudaErrorInvalidConfiguration invalid configuration argument

Reading the same file with pyarrow works:

>>> import pyarrow.orc as orc                                                 
>>> with open('data/000000_0', 'rb') as file:                                 
...     data = orc.ORCFile(file)                                              
...     df = data.read().to_pandas()                                          
...                                                                           
>>> df                                                                        
      _col0  _col1                                                            
0         0      1                                                            
1  65975644     12                                                            
2   1251275     13                                                            
3  69856449     11

bug cuIO

mlahir1

All 4 comments

@mlahir1: can we use this file as part of automated regression testing? The failure may be because the file was written with snappy compression but contains no compressed data blocks: the data is too small to gain from compression, so the blocks are actually stored uncompressed.

OlivierNV on 11 Jul 2019
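To make the "snappy file with no compressed blocks" case concrete, here is a minimal sketch of how such blocks can be told apart. It relies on the ORC specification's 3-byte chunk header, which stores `length * 2 + isOriginal` as a little-endian value; a set low bit means the chunk is stored uncompressed. The function names `parse_chunk_header` and `count_chunks` are hypothetical helpers for illustration, not part of cudf or pyarrow, and the input is assumed to be the raw bytes of a single compressed stream.

```python
def parse_chunk_header(header: bytes):
    """Decode a 3-byte ORC compression chunk header.

    Returns (length, is_original). is_original=True means the chunk body
    is stored as-is, even though the file declares a compression codec.
    """
    if len(header) != 3:
        raise ValueError("ORC chunk header must be exactly 3 bytes")
    value = int.from_bytes(header, "little")
    return value >> 1, bool(value & 1)


def count_chunks(stream: bytes):
    """Walk one stream and return (compressed_count, original_count)."""
    compressed = original = 0
    pos = 0
    while pos < len(stream):
        length, is_original = parse_chunk_header(stream[pos:pos + 3])
        if is_original:
            original += 1
        else:
            compressed += 1
        pos += 3 + length  # skip the header and the chunk body
    return compressed, original


# A 5-byte chunk stored uncompressed: header value is 5 * 2 + 1 = 0x0b.
tiny_stream = bytes([0x0B, 0x00, 0x00]) + b"hello"
print(count_chunks(tiny_stream))
```

A stream for which the first count is zero is the situation described above: the decompression step has nothing to do.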

This is a bit odd: it looks like the decoding went fine, but when we get to the point of initializing the column and counting the nonzero mask bits in count_nonzero_mask (which, by the way, is entirely redundant since the parquet/orc readers already initialize the null_count field), cudaGetLastError() returns 9 (cudaErrorInvalidConfiguration). A dummy call to cudaGetLastError() clears the error state, and the output is then correct. Will dig a bit deeper; this could be related to the __launch_bounds__ use in ORC.

OlivierNV on 11 Jul 2019

Ah, yup, that was indeed due to the number of compressed blocks being zero: we launch a kernel with a grid size of 0x0, which is harmless, but does result in cudaGetLastError() returning 9.

OlivierNV on 11 Jul 2019
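The mechanism can be illustrated with a toy model in plain Python. This is not the CUDA API and not cudf's code: `FakeRuntime`, `decompress_blocks`, and the simplified `count_nonzero_mask` below are hypothetical stand-ins. The sketch only assumes what the thread describes: a launch with an empty grid records error 9, the error stays pending until something reads it, and the next unrelated error check reports it.

```python
CUDA_SUCCESS = 0
CUDA_ERROR_INVALID_CONFIGURATION = 9


class FakeRuntime:
    """Toy stand-in for the runtime's pending 'last error' slot."""

    def __init__(self):
        self._last_error = CUDA_SUCCESS
        self.launches = 0

    def launch(self, grid_size, block_size):
        # An empty grid does no work, but it records an error.
        if grid_size <= 0 or block_size <= 0:
            self._last_error = CUDA_ERROR_INVALID_CONFIGURATION
            return
        self.launches += 1

    def get_last_error(self):
        # Reading the error also resets it to success.
        err, self._last_error = self._last_error, CUDA_SUCCESS
        return err


def decompress_blocks(rt, num_blocks, guarded):
    """Hypothetical reader step: one grid block per compressed block."""
    if guarded and num_blocks == 0:
        return  # nothing to decompress, so skip the launch entirely
    rt.launch(grid_size=num_blocks, block_size=128)


def count_nonzero_mask(rt):
    """Hypothetical later step that checks the error state after its launch."""
    rt.launch(grid_size=1, block_size=256)
    err = rt.get_last_error()
    if err != CUDA_SUCCESS:
        raise RuntimeError(f"CUDA error encountered: {err}")


rt = FakeRuntime()
decompress_blocks(rt, num_blocks=0, guarded=True)
count_nonzero_mask(rt)  # passes: no stale error was left behind
```

With `guarded=False` the same sequence raises in `count_nonzero_mask`, even though that step did nothing wrong, which matches the traceback in the report. Skipping the launch when there are zero blocks is one plausible fix; the thread does not say which fix was actually applied.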

@OlivierNV yes, you can use the file for your testing.

mlahir1 on 11 Jul 2019
