Cudf: [BUG] to_orc fails if one of the columns is a string column

Created on 3 Oct 2019 · 3 Comments · Source: rapidsai/cudf

Describe the bug
to_orc works if the DataFrame has only string columns. It segfaults if the DataFrame mixes dtypes and one of the columns is a string column.

Steps/Code to reproduce bug

import cudf

# One string column plus one int column
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': str, 'a': int}, seed=1)
df.to_orc("file.orc")  # fails with a segfault

# Only string columns
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': str, 'a': str}, seed=1)
df.to_orc("file.orc")  # works fine

# Only numeric columns
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': int, 'a': float, 'b': int}, seed=1)
df.to_orc("file.orc")  # works fine

Expected behavior
No error

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: Docker
- If method of install is [Docker], provide docker pull & docker run commands used
  
  docker dev nightly at 6a7d06a50e1cef2d

Labels: bug, cuIO

ayushdg

All 3 comments

Looks like an assert is failing. It's expected that the column at this point should be a string.

j-ieong on 3 Oct 2019

> Looks like an assert is failing. It's expected that the column at this point should be a string.

@j-ieong For my own curiosity can you point me to the failing assert?

I also tried removing the branched logic for string columns, and everything works as expected. I hadn't looked at the libcudf side of the ORC writer while reviewing, and it seems that it handles GDF_STRING_CATEGORY as expected and writes the appropriate strings. So I'm okay with removing the branched logic in Cython for column_view_from_string_column.

ayushdg on 4 Oct 2019

> Looks like an assert is failing. It's expected that the column at this point should be a string.

> @j-ieong For my own curiosity can you point me to the failing assert?

>>> import cudf
>>> df = cudf.datasets.randomdata(nrows=10, dtypes={'c':str,'a':int},seed = 1)
>>> df.to_orc("/home/jieong/Downloads/file.orc")
python: ../src/io/orc/orc_writer_impl.cu:141: auto cudf::io::orc::orc_column::host_dict_chunk(size_t): Assertion `col->dtype == GDF_STRING || col->dtype == GDF_STRING_CATEGORY' failed.
Aborted (core dumped)

You need a debug build of libcudf to see the assert message :)

> I also tried by removing the branched logic for string columns and everything works as expected. I hadn't looked at the libcudf side of orc while reviewing and it seems that it handles GDF_STRING_CATEGORY as expected and writes the appropriate strings. So I'm okay with removing the branched logic in Cython for column_view_from_string_column

Found the problem on the libcudf side. build_dictionaries() should index through str_col_ids to access the string columns.

j-ieong on 4 Oct 2019
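
For readers following along, the indexing mistake described in the last comment can be modeled in a few lines of plain Python. This is only an illustrative sketch, not the libcudf code: the Column tuple and both build_dictionaries_* functions are hypothetical stand-ins, and it assumes the failing frame stores its int column ahead of its string column.

```python
from collections import namedtuple

# Hypothetical stand-in for a libcudf column; only the dtype matters here.
Column = namedtuple("Column", ["name", "dtype"])


def string_column_ids(columns):
    """Positions of the string columns within the full column list."""
    return [i for i, col in enumerate(columns) if col.dtype == "str"]


def build_dictionaries_buggy(columns, str_col_ids):
    """Indexes the full column list with the loop counter (the suspected bug)."""
    picked = []
    for i in range(len(str_col_ids)):
        col = columns[i]  # wrong: i counts string columns, not all columns
        assert col.dtype == "str", f"column {col.name!r} is not a string column"
        picked.append(col.name)
    return picked


def build_dictionaries_fixed(columns, str_col_ids):
    """Maps the loop counter through str_col_ids before indexing."""
    picked = []
    for i in range(len(str_col_ids)):
        col = columns[str_col_ids[i]]  # right: translate to the real position
        assert col.dtype == "str", f"column {col.name!r} is not a string column"
        picked.append(col.name)
    return picked


# The three shapes from the report (column order is an assumption).
mixed = [Column("a", "int"), Column("c", "str")]
all_str = [Column("a", "str"), Column("c", "str")]
no_str = [Column("a", "float"), Column("b", "int"), Column("c", "int")]
```

In this model the buggy version only trips its assert on the mixed frame. With all-string columns the two index spaces coincide, and with no string columns the loop never runs, which is consistent with the three cases in the report.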
