Cudf: [BUG] to_orc fails if one of the columns is a string column

Created on 3 Oct 2019 · 3 Comments · Source: rapidsai/cudf

Describe the bug
to_orc works if the DataFrame has only string columns, but it segfaults if the DataFrame mixes dtypes and one of the columns is a string column.

Steps/Code to reproduce bug

import cudf

# Mixed dtypes, one of them a string column
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': str, 'a': int}, seed=1)
df.to_orc("file.orc")  # fails with a segfault

# Only string columns
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': str, 'a': str}, seed=1)
df.to_orc("file.orc")  # works fine

# Only numeric columns
df = cudf.datasets.randomdata(nrows=10, dtypes={'c': int, 'a': float, 'b': int}, seed=1)
df.to_orc("file.orc")  # works fine
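Because the failing case takes down the whole interpreter, it can be handy to run each case in a child process and only look at how it exited. The following is a minimal sketch, not part of the original report: the helper names run_isolated and report_cases are made up for illustration, and it assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys

# Template for one repro case; only the dtypes mapping changes between cases.
SNIPPET = """
import os, tempfile, cudf
df = cudf.datasets.randomdata(nrows=10, dtypes={dtypes}, seed=1)
df.to_orc(os.path.join(tempfile.mkdtemp(), "file.orc"))
"""

# The three cases from the report, written as source text for the child process.
CASES = {
    "mixed str/int": "{'c': str, 'a': int}",
    "all str": "{'c': str, 'a': str}",
    "all numeric": "{'c': int, 'a': float, 'b': int}",
}


def run_isolated(snippet, timeout=120):
    """Run a Python snippet in a child interpreter and describe how it exited."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode == 0:
        status = "ok"
    elif proc.returncode < 0:
        # POSIX only: the child was terminated by signal number -returncode
        # (for example SIGSEGV for a segfault, SIGABRT for a failed assert).
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = "signal {}".format(-proc.returncode)
        status = "crashed ({})".format(name)
    else:
        status = "failed (exit code {})".format(proc.returncode)
    return status, proc.stderr


def report_cases():
    """Run every repro case in isolation and print one status line per case."""
    for label, dtypes in CASES.items():
        status, _ = run_isolated(SNIPPET.format(dtypes=dtypes))
        print("{}: {}".format(label, status))
```

Calling report_cases() on a machine with cuDF installed prints one line per case, so a crash in the mixed-dtype case does not stop the other two from running.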

Expected behavior
No error

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker

    • If method of install is [Docker], provide docker pull & docker run commands used

      docker dev nightly at 6a7d06a50e1cef2d

bug cuIO

All 3 comments

Looks like an assert is failing. It's expected that the column at this point should be a string.

> Looks like an assert is failing. It's expected that the column at this point should be a string.

@j-ieong For my own curiosity can you point me to the failing assert?

I also tried removing the branched logic for string columns, and everything works as expected. I hadn't looked at the libcudf side of ORC while reviewing, but it seems to handle GDF_STRING_CATEGORY as expected and to write the appropriate strings. So I'm okay with removing the branched logic in Cython for column_view_from_string_column.

> Looks like an assert is failing. It's expected that the column at this point should be a string.
>
> @j-ieong For my own curiosity can you point me to the failing assert?

>>> import cudf
>>> df = cudf.datasets.randomdata(nrows=10, dtypes={'c':str,'a':int},seed = 1)
>>> df.to_orc("/home/jieong/Downloads/file.orc")
python: ../src/io/orc/orc_writer_impl.cu:141: auto cudf::io::orc::orc_column::host_dict_chunk(size_t): Assertion `col->dtype == GDF_STRING || col->dtype == GDF_STRING_CATEGORY' failed.
Aborted (core dumped)

You would need to use a debug build of libcudf to see the assert message :)

> I also tried removing the branched logic for string columns, and everything works as expected. I hadn't looked at the libcudf side of ORC while reviewing, but it seems to handle GDF_STRING_CATEGORY as expected and to write the appropriate strings. So I'm okay with removing the branched logic in Cython for column_view_from_string_column.

Found the problem on the libcudf side: build_dictionaries() should index through str_col_ids to access each string column.
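For readers who want to see the class of bug described here, below is a toy Python model. It is not the libcudf code: Column, string_column_ids, dict_columns_buggy and dict_columns_fixed are hypothetical names, and the column order in the example is an assumption. It only illustrates what happens when a loop counter over the string columns is used directly as a column index instead of being mapped through a list of string column ids.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    dtype: str  # "str", "int" or "float" in this toy model


def string_column_ids(columns):
    """Positions (within the full column list) of the string columns."""
    return [i for i, col in enumerate(columns) if col.dtype == "str"]


def dict_columns_buggy(columns, str_col_ids):
    # Bug pattern: the k-th string column is looked up as columns[k],
    # which is only right when all string columns come first.
    return [columns[k] for k in range(len(str_col_ids))]


def dict_columns_fixed(columns, str_col_ids):
    # Fix pattern: map the k-th string column through str_col_ids.
    return [columns[col_id] for col_id in str_col_ids]


def check_all_strings(cols):
    """Stand-in for the dtype assert shown in the trace above."""
    for col in cols:
        assert col.dtype == "str", "column {!r} has dtype {}".format(col.name, col.dtype)


if __name__ == "__main__":
    mixed = [Column("a", "int"), Column("c", "str")]  # assumed column order
    ids = string_column_ids(mixed)

    check_all_strings(dict_columns_fixed(mixed, ids))  # passes
    try:
        check_all_strings(dict_columns_buggy(mixed, ids))
    except AssertionError as exc:
        print("buggy lookup trips the assert:", exc)
```

In this model the buggy lookup goes unnoticed when every column is a string (positions and ids coincide) and when no column is a string (the loop never runs), which would be consistent with the three cases in the report.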
