Describe the bug
to_orc works if the df has only string columns, but it segfaults if the df has mixed dtypes and one of them is a string.
Steps/Code to reproduce bug
import cudf
df = cudf.datasets.randomdata(nrows=10, dtypes={'c':str,'a':int},seed = 1)
df.to_orc("file.orc") # Fails with a segfault
df = cudf.datasets.randomdata(nrows=10, dtypes={'c':str,'a':str},seed = 1)
df.to_orc("file.orc") # works fine
df = cudf.datasets.randomdata(nrows=10, dtypes={'c':int,'a':float,'b':int},seed = 1)
df.to_orc("file.orc") # works fine
Expected behavior
No error
Environment overview (please complete the following information)
docker pull & docker run commands used: docker dev nightly at 6a7d06a50e1cef2d
Looks like an assert is failing. It's expected that the column at this point should be a string.
@j-ieong For my own curiosity can you point me to the failing assert?
I also tried removing the branched logic for string columns, and everything works as expected. I hadn't looked at the libcudf side of orc while reviewing, and it seems that it handles GDF_STRING_CATEGORY as expected and writes the appropriate strings. So I'm okay with removing the branched logic in Cython for column_view_from_string_column.
>>> import cudf
>>> df = cudf.datasets.randomdata(nrows=10, dtypes={'c':str,'a':int},seed = 1)
>>> df.to_orc("/home/jieong/Downloads/file.orc")
python: ../src/io/orc/orc_writer_impl.cu:141: auto cudf::io::orc::orc_column::host_dict_chunk(size_t): Assertion `col->dtype == GDF_STRING || col->dtype == GDF_STRING_CATEGORY' failed.
Aborted (core dumped)
You would need to use debug libcudf to see the assert print :)
Found the problem on the libcudf side. build_dictionaries() should be indexing using str_col_ids to access the string column.
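For anyone following along, here is a rough Python model of that kind of indexing mistake. It is only a sketch, not the libcudf code: the function names and the list-of-dtypes layout are hypothetical, and only the name str_col_ids comes from the comment above. It assumes the string column is not the first column in the table, so the string-column ordinal and the real column index differ.

```python
# Hypothetical model of the indexing mistake; this is NOT the libcudf source.
# Only the name str_col_ids is taken from the thread.

def string_column_ids(dtypes):
    """Return the positions of the string columns within the full column list."""
    return [i for i, dtype in enumerate(dtypes) if dtype == "str"]


def visit_buggy(dtypes):
    """Index the full column list with the string-column ordinal (the suspected bug)."""
    str_col_ids = string_column_ids(dtypes)
    return [dtypes[i] for i in range(len(str_col_ids))]


def visit_fixed(dtypes):
    """Map each ordinal through str_col_ids before indexing the full column list."""
    str_col_ids = string_column_ids(dtypes)
    return [dtypes[str_col_ids[i]] for i in range(len(str_col_ids))]


mixed = ["int", "str"]
print(visit_buggy(mixed))  # ['int']  a non-string column reaches the dictionary code
print(visit_fixed(mixed))  # ['str']
```

In this model an all-string table gives the same result either way, because the ordinals and the column indices coincide, and a table with no string columns never enters the loop. That would be consistent with the three cases in the report, under the layout assumption stated above.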