Cudf: [BUG] pyarrow.lib.ArrowInvalid: Offset invariant failure — in GPU-acc parquet writer

Created on 18 Mar 2020 · 6 Comments · Source: rapidsai/cudf

I’m trying to use the partition_cols option that was recently merged into the GPU-accelerated parquet writer.

I sometimes get an error like this one:

File "<ipython-input-5-1fe22af53ddf>", line 11, in gfn_telemetry_passthrough
    df.to_parquet(filename, partition_cols=["HTTP_USER_AGENT"])
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
    pq.to_parquet(self, path, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 185, in to_parquet
    **kwargs,
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 103, in write_to_dataset
    write_df.to_parquet(full_path, index=preserve_index, **kwargs)
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
    pq.to_parquet(self, path, *args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 198, in to_parquet
    df, path, index, compression=compression, statistics=statistics
  File "cudf/_libxx/parquet.pyx", line 156, in cudf._libxx.parquet.write_parquet
  File "cudf/_libxx/parquet.pyx", line 195, in cudf._libxx.parquet.write_parquet
  File "cudf/_libxx/parquet.pyx", line 46, in cudf._libxx.parquet.generate_pandas_metadata
  File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/column/string.py", line 2143, in to_arrow
    len(self), obuf, sbuf, nbuf, self.null_count
  File "pyarrow/array.pxi", line 1363, in pyarrow.lib.StringArray.from_buffers
    return Array.from_buffers(utf8(), length,
  File "pyarrow/array.pxi", line 724, in pyarrow.lib.Array.from_buffers
    result.validate()
  File "pyarrow/array.pxi", line 976, in pyarrow.lib.Array.validate
    check_status(self.ap.Validate())
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Offset invariant failure at: 32 inconsistent value_offsets for null slot662!=626
distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client

I'm not sure what is causing this. Can someone please look into it?

bug cuIO libcudf

All 6 comments

cc @davidwendt this is similar to other issues we've seen where null elements are marked with a nonzero number of bytes according to the offsets column.

pinged @chinmaychandak offline for a reproducer

Just copy-pasting this out here, I'm doing:

df = cudf.read_json(...)
df.to_parquet(partition_cols=[...])

Note: it's erroring before it actually reaches the parquet writer, while converting the cuDF DataFrame to a pyarrow.Table. Specifically, the failure is in handling string columns with nulls: in cuDF it looks like we somehow end up with nulls that span a nonzero number of bytes. cc @OlivierNV as well for the read_json path.
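To make the suspected condition concrete, here is a minimal pure-Python sketch of the kind of check the error message seems to describe: for every null row, the two adjacent offsets should be equal, so the row spans zero bytes. The function name find_nonempty_null_slots and the sample data are made up for illustration; this is neither cuDF nor PyArrow code.

```python
from typing import List, Sequence, Tuple


def find_nonempty_null_slots(
    offsets: Sequence[int], valid: Sequence[bool]
) -> List[Tuple[int, int, int]]:
    """Return (slot, start, end) for every null slot that spans more than zero bytes.

    Hypothetical helper for illustration only.
    """
    if len(offsets) != len(valid) + 1:
        raise ValueError("offsets must have exactly one more entry than valid")
    bad = []
    for i, is_valid in enumerate(valid):
        start, end = offsets[i], offsets[i + 1]
        # A null row should not own any bytes in the character buffer.
        if not is_valid and end != start:
            bad.append((i, start, end))
    return bad


# Three rows: "ab", null, "cde".
valid = [True, False, True]
good_offsets = [0, 2, 2, 5]  # the null row spans zero bytes
bad_offsets = [0, 2, 5, 8]   # the null row claims three bytes

print(find_nonempty_null_slots(good_offsets, valid))  # []
print(find_nonempty_null_slots(bad_offsets, valid))   # [(1, 2, 5)]
```

Running a check like this on the host copies of a string column's offsets and null mask would be one way to confirm whether read_json produced such rows.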

What path does to_arrow() go through on the libcudf C++ side? (Or does this only involve Python?)

This only involves Python. Basically, we copy the StringColumn's existing Buffers to the host in a layout that PyArrow can understand, then ask Arrow to create an Array from those host buffers.
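For anyone unfamiliar with that layout, here is a rough pure-Python sketch of the three host buffers involved. The names obuf, sbuf and nbuf mirror the arguments visible in the traceback above; build_string_buffers is a hypothetical helper, and int32 little-endian offsets plus a least-significant-bit-first validity bitmap are assumptions based on the Arrow string layout, not a copy of what cuDF does internally.

```python
import struct
from typing import List, Optional, Tuple


def build_string_buffers(
    values: List[Optional[str]],
) -> Tuple[bytes, bytes, bytes, int]:
    """Pack strings into (offsets, chars, validity bitmap, null_count).

    Hypothetical helper for illustration only.
    """
    offsets = [0]
    data = bytearray()
    bitmap = bytearray((len(values) + 7) // 8)
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            # Null row: no bytes appended, so its offsets stay equal.
            null_count += 1
        else:
            data += value.encode("utf-8")
            bitmap[i // 8] |= 1 << (i % 8)  # mark the row as valid
        offsets.append(len(data))
    obuf = struct.pack(f"<{len(offsets)}i", *offsets)
    return obuf, bytes(data), bytes(bitmap), null_count


obuf, sbuf, nbuf, null_count = build_string_buffers(["ab", None, "cde"])
print(struct.unpack("<4i", obuf))  # (0, 2, 2, 5)
print(sbuf)                        # b'abcde'
print(nbuf)                        # b'\x05'
print(null_count)                  # 1
```

Buffers built this way always give null rows a zero-byte span, which is the property the validation in the traceback is complaining about.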

Not seeing this anymore! Something changed in cudf for the better since yesterday :)
