I’m trying to use the partition_cols support that was recently merged into the GPU-accelerated parquet writer.
I sometimes get an error like this:
File "<ipython-input-5-1fe22af53ddf>", line 11, in gfn_telemetry_passthrough
df.to_parquet(filename, partition_cols=["HTTP_USER_AGENT"])
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
pq.to_parquet(self, path, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 185, in to_parquet
**kwargs,
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 103, in write_to_dataset
write_df.to_parquet(full_path, index=preserve_index, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
pq.to_parquet(self, path, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 198, in to_parquet
df, path, index, compression=compression, statistics=statistics
File "cudf/_libxx/parquet.pyx", line 156, in cudf._libxx.parquet.write_parquet
File "cudf/_libxx/parquet.pyx", line 195, in cudf._libxx.parquet.write_parquet
File "cudf/_libxx/parquet.pyx", line 46, in cudf._libxx.parquet.generate_pandas_metadata
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/column/string.py", line 2143, in to_arrow
len(self), obuf, sbuf, nbuf, self.null_count
File "pyarrow/array.pxi", line 1363, in pyarrow.lib.StringArray.from_buffers
return Array.from_buffers(utf8(), length,
File "pyarrow/array.pxi", line 724, in pyarrow.lib.Array.from_buffers
result.validate()
File "pyarrow/array.pxi", line 976, in pyarrow.lib.Array.validate
check_status(self.ap.Validate())
File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Offset invariant failure at: 32 inconsistent value_offsets for null slot 662!=626
distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
I'm not sure what's causing this. Can someone please look into it?
cc @davidwendt this is similar to other issues we've seen where null elements are marked with a nonzero number of bytes in the offsets column
pinged @chinmaychandak offline for a reproducer
Just copy-pasting this here; I'm doing:
df = cudf.read_json(...)
df.to_parquet(partition_cols=[...])
Note: it's erroring before it actually reaches the parquet writer, while converting the cuDF DataFrame to a pyarrow.Table, and specifically when handling string columns with nulls: in cuDF it looks like we're somehow ending up with null slots that span a nonzero number of bytes. cc @OlivierNV as well for the read_json path.
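For context, the invariant Arrow is enforcing here is that a null slot in a string array must contribute zero bytes: offsets[i+1] must equal offsets[i] whenever the validity mask marks slot i null. A minimal stdlib-only sketch of that check (the buffers below are made-up illustrations, not cudf's actual internals):

```python
def check_null_offsets(offsets, validity):
    """Mimic Arrow's StringArray invariant check:
    a null slot must span zero bytes in the data buffer."""
    errors = []
    for i, valid in enumerate(validity):
        if not valid and offsets[i + 1] != offsets[i]:
            errors.append(
                f"inconsistent value_offsets for null slot {i}: "
                f"{offsets[i + 1]} != {offsets[i]}"
            )
    return errors

# Well-formed: slot 1 is null and spans zero bytes (offsets 3 -> 3).
good = check_null_offsets([0, 3, 3, 5], [True, False, True])

# Broken: slot 1 is null but the offsets claim it spans 2 bytes (3 -> 5),
# which is the shape of the ArrowInvalid failure in the traceback above.
bad = check_null_offsets([0, 3, 5, 7], [True, False, True])
```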
What path does to_arrow() go through on the libcudf C++ side? (Or does this only involve Python?)
This only involves Python: we copy the existing Buffers from the StringColumn to the host in a layout PyArrow understands, then ask Arrow to create an Array from those host buffers.
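To illustrate what those host buffers describe: an Arrow string array is a validity mask, an offsets buffer of length n+1, and one contiguous UTF-8 data buffer. A stdlib-only sketch that decodes such a layout (names and values are illustrative, not cudf's or pyarrow's actual API):

```python
def decode_string_column(data, offsets, validity):
    """Decode an Arrow-style string column from its three host buffers:
    a UTF-8 data buffer, an offsets list (length n + 1), and a validity mask."""
    out = []
    for i, valid in enumerate(validity):
        if valid:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            # A null slot is skipped entirely; per the invariant,
            # offsets[i + 1] should equal offsets[i] here.
            out.append(None)
    return out

data = b"foobar"
values = decode_string_column(data, [0, 3, 3, 6], [True, False, True])
# values == ["foo", None, "bar"]
```

If the offsets for a null slot are nonzero-width, Arrow's validate() rejects the array even though a decoder like the one above would silently skip the stray bytes, which is why the error only surfaces at Array.from_buffers time.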
Not seeing this anymore! Something changed in cudf for the better since yesterday :)