I’m trying to use the partition_cols support that was recently merged into the GPU-accelerated parquet writer.
I sometimes get an error like this:
File "<ipython-input-5-1fe22af53ddf>", line 11, in gfn_telemetry_passthrough
df.to_parquet(filename, partition_cols=["HTTP_USER_AGENT"])
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
pq.to_parquet(self, path, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 185, in to_parquet
**kwargs,
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 103, in write_to_dataset
write_df.to_parquet(full_path, index=preserve_index, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/dataframe.py", line 3851, in to_parquet
pq.to_parquet(self, path, *args, **kwargs)
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/io/parquet.py", line 198, in to_parquet
df, path, index, compression=compression, statistics=statistics
File "cudf/_libxx/parquet.pyx", line 156, in cudf._libxx.parquet.write_parquet
File "cudf/_libxx/parquet.pyx", line 195, in cudf._libxx.parquet.write_parquet
File "cudf/_libxx/parquet.pyx", line 46, in cudf._libxx.parquet.generate_pandas_metadata
File "/home/ubuntu/anaconda3/envs/kc2/lib/python3.7/site-packages/cudf/core/column/string.py", line 2143, in to_arrow
len(self), obuf, sbuf, nbuf, self.null_count
File "pyarrow/array.pxi", line 1363, in pyarrow.lib.StringArray.from_buffers
return Array.from_buffers(utf8(), length,
File "pyarrow/array.pxi", line 724, in pyarrow.lib.Array.from_buffers
result.validate()
File "pyarrow/array.pxi", line 976, in pyarrow.lib.Array.validate
check_status(self.ap.Validate())
File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Offset invariant failure at: 32 inconsistent value_offsets for null slot 662!=626
distributed.client - ERROR - Failed to reconnect to scheduler after 10.00 seconds, closing client
I'm not sure what's causing this. Can someone please look into it?
cc @davidwendt this is similar to other issues we've seen where null elements are marked with a nonzero number of bytes in the offsets column
pinged @chinmaychandak offline for a reproducer
Just copy-pasting this here; I'm doing:
df = cudf.read_json(...)
df.to_parquet(partition_cols=[...])
Note: it's erroring before it actually reaches the parquet writer, while converting the cuDF DataFrame to a pyarrow.Table, and specifically when handling string columns with nulls: in cuDF it looks like we're somehow ending up with null slots that span a nonzero number of bytes. cc @OlivierNV as well for the read_json path.
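For context, the invariant Arrow is enforcing here is that a null slot in a string array must contribute zero bytes: offsets[i+1] must equal offsets[i] whenever the validity mask marks slot i null. A minimal stdlib-only sketch of that check (the buffers below are made-up illustrations, not cudf's actual internals):

```python
def check_null_offsets(offsets, validity):
    """Mimic Arrow's StringArray invariant check:
    a null slot must span zero bytes in the data buffer."""
    errors = []
    for i, valid in enumerate(validity):
        if not valid and offsets[i + 1] != offsets[i]:
            errors.append(
                f"inconsistent value_offsets for null slot {i}: "
                f"{offsets[i + 1]} != {offsets[i]}"
            )
    return errors

# Well-formed: slot 1 is null and spans zero bytes (offsets 3 -> 3).
good = check_null_offsets([0, 3, 3, 5], [True, False, True])

# Broken: slot 1 is null but the offsets claim it spans 2 bytes (3 -> 5),
# which is the shape of the ArrowInvalid failure in the traceback above.
bad = check_null_offsets([0, 3, 5, 7], [True, False, True])
```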
What path does to_arrow() go through on the libcudf C++ side? (Or does this only involve Python?)
This only involves Python: we copy the existing Buffers from the StringColumn to the host in a layout PyArrow understands, then ask Arrow to create an Array from those host buffers.
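To illustrate what those host buffers describe: an Arrow string array is a validity mask, an offsets buffer of length n+1, and one contiguous UTF-8 data buffer. A stdlib-only sketch that decodes such a layout (names and values are illustrative, not cudf's or pyarrow's actual API):

```python
def decode_string_column(data, offsets, validity):
    """Decode an Arrow-style string column from its three host buffers:
    a UTF-8 data buffer, an offsets list (length n + 1), and a validity mask."""
    out = []
    for i, valid in enumerate(validity):
        if valid:
            out.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            # A null slot is skipped entirely; per the invariant,
            # offsets[i + 1] should equal offsets[i] here.
            out.append(None)
    return out

data = b"foobar"
values = decode_string_column(data, [0, 3, 3, 6], [True, False, True])
# values == ["foo", None, "bar"]
```

If the offsets for a null slot are nonzero-width, Arrow's validate() rejects the array even though a decoder like the one above would silently skip the stray bytes, which is why the error only surfaces at Array.from_buffers time.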
Not seeing this anymore! Something changed in cudf for the better since yesterday :)