Pandas: df.to_parquet crashes on pyarrow with timedelta64[ns]

Created on 12 Feb 2020 · 4 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd

# any dataframe with a timedelta64[ns] column reproduces the crash
df = pd.DataFrame({"elapsed_time": pd.to_timedelta([1, 2], unit="s")})
df.to_parquet('/tmp/tmp.parquet.gzip', compression='gzip')

versions:

Python 3.7.6
pandas==0.25.3
pyarrow==0.15.1

Problem description

Traceback (most recent call last):
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
    **kwargs
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
    **kwargs
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/io/parquet.py", line 101, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 560, in dataframe_to_arrays
    convert_fields))
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/concurrent/futures/_base.py", line 598, in result_iterator
    yield fs.pop().result()
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/concurrent/futures/_base.py", line 428, in result
    return self.__get_result()
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 546, in convert_column
    raise e
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 540, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 207, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 74, in pyarrow.lib._ndarray_to_array
  File "pyarrow/array.pxi", line 62, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 86, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: ('Unsupported numpy type 22', 'Conversion failed for column elapsed_time with type timedelta64[ns]')

PS: Submitting an issue on pyarrow in JIRA is a complete hassle 😥

All 4 comments

you should try with pyarrow 0.16

in any event this should be reported to the pyarrow tracker

pyarrow == 0.16.* does not fix this. Since the problem sits at the intersection of pandas (NumPy) and pyarrow, the issue could reasonably be filed in either tracker, and arguably in both, because responsibility for mapping types during serialization is shared between them. I therefore respectfully ask for clarification on where that responsibility lies. The call starts in pandas (`to_parquet` is a pandas method), so this seems the natural place to file it, especially since pandas unit tests are apparently not catching the bug. It is also perfectly reasonable for pyarrow to raise a NotImplemented error, which suggests pandas is not using the pyarrow API correctly.

Traceback (most recent call last):

...

  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/core/frame.py", line 2237, in to_parquet
    **kwargs
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/io/parquet.py", line 254, in to_parquet
    **kwargs
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pandas/io/parquet.py", line 117, in write
    **kwargs
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pyarrow/parquet.py", line 1343, in write_table
    **kwargs) as writer:
  File "/opt/conda/envs/gis-dataprocessing/lib/python3.7/site-packages/pyarrow/parquet.py", line 448, in __init__
    **options)
  File "pyarrow/_parquet.pyx", line 1279, in pyarrow._parquet.ParquetWriter.__cinit__
  File "pyarrow/error.pxi", line 105, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: duration[ns]

@dazza-codes sure, pandas can add more testing,
but this is a pyarrow traceback and that's where it is handled

you should open an issue there

As the different error message with pyarrow 0.16 shows, it now fails while writing to parquet, and no longer while converting pandas to arrow (which was implemented in pyarrow 0.16).

An issue for timedelta (duration in pyarrow) support in parquet writing already exists: https://issues.apache.org/jira/browse/ARROW-6780
