Pandas: Cannot specify pickle protocol used when writing HDF5.

Created on 28 Mar 2020 · 4 Comments · Source: pandas-dev/pandas

Problem description

Unless I'm mistaken, there is currently no way to specify which pickle protocol DataFrame.to_hdf() (and thus PyTables) uses when pickling is necessary. That makes it impossible to share HDF5 data among a mix of clients running py37 and py38.

In Python 3.8, pickle protocol 5 was introduced (PEP-574). This prevents my team from supporting py38 in a system meant to support clients running a range of Python versions (from 3.6) and sharing a common distributed filesystem. We plan to eventually deprecate support for py36 and py37, but the problem is that there doesn't seem to be a way to manage the transition when such a new protocol is introduced.
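
For orientation, here is a minimal standard-library sketch that reports the protocols known to the running interpreter and reads the protocol number back from a serialized payload. The helper name `pickle_protocol_of` is hypothetical. It relies on the fact that pickles written with protocol 2 or later start with the PROTO opcode (byte 0x80) followed by the protocol number.

```python
import pickle


def pickle_protocol_of(payload: bytes) -> int:
    """Return the protocol number declared at the start of a pickle payload.

    Protocols >= 2 begin with the PROTO opcode (0x80) followed by one byte
    holding the protocol number. Protocols 0 and 1 carry no such header, so
    this hypothetical helper reports them as -1.
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    return -1


data = ["hello", "world"]

print("highest protocol in this interpreter:", pickle.HIGHEST_PROTOCOL)
print("default protocol in this interpreter:", pickle.DEFAULT_PROTOCOL)

# Pinning the protocol keeps the payload readable by older interpreters.
pinned = pickle.dumps(data, protocol=4)
# Using HIGHEST_PROTOCOL ties the payload to the writer's Python version.
newest = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print("pinned payload declares protocol:", pickle_protocol_of(pinned))
print("newest payload declares protocol:", pickle_protocol_of(newest))
```

On Python 3.8 the second payload declares protocol 5, which is exactly what a 3.6 or 3.7 reader cannot load.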

xref: this StackOverflow question.

Example

(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan  8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world'])
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
    return store.select(key, auto_close=auto_close, **kwargs)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
    return it.get_result()
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
    results = self.func(self.start, self.stop, where)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
    return s.read(start=_start, stop=_stop, where=_where, columns=columns)
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
    "block{idx}_values".format(idx=i), start=_start, stop=_stop
  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
    ret = node[0][start:stop]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
    return self.read(start, stop, step)[0]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
    outlistarr = [atom.fromarray(arr) for arr in listarr]
  File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
    return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>
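
The same failure can be reproduced in a single interpreter, without pandas or PyTables. This is only a sketch of the mechanism: it forges a payload "from the future" by overwriting the protocol byte of a valid pickle, and the helper name `load_or_explain` is hypothetical.

```python
import pickle


def load_or_explain(payload: bytes):
    """Unpickle a payload, or return a short message if the protocol is unknown."""
    try:
        return pickle.loads(payload)
    except ValueError as exc:
        return f"cannot read: {exc}"


good = pickle.dumps(["hello", "world"], protocol=2)

# Byte 0 is the PROTO opcode (0x80) and byte 1 is the protocol number.
# Bumping byte 1 past what this interpreter supports mimics a file written
# by a newer Python, as seen by an older reader.
forged = bytearray(good)
forged[1] = pickle.HIGHEST_PROTOCOL + 1
future = bytes(forged)

print(load_or_explain(good))
print(load_or_explain(future))
```

The reader rejects the payload as soon as it sees the protocol number, before looking at any of the data, which is why the file cannot be partially recovered on the older interpreter.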

Expected Behavior

df.to_hdf('foo', 'x', pickle_protocol=4)

Labels: IO HDF5

All 4 comments

You are asking for forward compatibility, which is not generally supported in pickle.
You can certainly have backward-compatible pickles, e.g. you always write in protocol 4, which can be read by an interpreter that supports 5, but the reverse is not easy or even possible.

I suppose you could add an option to support a protocol version.

I am not sure why you actually care about version 5; if you are writing pickled data, it is not very performant anyhow.

It's just better not to use pickle at all: either write with format=table, which doesn't pickle, or read/write parquet.

Oh, to be clear, I'm not directly writing to pickle, but to HDF5. Yet, in some circumstances, it seems that the operation of writing to HDF5 uses pickling. In those cases, I couldn't care less about what pickle protocol it uses, except that I want all my clients to be able to read it.

Alas, in those cases where DataFrame.to_hdf() resorts to pickling, it always uses HIGHEST_PROTOCOL. Instead, to facilitate sharing across Python versions (within reason, of course, e.g. 3.6 to 3.8 at the moment), it would be great to be able to tell Pandas and PyTables that, even when py38 is running, any pickling done by to_hdf should use a specific protocol (e.g. 4).

I agree that this would be a useful feature. Another use case: I have some older data analysis code that relies on old modules that do not support Python 3.8, but I would like to take advantage of new Python features in simulation code that outputs an HDF5 file for that analysis code to read. While the workaround in the StackOverflow question linked above works, it would be great to have this exposed in a more Pythonic way.
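
One way such a workaround could look is sketched below. The linked post is not reproduced here, so this is an assumption about the general approach rather than a quote of it. It temporarily lowers the module attribute `pickle.HIGHEST_PROTOCOL`, which only helps if the writing library looks that attribute up at call time (the thread states that to_hdf pickles with HIGHEST_PROTOCOL, but how PyTables reads it is assumed). The names `capped_pickle_protocol` and `dump_with_highest` are hypothetical, and `dump_with_highest` is a stand-in for library code, not a pandas or PyTables function.

```python
import pickle
from contextlib import contextmanager

# Remember the interpreter's real value so it can be checked after the block.
before = pickle.HIGHEST_PROTOCOL


@contextmanager
def capped_pickle_protocol(protocol: int = 4):
    """Temporarily lower pickle.HIGHEST_PROTOCOL, restoring it on exit."""
    original = pickle.HIGHEST_PROTOCOL
    pickle.HIGHEST_PROTOCOL = min(protocol, original)
    try:
        yield
    finally:
        pickle.HIGHEST_PROTOCOL = original


def dump_with_highest(obj) -> bytes:
    """Stand-in for library code that reads pickle.HIGHEST_PROTOCOL when called."""
    return pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)


with capped_pickle_protocol(4):
    capped = dump_with_highest(["hello", "world"])
    # Assumed usage with pandas (not executed here):
    #     df.to_hdf('foo', 'x')

# Byte 1 of a protocol >= 2 pickle holds the protocol number.
print("protocol used inside the block:", capped[1])
print("HIGHEST_PROTOCOL restored:", pickle.HIGHEST_PROTOCOL == before)
```

Patching a standard-library constant is fragile: code that did `from pickle import HIGHEST_PROTOCOL` before the patch, or that relies on the default protocol instead, is not affected. That fragility is part of the case for an explicit pickle_protocol argument.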

What I don't get is why HDF IO needs pickle-related stuff anyway. Isn't HDF5 a language-independent format? That's why I use it as a day-to-day format for storing and sharing data in the first place.

Any explanation is appreciated.
