It appears currently impossible (unless I'm mistaken) to specify what pickle protocol will be used by DataFrame.to_hdf() (and thus by PyTables) if pickling is necessary. That makes it impossible to share HDF5 data written and read by a mix of clients running py37 and py38.
In Python 3.8, pickle protocol 5 was introduced (PEP-574). This prevents my team from supporting py38 in a system meant to support clients running a range of Python versions (from 3.6) and sharing a common distributed filesystem. We plan to eventually deprecate support for py36 and py37, but the problem is that there doesn't seem to be a way to manage the transition when such a new protocol is introduced.
xref: this StackOverflow question.
(base) $ conda activate py38
(py38) $ python
Python 3.8.1 (default, Jan 8 2020, 22:29:32)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame(['hello', 'world'])
>>> df.to_hdf('foo', 'x')
>>> exit()
(py38) $ conda deactivate
(base) $ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_hdf('foo', 'x')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 407, in read_hdf
return store.select(key, auto_close=auto_close, **kwargs)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 782, in select
return it.get_result()
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 1639, in get_result
results = self.func(self.start, self.stop, where)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 766, in func
return s.read(start=_start, stop=_stop, where=_where, columns=columns)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 3206, in read
"block{idx}_values".format(idx=i), start=_start, stop=_stop
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/io/pytables.py", line 2737, in read_array
ret = node[0][start:stop]
File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 681, in __getitem__
return self.read(start, stop, step)[0]
File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in read
outlistarr = [atom.fromarray(arr) for arr in listarr]
File "/opt/anaconda3/lib/python3.7/site-packages/tables/vlarray.py", line 825, in <listcomp>
outlistarr = [atom.fromarray(arr) for arr in listarr]
File "/opt/anaconda3/lib/python3.7/site-packages/tables/atom.py", line 1227, in fromarray
return six.moves.cPickle.loads(array.tostring())
ValueError: unsupported pickle protocol: 5
>>>
What I would like to be able to write (a proposed argument, not something to_hdf accepts today):
df.to_hdf('foo', 'x', pickle_protocol=4)
You are asking for forward compatibility, which is not generally supported in pickle.
You can certainly have backward-compatible pickles, e.g. you always write in protocol 4, which can be read by an interpreter that supports 5, but the reverse is not easy or even possible.
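To illustrate the backward-compatibility point, here is a minimal sketch using only the standard library `pickle` module (no pandas or PyTables involved). The helper name `pickle_protocol_of` is made up for this example. It relies on the fact that pickles written with protocol 2 or later begin with a PROTO opcode (`0x80`) followed by one byte holding the protocol number.

```python
import pickle


def pickle_protocol_of(payload: bytes) -> int:
    """Return the protocol number declared in a pickle byte string.

    Protocols 2+ start with the PROTO opcode (0x80) followed by one
    byte holding the protocol number. Older protocols carry no such
    header, so they are reported as 0 here (0 and 1 are not told apart).
    """
    if len(payload) >= 2 and payload[0] == 0x80:
        return payload[1]
    return 0


data = ["hello", "world"]

# Pin the protocol explicitly instead of relying on HIGHEST_PROTOCOL.
pinned = pickle.dumps(data, protocol=4)
print(pickle_protocol_of(pinned))  # 4

# Any interpreter that supports protocol 4 (Python 3.4+) can load this.
assert pickle.loads(pinned) == data
```

A payload written this way stays readable on py36 and py37, whereas one written with protocol 5 fails there with the `ValueError` shown in the traceback above.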
I suppose you could add an option to select a protocol version.
I am not sure why you actually care about version 5; if you are writing pickles, it is not very performant anyhow.
It is just better not to use pickle at all: either write with format='table', which doesn't pickle, or read/write Parquet.
Oh, to be clear, I'm not directly writing to pickle, but to HDF5. Yet, in some circumstances, it seems that the operation of writing to HDF5 uses pickling. In those cases, I couldn't care less about what pickle protocol it uses, except that I want all my clients to be able to read it.
Alas, in those cases where DataFrame.to_hdf() resorts to pickling, it always uses HIGHEST_PROTOCOL. To make files portable across Python versions (within reason, of course, e.g. 3.6 to 3.8 at the moment), it would be great to be able to tell pandas and PyTables that, even when running on py38, any pickling done by to_hdf should use a specific protocol (e.g. 4).
I agree that this would be a useful feature. Another use case: I have some older data analysis code that relies on old modules that do not support Python 3.8, but I would like to take advantage of new Python features in the simulation code that writes the HDF5 files it reads. While the workaround in the StackOverflow post linked above works, it would be great to have this exposed in a more Pythonic way.
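For readers who want something to experiment with in the meantime, here is a sketch of one possible stopgap, which may or may not match the linked post: temporarily lowering `pickle.HIGHEST_PROTOCOL` around the write. The helper `capped_pickle_protocol` is hypothetical. It can only affect code that looks up `pickle.HIGHEST_PROTOCOL` at call time, and whether a given pandas/PyTables version does so is an assumption you would need to verify. The demonstration below uses only the standard library.

```python
import pickle
from contextlib import contextmanager


@contextmanager
def capped_pickle_protocol(protocol: int = 4):
    """Temporarily lower pickle.HIGHEST_PROTOCOL (hypothetical helper).

    This only influences code that reads pickle.HIGHEST_PROTOCOL at
    call time; it is a stopgap, not a supported API.
    """
    original = pickle.HIGHEST_PROTOCOL
    pickle.HIGHEST_PROTOCOL = min(protocol, original)
    try:
        yield
    finally:
        # Always restore the real value, even if the body raises.
        pickle.HIGHEST_PROTOCOL = original


with capped_pickle_protocol(4):
    # Stand-in for a writer that pickles with HIGHEST_PROTOCOL;
    # a call such as df.to_hdf('foo', 'x') would go here instead.
    payload = pickle.dumps(["hello", "world"], pickle.HIGHEST_PROTOCOL)

# The second byte of a protocol-2+ pickle holds the protocol number.
print(payload[1])  # 4
```

Mutating a module-level constant is fragile and not thread-safe, which is part of why an explicit option on to_hdf would be preferable.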
What I don't get is why HDF IO needs pickle-related stuff at all. Isn't HDF5 a language-independent data format? That's why I use it as a day-to-day format for storing and sharing data in the first place.
Any explanation is appreciated.