Pandas: error in to_pickle

Created on 26 Nov 2015 · 16 Comments · Source: pandas-dev/pandas

Hi,
I think there is a bug in to_pickle:

>>> data = pd.read_hdf(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')
>>> data.shape
(1006, 288095)
>>> data.to_pickle(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping2.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 994, in to_pickle
    return to_pickle(self, path)
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
    pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> pd.version.version
'0.16.2'
>>>

Compat IO Data

DonBeo

All 16 comments

What is DG.load_path exactly? The two functions process file paths slightly differently (FYI, this has been somewhat improved in 0.17.0), so please give it a try there as well.

jreback on 26 Nov 2015

DG.load_path is simply a string containing the directory where the data is stored. I'll try updating to the new version.

DonBeo on 26 Nov 2015

>>> data = pd.read_hdf(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')

>>> ... 
>>> data.shape
(1006, 288095)
>>> DG.load_path
'/Users/donbeo/Documents/datasets/'
>>> data.to_pickle(DG.load_path + 'time_series/PSI_TS_reshaped2.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
    return to_pickle(self, path)
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
    pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> pd.__version__
'0.17.1'
>>>

DonBeo on 26 Nov 2015

Is your frame really VERY wide? Why would you store it that way?

This is an OS-level error. What OS are you on? It usually indicates a path that is not accepted, typically because it has invalid characters or is too long. Try writing directly to a plain filename and see.

jreback on 27 Nov 2015

My frame is very wide: it has 1006 observations of 288,000+ variables. I assumed that reading a .pkl file would be faster, which is why I am storing it this way. I am running on a 15-inch MacBook Pro with OS X El Capitan 10.11.1.

I don't think the error is related to the file path:

>>> data = pd.read_hdf('/Users/donbeo/Documents/datasets/time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')
>>> data.to_pickle('/Users/donbeo/Documents/datasets/time_series/PSI_TS_vpkl.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
    return to_pickle(self, path)
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
    pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> data.T.to_pickle('/Users/donbeo/Documents/datasets/time_series/PSI_TS_vpkl.pkl')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
    return to_pickle(self, path)
  File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
    pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument

DonBeo on 27 Nov 2015

so run this in the debugger and see what f is at this point. I cannot reproduce this.

jreback on 29 Nov 2015
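
For anyone who wants to follow the debugging suggestion above without stepping through pandas itself: judging by the tracebacks in this thread, to_pickle in these versions appears to just open the path in 'wb' mode and call pickle.dump on the file object. A minimal sketch that runs the same two steps by hand is below, so f can be inspected directly. The small random frame and the temporary file name are stand-ins, not the data from this issue.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Stand-in frame; the frames in this thread are far larger.
small = pd.DataFrame(np.random.randn(10, 5))

# Hypothetical output location inside the system temp directory.
path = os.path.join(tempfile.gettempdir(), "to_pickle_debug.pkl")

# Same two steps the tracebacks show inside pandas/io/pickle.py.
with open(path, "wb") as f:
    print(f)  # the file object that gets passed to pickle.dump
    pickle.dump(small, f, protocol=pickle.HIGHEST_PROTOCOL)

# Read the file back to confirm the round trip works for a small frame.
restored = pd.read_pickle(path)
```

If this works for a small frame but the same steps fail for the large one, the path is unlikely to be the cause.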

I'm encountering this same issue – versions below. I have a DataFrame with 11,183,804 rows, and I'm trying to run to_pickle – see the output below. I get a similar error from to_msgpack, but not for to_csv (which created a 4 GB file).

When I ran the debugger, I got for f:
<_io.BufferedWriter name='/tmp/tmp.dat'>

What additional information would be useful?

smh_tle_df.to_pickle('/tmp/tmp.df')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-28-eecdfc26b66e> in <module>()
----> 1 smh_tle_df.to_pickle('/tmp/tmp.df')

/Users/fodonovan/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in to_pickle(self, path)
   1013         """
   1014         from pandas.io.pickle import to_pickle
-> 1015         return to_pickle(self, path)
   1016 
   1017     def to_clipboard(self, excel=None, sep=None, **kwargs):

/Users/fodonovan/anaconda3/lib/python3.5/site-packages/pandas/io/pickle.py in to_pickle(obj, path)
     12     """
     13     with open(path, 'wb') as f:
---> 14         pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
     15 
     16 

OSError: [Errno 22] Invalid argument

Python version 3.5.1 |Anaconda 2.4.1 (x86_64)| (default, Dec  7 2015, 11:24:55) 
[GCC 4.2.1 (Apple Inc. build 5577)]
pandas version 0.17.1

proinsias on 28 Jan 2016

This is not an error in pandas, but rather a problem with your path.

jreback on 28 Jan 2016

Could you give a little more information than that?

This code works:

import numpy as np
import pandas as pd
df2 = pd.DataFrame(np.random.randn(10, 5))
df2.to_pickle('/tmp/tmp.dat')

This does not, using the same 11,183,804-row DataFrame as before:

smh_tle_df.to_pickle('/tmp/tmp.dat')

What's the difference? The paths are the same.

proinsias on 28 Jan 2016

no idea, you'd have to debug it. pickle is very opaque.

jreback on 28 Jan 2016

probably not a pandas bug. check this out http://bugs.python.org/issue24658

kawochen on 28 Jan 2016

👍2

So I guess what is happening is that the file is too big, and writing it raises an unhelpful exception.

jreback on 28 Jan 2016

👍1

OSX only too

kawochen on 28 Jan 2016
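
If the linked Python issue is indeed the cause (a single write of 2 GB or more failing on OS X), one possible workaround is to pickle to bytes in memory and write the bytes out in smaller pieces. The sketch below is an illustration under that assumption, not something pandas provides: dump_in_chunks, load_in_chunks, and MAX_BYTES are hypothetical names, and the whole pickle is held in memory while it is written.

```python
import os
import pickle
import tempfile

import pandas as pd

# Largest chunk passed to a single write() call; kept just under 2 GiB.
MAX_BYTES = 2**31 - 1


def dump_in_chunks(obj, path, chunk_size=MAX_BYTES):
    """Pickle obj in memory, then write the bytes to path in pieces."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "wb") as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])


def load_in_chunks(path, chunk_size=MAX_BYTES):
    """Read path back in pieces and unpickle the reassembled bytes."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
    return pickle.loads(buf)


# Small stand-in frame; a tiny chunk size exercises the chunking logic.
frame = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
path = os.path.join(tempfile.mkdtemp(), "chunked.pkl")
dump_in_chunks(frame, path, chunk_size=16)
restored = load_in_chunks(path, chunk_size=16)
```

The file on disk is an ordinary pickle, so it can also be read back with pd.read_pickle wherever a single large read is not a problem.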

Thanks!

proinsias on 29 Jan 2016

I had the same problem, the same error for pickling a large file.
I solved the issue using this code:

import joblib

# df is the large DataFrame that to_pickle failed on
joblib.dump(df, 'df_path.pkl')

So maybe it is a pandas issue?

shadzic on 26 Oct 2017

👍2

I got the same error with pandas 0.20.3 on OS X Sierra. I think it has to do with file size: the error occurred with a 3.5 GB DataFrame, but didn't occur when I reduced it to 1.3 GB by converting floats to ints.
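
A rough way to check whether a frame is near the size where this error shows up, and to shrink it, is sketched below. It is only an approximation: in-memory size is not the same as pickled size, the 2 GiB figure is taken from the Python issue linked earlier in this thread, and the sketch downcasts float64 to float32 (which loses precision) rather than converting floats to ints as described above. memory_bytes and downcast_floats are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Single-write limit discussed in this thread; used here only as a rough guide.
TWO_GIB = 2**31


def memory_bytes(df):
    """Approximate in-memory size of the frame in bytes."""
    return int(df.memory_usage(deep=True).sum())


def downcast_floats(df):
    """Return a copy with float64 columns stored as float32 (loses precision)."""
    float_cols = df.select_dtypes(include=["float64"]).columns
    if len(float_cols) == 0:
        return df.copy()
    return df.astype({col: "float32" for col in float_cols})


# Stand-in frame; the frames reported in this thread are several GB.
frame = pd.DataFrame(np.random.randn(1000, 4), columns=list("abcd"))
smaller = downcast_floats(frame)

print(memory_bytes(frame), memory_bytes(smaller), memory_bytes(frame) < TWO_GIB)
```

Whether the precision loss is acceptable depends on the data, so this is a stopgap for getting under the limit rather than a fix for the underlying error.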