Hi,
I think there is a bug in to_pickle:
>>> data = pd.read_hdf(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')
>>> data.shape
(1006, 288095)
>>> data.to_pickle(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping2.pkl')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 994, in to_pickle
return to_pickle(self, path)
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> pd.version.version
'0.16.2'
>>>
What is DG.load_path exactly? The two functions process file paths slightly differently (FYI, this has been somewhat improved in 0.17.0), so please give it a try there as well.
DG.load_path is simply a string containing the directory where the data is stored. I'll try to update to the new version.
>>> data = pd.read_hdf(DG.load_path + 'time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')
>>> ...
>>> data.shape
(1006, 288095)
>>> DG.load_path
'/Users/donbeo/Documents/datasets/'
>>> data.to_pickle(DG.load_path + 'time_series/PSI_TS_reshaped2.pkl')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
return to_pickle(self, path)
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> pd.__version__
'0.17.1'
>>>
Is your frame really VERY wide? Why would you store it that way?
This is an OS-level error. Which OS are you on? It indicates a path that is not accepted, usually because it has invalid characters or is too long. Try writing directly to a plain filename and see.
My frame is very wide. It has 1006 observations of 288000+ variables. I assume that reading a pkl file should be faster, which is why I am storing it this way. I am running on a 15-inch MacBook Pro with OS X El Capitan 10.11.1.
I don't think the error is related to the file path:
>>> data = pd.read_hdf('/Users/donbeo/Documents/datasets/time_series/PSI_TS_after_alignment_and_reshaping.h5', 'table')
>>> data.to_pickle('/Users/donbeo/Documents/datasets/time_series/PSI_TS_vpkl.pkl')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
return to_pickle(self, path)
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
>>> data.T.to_pickle('/Users/donbeo/Documents/datasets/time_series/PSI_TS_vpkl.pkl')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/core/generic.py", line 1015, in to_pickle
return to_pickle(self, path)
File "/Users/donbeo/MyApps/phd_python/lib/python3.4/site-packages/pandas/io/pickle.py", line 14, in to_pickle
pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
So run this in the debugger and see what f is at this point. I cannot reproduce this.
I'm encountering this same issue – versions below. I have a DataFrame with 11,183,804 rows, and I'm trying to run to_pickle – see the output below. I get a similar error from to_msgpack, but not for to_csv (which created a 4 GB file).
When I ran the debugger, I got for f:
<_io.BufferedWriter name='/tmp/tmp.dat'>
What additional information would be useful?
smh_tle_df.to_pickle('/tmp/tmp.df')
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-28-eecdfc26b66e> in <module>()
----> 1 smh_tle_df.to_pickle('/tmp/tmp.df')
/Users/fodonovan/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py in to_pickle(self, path)
1013 """
1014 from pandas.io.pickle import to_pickle
-> 1015 return to_pickle(self, path)
1016
1017 def to_clipboard(self, excel=None, sep=None, **kwargs):
/Users/fodonovan/anaconda3/lib/python3.5/site-packages/pandas/io/pickle.py in to_pickle(obj, path)
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
15
16
OSError: [Errno 22] Invalid argument
Python version 3.5.1 |Anaconda 2.4.1 (x86_64)| (default, Dec 7 2015, 11:24:55)
[GCC 4.2.1 (Apple Inc. build 5577)]
pandas version 0.17.1
this is not an error in pandas, rather with your path
Could you give a little more information than that?
This code works:
df2 = pd.DataFrame(np.random.randn(10, 5))
df2.to_pickle('/tmp/tmp.dat')
This does not, using the same 11,183,804-row DataFrame as before:
smh_tle_df.to_pickle('/tmp/tmp.dat')
What's the difference? The paths are the same.
no idea, you'd have to debug it. pickle is very opaque.
probably not a pandas bug. check this out http://bugs.python.org/issue24658
So I guess what is happening is that the file is too big, and the write raises an unhelpful exception.
It affects OS X only, too.
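For anyone landing here: if the cause really is the one described in the linked Python issue (a single write of 2 GB or more failing on OS X), one possible workaround is to serialize to bytes first and then write the file in smaller pieces. Below is a minimal sketch, not tested against the DataFrames in this thread. The helper names dump_in_chunks and load_in_chunks and the chunk size are my own assumptions, not part of pandas or pickle. Note that pickle.dumps keeps the whole pickle in memory.

```python
import pickle

# Assumed chunk size: anything below 2**31 bytes per write() call.
MAX_BYTES = 2**31 - 1


def dump_in_chunks(obj, path, chunk_size=MAX_BYTES):
    """Pickle obj to path, passing at most chunk_size bytes to each write()."""
    data = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "wb") as f:
        for start in range(0, len(data), chunk_size):
            f.write(data[start:start + chunk_size])


def load_in_chunks(path, chunk_size=MAX_BYTES):
    """Read path back in pieces of at most chunk_size bytes and unpickle it."""
    buf = bytearray()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
    return pickle.loads(bytes(buf))
```

With the frame from above, that would be dump_in_chunks(smh_tle_df, '/tmp/tmp.dat') followed by load_in_chunks('/tmp/tmp.dat').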
Thanks!
I had the same problem, the same error for pickling a large file.
I solved the issue using this code:
import joblib
joblib.dump(df, 'df_path.pkl')
So maybe it's a pandas issue?
I got the same error with pandas 0.20.3 and macOS Sierra. I think it has to do with file size: the error occurred with a 3.5 GB DataFrame, but didn't occur when I reduced it to 1.3 GB by converting floats to ints.
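To check the file-size theory without writing anything to disk, the pickled size can be measured by dumping into a throwaway object that only counts bytes. This is a rough sketch: ByteCounter, pickled_size and may_hit_limit are hypothetical names I made up, and the 2 GiB threshold is an assumption based on the Python issue linked earlier in the thread.

```python
import pickle

# Assumed threshold, based on the linked Python issue (writes of 2 GB or more).
LIMIT_BYTES = 2**31


class ByteCounter:
    """Hypothetical write-only file object that counts bytes and discards them."""

    def __init__(self):
        self.total = 0

    def write(self, data):
        n = memoryview(data).nbytes
        self.total += n
        return n


def pickled_size(obj):
    """Return the number of bytes pickle.dump would write for obj."""
    counter = ByteCounter()
    pickle.dump(obj, counter, protocol=pickle.HIGHEST_PROTOCOL)
    return counter.total


def may_hit_limit(obj, limit=LIMIT_BYTES):
    """True if the pickle of obj is at least limit bytes long."""
    return pickled_size(obj) >= limit
```

Calling may_hit_limit(df) before df.to_pickle(...) should show whether a given frame is in the range where the error was reported.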