Pandas: to_pickle compression does not work with in-memory buffers

Created on 29 Apr 2019 · 12 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

from io import BytesIO
import pandas
pandas.DataFrame([[]]).to_pickle(BytesIO(), compression=None)  # works
pandas.DataFrame([[]]).to_pickle(BytesIO())
# ValueError: Unrecognized compression type: infer (regression in 0.24 from 0.23)
pandas.DataFrame([[]]).to_pickle(BytesIO(), compression='zip')
# AttributeError: 'NoneType' object has no attribute 'find' (in 0.24)
# BadZipFile: File is not a zip file (in 0.22 and before)

Problem description

#22555 is closely related, but I believe this is a different issue because the errors occur at a different place in the code.

I believe the above is an issue because

  • Although the argument is named "path" and the docstring reads "path : string, File path", the code uses the name path_or_buf in several places. I'd be happy to open a PR amending the docstring if somebody confirms that it is imprecise.
  • The code above is actually useful (I want to let the user export a dataframe from a webapp)
  • compression='infer' failing is a regression
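
Until to_pickle handles compressed in-memory buffers, the webapp use case can be approximated with the standard library alone. The following is a minimal sketch, not pandas' own implementation: the helper names pickle_to_gzip_bytes and unpickle_from_gzip_bytes are hypothetical, and the sketch assumes that a plain pickle.dump of the DataFrame is acceptable for the export.

```python
import gzip
import pickle
from io import BytesIO


def pickle_to_gzip_bytes(obj):
    """Pickle obj (for example a DataFrame) into gzip-compressed bytes, in memory."""
    buffer = BytesIO()
    # Closing the GzipFile writes the gzip trailer but leaves `buffer` open.
    with gzip.GzipFile(fileobj=buffer, mode="wb") as gz:
        pickle.dump(obj, gz, protocol=pickle.HIGHEST_PROTOCOL)
    return buffer.getvalue()


def unpickle_from_gzip_bytes(payload):
    """Inverse of pickle_to_gzip_bytes. Only unpickle data from a trusted source."""
    return pickle.loads(gzip.decompress(payload))
```

Calling pickle_to_gzip_bytes(df) would return bytes that a web framework can send as a download. Because no path is involved, nothing has to be inferred from a file name.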

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-13-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.4.1
pip: 19.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Bug IO Pickle

All 12 comments

Thanks for the report. I assume this is a byproduct of #22011 (cc @dhimmel). Investigation and PRs would certainly be welcome.

Is it correct that to_* methods are intended to work with anything that supports a buffer protocol?

Ah, I just realized that to_pickle is only documented as supporting a str argument for the path, so the fact that it worked before on a buffer was an implementation detail.

That said, most of the IO methods support buffers, so I think it should be possible to extend that here and document it accordingly.

@jreback I'm not sure I follow: #5924 is about a different method (read_pickle), and also has nothing to do with compression, whereas without compression to_pickle works.

EDIT: also, this issue is about buffers, not strings; #5924 doesn't seem to mention buffers at all.

I'm still having this error. When I add ", compression=None)" I get the following error instead:


TypeError Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in try_read(path, encoding)
165 warnings.simplefilter("ignore", Warning)
--> 166 return read_wrapper(lambda f: pkl.load(f))
167 except Exception: # noqa: E722

~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in read_wrapper(func)
147 try:
--> 148 return func(f)
149 finally:

~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in <lambda>(f)
165 warnings.simplefilter("ignore", Warning)
--> 166 return read_wrapper(lambda f: pkl.load(f))
167 except Exception: # noqa: E722

TypeError: file must have 'read' and 'readline' attributes

During handling of the above exception, another exception occurred:

AttributeError Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in try_read(path, encoding)
172 return read_wrapper(
--> 173 lambda f: pc.load(f, encoding=encoding, compat=False))
174 # compat pickle

~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in read_wrapper(func)
147 try:
--> 148 return func(f)
149 finally:

~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in <lambda>(f)
172 return read_wrapper(
--> 173 lambda f: pc.load(f, encoding=encoding, compat=False))
174 # compat pickle

~/miniconda3/lib/python3.7/site-packages/pandas/compat/pickle_compat.py in load(fh, encoding, compat, is_verbose)
219 try:
--> 220 fh.seek(0)
221 if encoding is not None:

~/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068

AttributeError: 'DataFrame' object has no attribute 'seek'

This is the error I get without adding compression=None

"---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in try_read(path, encoding)
165 warnings.simplefilter("ignore", Warning)
--> 166 return read_wrapper(lambda f: pkl.load(f))
167 except Exception: # noqa: E722

~/miniconda3/lib/python3.7/site-packages/pandas/io/pickle.py in read_wrapper(func)
145 compression=compression,
--> 146 is_text=False)
147 try:

~/miniconda3/lib/python3.7/site-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
412 msg = 'Unrecognized compression type: {}'.format(compression)
--> 413 raise ValueError(msg)
414

ValueError: Unrecognized compression type: infer"

@dqii are you using the minimal code snippet I shared above? What is your pandas version and Python version?

Sorry, on reflection I realized my error might be different. I was saving a large pandas dataframe. My pandas version is 0.24.2 and my Python version is 3.7.3. I opened a separate issue for my problem in #27029. Sorry about that!

@akhmerov I think you are correct that this is a different issue from #5924.

I agree with WillAyd: to_pickle() should accept file buffers as well. It seems that it did in pandas 0.24.2 (despite the documentation), but as of 0.25.0 it no longer does.

The original bug, to_pickle() to a buffer not working with compression='infer', appears to still be present in the current dev branch, and the fix seems to be very simple. If there is no reason it hasn't been fixed, I can provide a PR.
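
One way to picture such a fix: 'infer' can only be resolved from a file name, so for a buffer it would fall back to no compression instead of being passed on unchanged. The sketch below illustrates that idea only. resolve_compression and EXTENSION_TO_COMPRESSION are hypothetical names, not pandas' actual code, and the extension table is an assumption.

```python
import os
from io import BytesIO

# Illustrative mapping from file extension to compression name (an assumption).
EXTENSION_TO_COMPRESSION = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}


def resolve_compression(path_or_buf, compression="infer"):
    """Hypothetical helper: turn 'infer' into a concrete compression name or None."""
    if compression != "infer":
        # An explicit choice (including None) is passed through unchanged.
        return compression
    # A buffer has no file name to inspect, so fall back to no compression.
    if not isinstance(path_or_buf, (str, bytes, os.PathLike)):
        return None
    path = os.fsdecode(path_or_buf)
    for extension, name in EXTENSION_TO_COMPRESSION.items():
        if path.endswith(extension):
            return name
    return None


print(resolve_compression(BytesIO()))      # None
print(resolve_compression("frame.pkl.gz"))  # gzip
```

With a resolution step like this, the default call from the report would behave like compression=None for buffers, while an explicit compression='zip' would still need separate handling.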
