Pandas: In-memory to_csv compression

Created on 31 Aug 2018 · 11 comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

# Attempt 1
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
test = df.to_csv(compression="gzip")
type(test)
RuntimeWarning: compression has no effect when passing file-like object as input.
Out: str
# Attempt 2
from io import BytesIO
b_buf = BytesIO()
df.to_csv(b_buf, compression="gzip")
Out: TypeError: a bytes-like object is required, not 'str'

Problem description

I am trying to gzip-compress a dataframe in memory (as opposed to writing directly to a named file location). The use case for this is (I imagine) similar to the reason that to_csv now allows omitting the path in other cases to create an in-memory representation; specifically, I need to save the compressed df to a cloud location using a custom URI, and I'm temporarily keeping it in memory for that purpose.

Expected Output

I would expect the compression option to result in a compressed bytes object (similar to what the gzip library produces).
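A sketch of that expected behavior, built manually with the standard library's gzip module (the variable names are illustrative, not pandas API):

```python
import gzip

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# to_csv() with no path returns the CSV as a str; encode it and gzip it in memory.
csv_bytes = df.to_csv(index=False).encode("utf-8")
compressed = gzip.compress(csv_bytes)

# Round-trip check: decompressing recovers the original CSV text.
assert gzip.decompress(compressed) == csv_bytes
```

The `compressed` bytes can then be handed to any cloud-storage client without touching disk.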

Thank you in advance for your help!

Note: I originally saw #21227 (df.to_csv ignores compression when provided with a file handle) and thought it might have fixed this too, but it looks like it stopped just short of addressing my issue as well.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Enhancement, IO, CSV

All 11 comments

Interesting proposal! As a (temporary) workaround, could you not save to disk and then read into memory by any chance?
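The save-to-disk-then-read workaround suggested here might look like this (a sketch using the standard library's tempfile module, not code from the thread):

```python
import gzip
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

# Write a gzip-compressed CSV to a named temporary file, then read it back into memory.
fd, path = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)
try:
    df.to_csv(path, compression="gzip", index=False)
    with open(path, "rb") as f:
        compressed = f.read()  # gzip-compressed CSV bytes, now in memory
finally:
    os.remove(path)
```

This works because compression with an explicit file *path* was never affected by the file-handle bug, at the cost of a round trip through the filesystem.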

BTW, if you have ideas on how to implement in-memory compression, go for it!

Hey - thanks for the reply @gfyoung , and sorry for my delay in replying. The functions where I use this are part of a library, so temporarily saving to disk isn't ideal (can't be sure what the end-user's local environment will look like).

My thought was something like this as a workaround:

import gzip
from io import BytesIO
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
b_buf = BytesIO()
with gzip.open(b_buf, 'wb') as f:
    f.write(df.to_string().encode())
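A small variant of the workaround above: `to_string()` produces fixed-width text rather than CSV, so `to_csv()` is likely the better choice if the downstream consumer expects CSV:

```python
import gzip
from io import BytesIO

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})

b_buf = BytesIO()
# gzip.open accepts an existing file object, so the compressed stream stays in memory.
with gzip.open(b_buf, "wb") as f:
    f.write(df.to_csv(index=False).encode())

compressed = b_buf.getvalue()  # gzip-compressed CSV bytes, ready to upload
```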

@ZaxR : Gotcha. Interesting...I think it would be a solid enhancement nonetheless. Would be open to proposals for implementation.

I agree that the compression argument should take effect when a file-like object is passed. This enhancement would likely include implementing zip writing support in _get_handle, which would address the frustration I had in https://github.com/pandas-dev/pandas/pull/22011#issuecomment-406831478.

For 'gzip' compression, _get_handle() is not being called when BytesIO() is passed, so it fails at csv.writer(gzip.GzipFile(fileobj=BytesIO())) in csvs.py.

If _get_handle() is called on BytesIO(), what happens instead is csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO()))), which fails b/c GzipFile opens the buffer read-only by default. Setting the mode makes it work: csv.writer(TextIOWrapper(gzip.GzipFile(fileobj=BytesIO(), mode=mode)))

The 'bz2' compression fix is the same. 'xz' will not compress BytesIO() unless LZMACompressor is used. 'zip' has the custom workflow referenced by dhimmel, which complicates it further.
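The working gzip pattern described above can be sketched outside of pandas with the standard library alone (the explicit mode="wb" is the key difference):

```python
import csv
import gzip
import io

b_buf = io.BytesIO()

# GzipFile defaults to read mode when only fileobj is given, so mode="wb" is required.
gz = gzip.GzipFile(fileobj=b_buf, mode="wb")
text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
writer = csv.writer(text)
writer.writerow(["A", "B", "C"])
writer.writerow([1, 5, 9])
text.close()  # flushes the wrapper and closes the GzipFile, finalizing the gzip stream

# b_buf now contains a complete in-memory gzip stream of the CSV rows.
```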

There is too much logic in _get_handle, and it is called many times for reading and for writing. One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

In order to actually call _get_handle() on BytesIO(), the elif hasattr(self.path_or_buf, 'write') branch in csvs.py has to be changed so that BytesIO() doesn't end up there but StringIO() does. For Python 3 this is enough to fix it.
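Distinguishing a binary buffer from a text buffer, which that dispatch change needs, can be done without pandas internals (a hypothetical helper for illustration, not the actual patch):

```python
import io

def is_binary_buffer(buf):
    # Text streams (StringIO, open(..., "r")) derive from io.TextIOBase;
    # binary ones (BytesIO, open(..., "rb")) do not.
    return hasattr(buf, "write") and not isinstance(buf, io.TextIOBase)

# BytesIO should be routed through _get_handle; StringIO can keep the old path.
assert is_binary_buffer(io.BytesIO())
assert not is_binary_buffer(io.StringIO())
```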

For Python 2, the exception about not supporting a custom encoding gets raised in _get_handle. This is b/c CSVFormatter() sets encoding='ascii' while _get_handle expects it to be None, which effectively means 'ascii'.

This is the test code I was using:

hello = BytesIO()
test = df.to_csv(hello, compression='gzip')
print(hello.getvalue())

There is too much logic in _get_handle, and it is called many times for reading and for writing.

I agree. Especially since there is no docstring to define what the function intends to support.

One idea is for it to call _get_read_handle and _get_write_handle to split the logic. Or _get_handle_python2 and _get_handle_python3 could be an option.

I agree it may be helpful to split read/write and 2/3. However, with 2019 only a couple months away, the purging of 2 from pandas is just around the corner. It seems like any major changes should plan for a 3-only codebase (and hence benefit from the great simplification)?

@dhimmel : I agree that the implementation should be 3-oriented. If it's 2-oriented as well, great! If not, I would still write it out but just hold off on the PR until the turn of the calendar year.

@dhimmel and @gfyoung - Now that we've reached the new year/pandas release, any update on this enhancement? Would really love this capability :)

@ZaxR : Thanks for the ping! We're still in the process of releasing versions that are Python-2-compatible, so we might want to hold on this a little longer. That being said, proposals for a pure Python-3-compatible implementation would be great 👍

I was also looking for the compress functionality in order to produce base64 encoded links.

import base64

import pandas as pd

# (method from a class, shown with its decorator)
@staticmethod
def _to_base64_encoded_link(data: pd.DataFrame):
    csv = data.to_csv(index=False)
    b64 = base64.b64encode(
        csv.encode()
    ).decode()  # some strings <-> bytes conversions necessary here
    link = f'<a href="data:file/csv;base64,{b64}" download="data.csv">Download</a>'
    return link

Currently my dataframes are too big, so I would like to compress them.
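A compressed variant of that helper might look like the sketch below (gzip + base64; the MIME type, filename, and link text are placeholder choices, not from the thread):

```python
import base64
import gzip

import pandas as pd

def to_gzip_base64_link(df: pd.DataFrame) -> str:
    csv_text = df.to_csv(index=False)
    # Compress the CSV in memory before base64-encoding, shrinking the data URI.
    b64 = base64.b64encode(gzip.compress(csv_text.encode())).decode()
    return f'<a href="data:application/gzip;base64,{b64}" download="data.csv.gz">Download</a>'

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
link = to_gzip_base64_link(df)
```

The browser downloads a .csv.gz file that decompresses back to the original CSV.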

I modified some code which seems to produce the bz2 for me and appears to avoid the intermediate on-disk csv file. Does this help you guys? I'm not sure how efficient the memory allocation is or how in-memory it all is, so if there are optimizations, please advise! I'm a py newbie (first day), and this stuff seemed to be poorly documented, so thanks for making this thread and apologies for my ignorance.

This code takes a pandas df and makes a clickable link in your ipynb UI to download a bz2-compressed file

Updated 2020-05-19 davidkh1255

Housekeeping - BEGIN

import pandas as pd
import bz2
import base64
from IPython.display import HTML

Housekeeping - END

Create test pandas dataframe from example in 22555, and add D col and data

mypd = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12], "D":[90,91,92,93]})

Note: this requires a pandas df as input

def create_download_link(df, title="Download bz2 file", filename="data.bz2"):
    csv = df.to_csv(index=False)  # turn off index numbers
    bz = bz2.BZ2Compressor()  # set up the BZ2Compressor
    # b64 = base64.b64encode(csv.encode())  # to not use bz2, uncomment this and comment out the next line
    b64 = base64.b64encode(bz.compress(csv.encode()) + bz.flush())  # do not forget to flush, otherwise the output will be a zero-byte file
    payload = b64.decode()
    html = '<a download="{filename}" href="data:application/octet-stream;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

END def create_download_link

Call the function with your pandas df

create_download_link(mypd)
