I receive this error while writing a very large dataframe to file with:

df.to_csv('file.txt.gz', sep='\t', compression='gzip')
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-28-48e45479ccfb> in <module>()
----> 1 df.to_csv('file.txt.gz', sep='\t', compression='gzip')
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
1743 doublequote=doublequote,
1744 escapechar=escapechar, decimal=decimal)
-> 1745 formatter.save()
1746
1747 if path_or_buf is None:
~/.pyenv/versions/3.6.5/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
156 f.close()
157 with open(self.path_or_buf, 'r') as f:
--> 158 data = f.read()
159 f, handles = _get_handle(self.path_or_buf, self.mode,
160 encoding=encoding,
OSError: [Errno 22] Invalid argument
I cannot disclose the data, but running df.info() returns the following:
<class 'pandas.core.frame.DataFrame'>
Index: 10319 entries, Sample1 to Sample10319
Columns: 33707 entries, A1BG to ZZZ3
dtypes: float64(33707)
memory usage: 2.6+ GB
Looking at the disk, the dataframe appears to have been dumped incompletely and without compression.
I am working with 16 GB of RAM on macOS 10.13.4 (17E202).
Output of pd.show_versions():
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
I can confirm that to_csv() fails at the compression step: I dumped the dataframe without compression and ended up with the same file as the one produced by the failing compressed call - the md5 sums are identical. So the file left on disk is the uncompressed dataframe.
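For reference, a minimal sketch of how the two dumps can be compared without loading them into memory (the file names are placeholders, not taken from the report above):

import hashlib

def md5sum(path, blocksize=2**20):
    # Hash the file in 1 MiB blocks so the whole file never sits in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

# Hypothetical names: the output of the failing compressed call
# and the output of a plain uncompressed to_csv().
print(md5sum('file.txt.gz') == md5sum('file.txt'))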
I am able to compress the file with gzip once it has been written to disk, but in this case the uncompressed file is around 3 GB.
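As a stopgap along those lines, the after-the-fact compression can also be done from Python, streaming the data in fixed-size chunks so that no single read or write spans the whole multi-GB file (paths are illustrative; this is a sketch, not the pandas code path):

import gzip
import shutil

df.to_csv('file.txt', sep='\t')  # dump uncompressed first

# Stream the CSV into a gzip file in 1 MiB chunks; shutil.copyfileobj
# never issues one oversized read/write covering the whole file.
with open('file.txt', 'rb') as src, gzip.open('file.txt.gz', 'wb') as dst:
    shutil.copyfileobj(src, dst, length=2**20)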
Can't reproduce anything from the code provided, but this is most likely solved by #21300. Please try on master and reopen with something reproducible if that doesn't solve it for you.
I have the same issue with a very large file, without compression:
2018-06-25 12:44:27,378|root|64215|MainProcess|CRITICAL| Exception Information
2018-06-25 12:44:27,380|root|64215|MainProcess|CRITICAL| Type: <class 'OSError'>
2018-06-25 12:44:27,381|root|64215|MainProcess|CRITICAL| Value: [Errno 22] Invalid argument
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
f.write(buf)
I have the same issue as @VelizarVESSELINOV, also on version 0.23.1, which, if I understood correctly, should already include the fix from #21300. I am not using compression from pandas, only a plain to_csv().
I'm also having this problem with the current version. I have to use this workaround:
n = 10000  # rows per chunk
# Split the dataframe into chunks of n rows each.
list_df = [data[i:i + n] for i in range(0, data.shape[0], n)]
# Write the first chunk with the header, then append the rest without it.
list_df[0].to_csv("data/iob.csv", index=False)
for chunk in list_df[1:]:
    chunk.to_csv("data/iob.csv", index=False, header=False, mode='a')
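A possibly simpler variant of the same idea: to_csv() accepts a chunksize argument that writes that many rows at a time instead of building one large buffer. I haven't verified it against this particular bug, so treat it as a sketch:

# 'data' is the same DataFrame as in the workaround above; 10,000 rows
# per write, mirroring the manual n = 10000 split.
data.to_csv("data/iob.csv", index=False, chunksize=10000)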