Writing to gzip no longer works with 0.23.1:
import gzip
import pandas as pd

with gzip.open('test.txt.gz', 'wt') as f:
    pd.DataFrame([0, 1], index=['a', 'b'], columns=['c']).to_csv(f, sep='\t')
produces corrupted output. This works fine in 0.23.0.
Presumably this is related to #21241 and #21118.
Please provide a reproducible example:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
@WillAyd Francois's example is reproducible for me on Windows 7 using master. The output file test.txt.gz is empty instead of containing data.
If I let pandas do the compression it appears to work fine:
df = pd.DataFrame([0,1],index=['a','b'], columns=['c'])
df.to_csv('C:/temp/test.txt.gz', sep='\t', compression='gzip')
Hi,
I also encountered a to_csv problem on 0.23.1 although my case is different to others:
import sys
import pandas as pd
df = pd.DataFrame([0,1])
df.to_csv(sys.stdout)
This code writes the dataframe to a file literally named <stdout>, whereas it is expected to print to stdout.
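Until a fix lands, one way to sidestep this for stdout is to not hand the stream to pandas at all: to_csv returns the CSV as a plain string when called with no path or buffer (documented behaviour), and you can write that string yourself. A minimal sketch:

```python
import sys

import pandas as pd

df = pd.DataFrame([0, 1])

# With no target given, to_csv() returns the CSV text instead of
# writing it, so pandas never touches the stream object.
csv_text = df.to_csv()
sys.stdout.write(csv_text)
```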
I also have a problem with "to_csv" specifically on 0.23.1.
It looks like the function "_get_handle()" returns "f" as a file-descriptor number (an int) instead of a writable buffer.
# GH 17778 handles zip compression for byte strings separately.
buf = f.getvalue()
if path_or_buf:
    f, handles = _get_handle(path_or_buf, self.mode,
                             encoding=encoding,
                             compression=self.compression)
    f.write(buf)
    f.close()
Error text:
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 168, in save
f.write(buf)
AttributeError: 'int' object has no attribute 'write'
@Liam3851 thanks - I misread the original post so I see the point now.
@saidie and @wildraid please do not add distinct issues to this thread. If you feel you have a different issue, please open it separately.
@WillAyd, I did some quick research.
It seems that all "file-like" objects which cannot be converted to string file paths are affected. The gzip wrapper, stdout, FDs: all of these problems have the same origin.
Example with FD:
import pandas
import os
with os.fdopen(3, 'w') as f:
    print(f)
    pandas.DataFrame([0, 1]).to_csv(f)
Output:
<_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
Traceback (most recent call last):
File "gg.py", line 6, in <module>
pandas.DataFrame([0, 1]).to_csv(f)
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
f.write(buf)
AttributeError: 'int' object has no attribute 'write'
I guess the integer comes from the "name" attribute of the TextIOWrapper. For stdout it will be <stdout>, etc.
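That guess is easy to confirm without pandas: a TextIOWrapper built directly from a raw file descriptor reports the descriptor number as its name attribute. A minimal sketch:

```python
import os

# fdopen() wraps a raw descriptor; the resulting wrapper's ``name``
# attribute is the integer fd, not a path string.
r, w = os.pipe()
f = os.fdopen(w, 'w')
name = f.name
print(type(name), name)

f.close()
os.close(r)
```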
I think the issue is caused by https://github.com/pandas-dev/pandas/pull/21249 in response to https://github.com/pandas-dev/pandas/issues/21227
The correct usage when passing a file handle and expecting compression should be like Francois's case, i.e. passing a gzip file handle or another compressed-archive file handle.
Writing to TemporaryFile fails as well. The file remains empty:
import tempfile
import pandas as pd
df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])
with tempfile.TemporaryFile() as f:
    df.to_csv(f)
    f.seek(0)
    print(f.read())
Hi, here are some additional examples of the changes in the behaviour of to_csv.
A common use case is to write a file header once and then write many dataframes' data to that file. Our implementation looks like this:
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [1.0, 2.0, 3.0],
})
df2 = ...

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    df.to_csv(f, index=False, header=False)
    ...
    df2.to_csv(f, ...
    df3.to_csv(f, ...
This works in 0.23.0 but in 0.23.1 it produces a file that looks like this:
col1,col2
0
3,3.0
What happened here is that pandas has opened a second handle to the same file path in write mode, and our f.write line was flushed last, overwriting some of what pandas wrote.
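The clobbering can be reproduced without pandas at all: open two write-mode handles to the same path and let the first one's buffer flush last (a minimal sketch, using a throwaway temp file instead of the paths above):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'no_headers.csv')

# First handle: write the header, but leave it sitting in the buffer
# (no flush), as in the original code.
f1 = open(path, 'w')
f1.write('col1,col2\n')

# Second 'w' handle to the same path (what pandas effectively opened):
# it truncates the file and writes the data rows.
with open(path, 'w') as f2:
    f2.write('1,1.0\n2,2.0\n3,3.0\n')

# Closing the first handle flushes its buffer at offset 0,
# overwriting the start of the second handle's output.
f1.close()

with open(path) as f:
    result = f.read()
print(result)
```

The result is exactly the mangled file shown above: the late-flushed header overwrites the first ten bytes of the data rows.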
Flushing alone would not help because now pandas will overwrite our data:
with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False)
produces:
1,1.0
2,2.0
3,3.0
One workaround is both flushing manually AND giving pandas a write mode:
with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False, mode='a')
IMO this is not expected behaviour: if we give pandas an open file handle, we don't expect pandas to work out the original path and open it again on a second file handle.
This is the bit of code where re-opening is decided: https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/csvs.py#L139
Thanks all for the reports!
There is a PR now that tries to fix this: https://github.com/pandas-dev/pandas/pull/21478. Trying it out or reviewing it is certainly welcome.
Hello, I raised a PR to remedy this issue; testing and review are welcome. For the reports from @francois-a and @saidie, and the other reproducible cases, this patch should fix them.
For now, a workaround would be to use a file path or StringIO.
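For the gzip case from the original report, the StringIO workaround looks like this (a sketch: render the CSV into memory first, then write the text to the compressed stream yourself, so pandas never sees the gzip handle):

```python
import gzip
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])

# Render the CSV into an in-memory text buffer; pandas only ever
# touches the StringIO, bypassing the broken handle re-opening.
buf = io.StringIO()
df.to_csv(buf, sep='\t')

# Write the finished text to the gzip stream ourselves.
path = os.path.join(tempfile.mkdtemp(), 'test.txt.gz')
with gzip.open(path, 'wt') as f:
    f.write(buf.getvalue())
```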
Closed via #21478