Writing to gzip no longer works with 0.23.1:
import gzip
import pandas as pd

with gzip.open('test.txt.gz', 'wt') as f:
    pd.DataFrame([0, 1], index=['a', 'b'], columns=['c']).to_csv(f, sep='\t')
produces corrupted output. This works fine in 0.23.0.
Presumably this is related to #21241 and #21118.
Please provide a reproducible example:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
@WillAyd Francois's example is reproducible for me on Windows 7 using master. The output file test.txt.gz is empty instead of containing data.
If I let pandas do the compression it appears to work fine:
df = pd.DataFrame([0,1],index=['a','b'], columns=['c'])
df.to_csv('C:/temp/test.txt.gz', sep='\t', compression='gzip')
Hi,
I also encountered a to_csv problem on 0.23.1 although my case is different to others:
import sys
import pandas as pd
df = pd.DataFrame([0,1])
df.to_csv(sys.stdout)
This code writes the dataframe to a file literally named <stdout>, whereas it is expected to print to stdout.
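Until a fix lands, one way to sidestep this for stdout is to not hand the stream to pandas at all: to_csv returns the CSV as a plain string when called with no path or buffer (documented behaviour), and you can write that string yourself. A minimal sketch:

```python
import sys

import pandas as pd

df = pd.DataFrame([0, 1])

# With no target given, to_csv() returns the CSV text instead of
# writing it, so pandas never touches the stream object.
csv_text = df.to_csv()
sys.stdout.write(csv_text)
```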
I also have a problem with "to_csv" specifically on 0.23.1.
It looks like the function "_get_handle()" returns "f" as a file-descriptor number (an int) instead of a writable buffer.
# GH 17778 handles zip compression for byte strings separately.
buf = f.getvalue()
if path_or_buf:
    f, handles = _get_handle(path_or_buf, self.mode,
                             encoding=encoding,
                             compression=self.compression)
    f.write(buf)
    f.close()
Error text:
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 168, in save
f.write(buf)
AttributeError: 'int' object has no attribute 'write'
@Liam3851 thanks - I misread the original post so I see the point now.
@saidie and @wildraid please do not add distinct issues to this thread. If you feel you have a different issue, please open it separately.
@WillAyd, I did some quick research.
It seems that all "file-like" objects which cannot be converted to string file paths are affected. The gzip wrapper, stdout, FDs: all of these problems have the same origin.
Example with FD:
import pandas
import os
with os.fdopen(3, 'w') as f:
    print(f)
    pandas.DataFrame([0, 1]).to_csv(f)
Output:
<_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
Traceback (most recent call last):
File "gg.py", line 6, in <module>
pandas.DataFrame([0, 1]).to_csv(f)
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 1745, in to_csv
formatter.save()
File "/Users/wr/anaconda3/lib/python3.6/site-packages/pandas/io/formats/csvs.py", line 166, in save
f.write(buf)
AttributeError: 'int' object has no attribute 'write'
I guess the integer comes from the "name" attribute of the TextIOWrapper. For stdout it will be <stdout>, etc.
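That guess is easy to confirm without pandas: a TextIOWrapper built directly from a raw file descriptor reports the descriptor number as its name attribute. A minimal sketch:

```python
import os

# fdopen() wraps a raw descriptor; the resulting wrapper's ``name``
# attribute is the integer fd, not a path string.
r, w = os.pipe()
f = os.fdopen(w, 'w')
name = f.name
print(type(name), name)

f.close()
os.close(r)
```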
I think the issue is caused by https://github.com/pandas-dev/pandas/pull/21249 in response to https://github.com/pandas-dev/pandas/issues/21227
The correct usage when passing a file handle and expecting compression should be like Francois's case, i.e. passing a gzip file handle or another compressed-archive file handle.
Writing to TemporaryFile fails as well. The file remains empty:
import tempfile
import pandas as pd
df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])
with tempfile.TemporaryFile() as f:
    df.to_csv(f)
    f.seek(0)
    print(f.read())
Hi, here are some additional examples of the changes in the behaviour of to_csv.
A common use case is to write a file header once and then write many dataframes' data to that file. Our implementation looks like this:
df = pd.DataFrame({
    'col1': [1, 2, 3],
    'col2': [1.0, 2.0, 3.0],
})
df2 = ...

with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    df.to_csv(f, index=False, header=False)
    ...
    df2.to_csv(f, ...
    df3.to_csv(f, ...
This works in 0.23.0 but in 0.23.1 it produces a file that looks like this:
col1,col2
0
3,3.0
What happened here is that pandas has opened a second handle to the same file path in write mode, and our f.write line was flushed last, overwriting some of what pandas wrote.
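The clobbering can be reproduced without pandas at all: open two write-mode handles to the same path and let the first one's buffer flush last (a minimal sketch, using a throwaway temp file instead of the paths above):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'no_headers.csv')

# First handle: write the header, but leave it sitting in the buffer
# (no flush), as in the original code.
f1 = open(path, 'w')
f1.write('col1,col2\n')

# Second 'w' handle to the same path (what pandas effectively opened):
# it truncates the file and writes the data rows.
with open(path, 'w') as f2:
    f2.write('1,1.0\n2,2.0\n3,3.0\n')

# Closing the first handle flushes its buffer at offset 0,
# overwriting the start of the second handle's output.
f1.close()

with open(path) as f:
    result = f.read()
print(result)
```

The result is exactly the mangled file shown above: the late-flushed header overwrites the first ten bytes of the data rows.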
Flushing alone would not help because now pandas will overwrite our data:
with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False)
produces:
1,1.0
2,2.0
3,3.0
One workaround is both flushing manually AND giving pandas a write mode:
with open('/tmp/no_headers.csv', 'w') as f:
    f.write('col1,col2\n')
    f.flush()
    df.to_csv(f, index=False, header=False, mode='a')
IMO this is not expected behaviour: if we give pandas an open file handle, we don't expect pandas to work out the original path and open it again on a second file handle.
This is the bit of code where re-opening is decided: https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/csvs.py#L139
Thanks all for the reports!
There is a PR now that tries to fix this: https://github.com/pandas-dev/pandas/pull/21478. Trying it out or reviewing it is certainly welcome.
Hello, I raised a PR to remedy this issue; testing and review are welcome. For the reports from @francois-a and @saidie, and the other reproducible cases, this patch should fix them.
For now, a workaround would be to use a file path or StringIO.
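For the gzip case from the original report, the StringIO workaround looks like this (a sketch: render the CSV into memory first, then write the text to the compressed stream yourself, so pandas never sees the gzip handle):

```python
import gzip
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame([0, 1], index=['a', 'b'], columns=['c'])

# Render the CSV into an in-memory text buffer; pandas only ever
# touches the StringIO, bypassing the broken handle re-opening.
buf = io.StringIO()
df.to_csv(buf, sep='\t')

# Write the finished text to the gzip stream ourselves.
path = os.path.join(tempfile.mkdtemp(), 'test.txt.gz')
with gzip.open(path, 'wt') as f:
    f.write(buf.getvalue())
```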
Closed via #21478