The DataFrame.to_csv method seems to accept a "compression" named parameter:
import numpy as np
import pandas as pd

data = np.arange(10).reshape(5, 2)
df = pd.DataFrame(data, columns=['a', 'b'])
df.to_csv('test.csv.gz', compression='gzip')
However, the file it creates is not compressed at all:
francesco@i3 ~/Desktop $ cat test.csv.gz
,a,b
0,0,1
1,2,3
2,4,5
3,6,7
4,8,9
How about either (i) actually implementing compression, or at least (ii) raising an error? The current behavior is confusing...
to_csv accepts **kwds, so arbitrary additional arguments are 'accepted' (mainly, IIRC, for compatibility with some of the other to_* functions that allow this) but ignored. I suppose that could be removed (not sure why it was there in the first place). That said, only the arguments in the docstring are public.
Would accept a pull-request to limit this.
I think this is out of scope for pandas--just use this: https://docs.python.org/2/library/gzip.html
please close
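For what it's worth, the suggested stdlib workaround is straightforward: render the CSV to a string with to_csv, then compress it with the gzip module. A minimal sketch (the filename is illustrative):

```python
import gzip

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['a', 'b'])

# to_csv with no path returns the CSV text as a string;
# compress it manually with the stdlib gzip module.
csv_text = df.to_csv()
with gzip.open('test.csv.gz', 'wt') as f:
    f.write(csv_text)
```

Reading the file back through gzip.open yields the original CSV text.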
But compression='gzip' is accepted (and enacted) in pd.read_csv, which is why I was assuming to_csv behaves the same.
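To illustrate the read side: a quick check, hand-writing a small gzipped CSV first so the snippet is self-contained (the filename 'demo.csv.gz' is just for illustration):

```python
import gzip

import pandas as pd

# Hand-write a tiny gzipped CSV to read back.
with gzip.open('demo.csv.gz', 'wt') as f:
    f.write('a,b\n0,1\n2,3\n')

# read_csv does honor compression='gzip'.
df = pd.read_csv('demo.csv.gz', compression='gzip')
print(df.shape)  # (2, 2)
```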
The way you initially phrased the issue suggested that you were just guessing at keyword arguments--'compression' isn't a documented argument so I don't think your confusion is shared by many. You're welcome to submit a pull-request, I don't feel religious about this at all
Sorry, what I meant is:
There is no error in the documentation and both (1) and (2) make sense to me.
It's (1) and (2) together, i.e., the fact that to_csv behaves differently from read_csv without telling the user, that seemed a bit inconsistent to me.
Closing on the grounds that I won't be fixing it myself, and probably it's not a proper bug.
Thanks! I definitely didn't mean to antagonize you--agreed that it's an unfortunate inconsistency
Would we want this feature, if someone would implement it? If so, we can leave it open marked as an enhancement proposal?
I would also like to_csv to have the same functionality as from_csv.
+1, a compression argument for DataFrame.to_csv would spare many user headaches.
In Python 3.4, I use the following workaround:
import gzip

with gzip.open('path_to_file', 'wt') as write_file:
    data_frame.to_csv(write_file)
@dhimmel If you're interested in putting in the work, I think we're still open to a PR to add this feature.
@shoyer, okay I will keep this in mind. I have a bit to learn first.
Thank you so much for implementing this! Besides the aesthetics and fixing the asymmetry between read and write, this is a huge improvement for users like me.
This did not work for me, the output file isn't compressed. I'm using Pandas 0.18.1
@jsmedmar could you open a new issue with that demonstrating the problem? Thanks.
@jsmedmar I see the "compression" argument is properly documented and it is working
http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.DataFrame.to_csv.html
One confusing thing is that if you run the following code
import numpy as np
import pandas as pd
data = np.arange(10).reshape(5, 2)
df = pd.DataFrame(data, columns=['a', 'b'])
print(df)
df.to_csv('test.csv.gz', compression='gzip')
"""
a b
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
"""
you get a compressed file, but opening it in vim automatically decompresses it, so to verify that compression actually happened, use the "head" command:
$ head test.csv.gz
D5X�test.csv�70
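A programmatic alternative to eyeballing head output: every gzip stream starts with the magic bytes 0x1f 0x8b, so peeking at the raw file tells you whether the output was really compressed. A self-contained sketch (it writes its own small gzipped file first):

```python
import gzip

# Write a tiny gzipped CSV so this check is self-contained.
with gzip.open('test.csv.gz', 'wt') as f:
    f.write(',a,b\n0,0,1\n')

# Gzip streams begin with the magic bytes 0x1f 0x8b; reading the raw
# bytes (not through gzip) distinguishes compressed output from plain text.
with open('test.csv.gz', 'rb') as f:
    magic = f.read(2)
print(magic == b'\x1f\x8b')  # True
```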