Pandas: Python 3 writing to_csv file ignores encoding argument.

Created on 3 May 2016 · 32 comments · Source: pandas-dev/pandas

# is missing the UTF8 BOM (encoded with default encoding UTF8)
with open('path_to_f', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

# is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)
df.to_csv('path_to_f', encoding='utf-8-sig')

I expect:

with open('path_to_f', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

To crash with TypeError: write() argument must be str, not bytes

and I expect:

with open('path_to_f', 'wb') as f:
    df.to_csv(f, encoding='utf-8-sig')

To write the file correctly.
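Until that works, one possible workaround (a sketch only, using io.TextIOWrapper; path_to_f is the file name from the example above) is to open the file in binary mode yourself and let a text wrapper apply the encoding, BOM included:

import io
import pandas as pd

df = pd.DataFrame()
with open('path_to_f', 'wb') as f:
    # the wrapper, not to_csv's ignored `encoding` argument, applies utf-8-sig
    w = io.TextIOWrapper(f, encoding='utf-8-sig', newline='')
    df.to_csv(w)
    w.flush()
    w.detach()  # hand the raw file back so the with-block closes it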

Copy pasta

#!/usr/bin/env python3
import pandas as pd
df = pd.DataFrame()
# is missing the UTF8 BOM (encoded with default encoding UTF8)
with open('file_one', 'w') as f:
    df.to_csv(f, encoding='utf-8-sig')

assert open('file_one', 'rb').read() == b'""\n'

# is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)
df.to_csv('file_two', encoding='utf-8-sig')
assert open('file_two', 'rb').read() == b'\xef\xbb\xbf""\n'
Labels: Bug, Error Reporting, IO CSV, Unicode

All 32 comments

you would have to show a reproducible example. what does this have to do with excel? you are reporting a csv issue, no? excel being able to read something doesn't prove (or disprove) anything.

furthermore, show pd.show_versions(), a sample of the frame, and df.info() as well.

@jreback updated issue to remove Excel problem

so we will still need a copy-pastable example.

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.0
nose: None
pip: 8.1.1
setuptools: 21.0.0
Cython: None
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None

@jreback updated with a copy-pastable example

This looks to be a design flaw in all "io" outputs that take encodings and file objects on Python 3.

Python 3.5.1 |Continuum Analytics, Inc.| (default, Dec  7 2015, 11:24:55) 

In [1]: df = pd.DataFrame()

In [2]: with open('file_one', 'w') as f:
   ...:         df.to_csv(f, encoding='utf-8-sig')
   ...:     

In [3]: assert open('file_one', 'rb').read() == b'""\n'

In [4]: # is not missing the UTF8 BOM (encoded with passed encoding utf-8-sig)

In [5]: df.to_csv('file_two', encoding='utf-8-sig')

In [6]: assert open('file_two', 'rb').read() == b'\xef\xbb\xbf""\n'

In [7]: pd.__version__
Out[7]: '0.18.1'

what's the problem?

works on 0.18.0 as well.

The first call ignores the encoding... The first assert should fail

hmm, you are opening it in text mode. Not really sure if a stream indicates whether it's text or binary. I don't know that this is a bug on pandas' side. Can you repro using non-pandas?
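For reference, a minimal non-pandas sketch of the text-mode behaviour in question (demo.txt is a made-up name): a stream opened in text mode already carries its own encoding, and its write() refuses bytes.

with open('demo.txt', 'w') as f:       # text mode, locale default encoding
    print(f.encoding)                  # e.g. 'UTF-8'
    f.write('é')                       # encoded with f.encoding, whatever we ask for
    try:
        f.write('é'.encode('latin1'))  # bytes are refused outright
    except TypeError as e:
        print(e)                       # write() argument must be str, not bytes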

If I open the file in binary mode, pandas tries to write str to the file and crashes.

can you show what happens? e.g. that would be the test

#!/usr/bin/env python3
import pandas as pd

df = pd.DataFrame()
with open('file_one', 'wb') as f:
    df.to_csv(f, encoding='utf-8-sig')
# crashes: TypeError: write() argument must be str, not bytes


ahh I see now. ok, it prob needs to be opened with a codec, so when the stream is created it should be inserted there. since you are familiar, want to do a PR?
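A sketch of what that codec wrapping could look like (illustrative only, not the eventual pandas fix): codecs.getwriter returns a StreamWriter that encodes each str chunk to bytes before writing to the underlying binary file.

import codecs
import pandas as pd

df = pd.DataFrame([['a', 'é']])
with open('out.csv', 'wb') as f:
    writer = codecs.getwriter('latin1')(f)  # str in, latin1 bytes out
    df.to_csv(writer, index=False)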

@jaidevd what's the desired behaviour? Crash on passing a Unicode writer, or deprecate the encoding keyword argument in favour of passing Unicode writers only?

no, I think I would raise a more informative message. If a user wants to pass a non-compat stream (and we can't do anything with it), then we must raise. _most_ usage does not pass a stream when writing.

@jreback so crash if a unicode-accepting stream is passed, and raise an informative error.

well, if passed a non-unicode-accepting stream when an encoding is passed, I guess. I don't think there is a way to fix it? raising an exception that is helpful is just fine.

@jreback we need to_csv and the related functions to support either binary file objects together with the encoding argument, or unicode (text) streams without the encoding argument.

I thought that's what I said.

@jreback I thought you meant the status quo: neither binary file objects with the encoding argument, nor unicode streams without it, just better exceptions.

oh, you are saying 2 issues. I didn't really look too closely. I am all for writing things correctly, or raising if it's incorrect. As I said, I suspect we have _very_ little testing on writing unicode with streams now (maybe no tests), esp with alternate encodings. This is quite uncommon.

Would be ok with complete tests and write if possible, raising if not.

so always write bytes, regardless of Python version. With nice exceptions when writing to unicode streams.

no, I believe the existing impl in py2 is correct. Write out tests for all cases, test them under both versions and you will have the answer.
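A sketch of the kind of round-trip test meant here (pytest-style, hypothetical name; it asserts the desired behaviour, so under the current py3 implementation it would fail):

import pandas as pd

def test_to_csv_encoding_with_binary_handle(tmp_path):
    df = pd.DataFrame([['a', 'é']])
    path = tmp_path / 'latin1.csv'
    with open(path, 'wb') as f:
        df.to_csv(f, index=False, encoding='latin1')
    # bytes on disk should be latin1 (b'\xe9' for 'é'), not UTF-8 (b'\xc3\xa9')
    assert 'é'.encode('latin1') in path.read_bytes()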

Hi!
I am also having trouble with to_csv ignoring the encoding argument on Python 3.

To be more specific, the problem comes from the following code (modified to focus on the problem and be copy-pastable):

df = pd.DataFrame([['a', 'é']])
with open('df_to_csv_utf8.csv', 'w') as f:
    df.to_csv(f, index=False, encoding='utf-8')
with open('df_to_csv_latin1.csv', 'w') as f:
    df.to_csv(f, index=False, encoding='latin1')

If run with python2, I actually get two files with different encodings:

>>> magic.from_file('df_to_csv_utf8.csv')
UTF-8 Unicode text
>>> magic.from_file('df_to_csv_latin1.csv')
ISO-8859 text

But with python3, they are both utf-8 encoded:

>>> magic.from_file('df_to_csv_utf8.csv')
UTF-8 Unicode text
>>> magic.from_file('df_to_csv_latin1.csv')
UTF-8 Unicode text

I know magic only guesses the encoding, but this seemed a clear way of showing the difference.

A better proof is to try to decode the text written in the files. Using the python codecs module, you get:

python2:

>>> with codecs.open('df_to_csv_latin1.csv', encoding='latin1') as f:
>>>     print f.read()
0,1
a,é

python3:

>>> with codecs.open('df_to_csv_latin1.csv', encoding='latin1') as f:
>>>     print(f.read())
0,1
a,Ã©

For the record, trying to open both files with LibreOffice Calc gives the same result: the file written with python3 using latin1 encoding cannot be opened properly when you specify latin1 encoding; it must be opened as utf-8 to be displayed correctly.

@graingert @jreback have you made progress on this subject?

For information, I use:

|python2 |python3 |
|-----------------------------|-----------------------------|
|commit: None |commit: None |
|python: 2.7.13.final.0 |python: 3.5.3.final.0 |
|python-bits: 64 |python-bits: 64 |
|OS: Linux |OS: Linux |
|OS-release: 4.10.0-33-generic|OS-release: 4.10.0-33-generic|
|machine: x86_64 |machine: x86_64 |
|processor: x86_64 |processor: x86_64 |
|byteorder: little |byteorder: little |
|LC_ALL: None |LC_ALL: None |
|LANG: en_US.UTF-8 |LANG: en_US.UTF-8 |
|LOCALE: None.None |LOCALE: en_US.UTF-8 |
|pandas: 0.20.3 |pandas: 0.20.3 |
|pytest: None |pytest: None |
|pip: 9.0.1 |pip: 9.0.1 |
|setuptools: 36.4.0 |setuptools: 36.4.0 |
|Cython: None |Cython: None |
|numpy: 1.13.1 |numpy: 1.13.1 |
|scipy: None |scipy: None |
|xarray: None |xarray: None |
|IPython: 5.4.1 |IPython: 6.1.0 |
|sphinx: None |sphinx: None |
|patsy: None |patsy: None |
|dateutil: 2.6.1 |dateutil: 2.6.1 |
|pytz: 2017.2 |pytz: 2017.2 |
|blosc: None |blosc: None |
|bottleneck: None |bottleneck: None |
|tables: None |tables: None |
|numexpr: None |numexpr: None |
|feather: None |feather: None |
|matplotlib: None |matplotlib: None |
|openpyxl: None |openpyxl: None |
|xlrd: None |xlrd: None |
|xlwt: None |xlwt: None |
|xlsxwriter: None |xlsxwriter: None |
|lxml: None |lxml: None |
|bs4: None |bs4: None |
|html5lib: 0.999999999 |html5lib: 0.999999999 |
|sqlalchemy: None |sqlalchemy: None |
|pymysql: None |pymysql: None |
|psycopg2: None |psycopg2: None |
|jinja2: 2.9.6 |jinja2: 2.9.6 |
|s3fs: None |s3fs: None |
|pandas_gbq: None |pandas_gbq: None |
|pandas_datareader: None |pandas_datareader: None |

I am confused by this. You are doing with open('df_to_csv_latin1.csv', 'w') as f:, which sets the encoding of f to your system's default encoding, which on your machine is likely to be UTF-8 (check with locale.getpreferredencoding()).

If you want to write in latin1, why don't you just open the file in latin1?

df = pd.DataFrame([['a', 'é']])
with open('df_to_csv_utf8.csv', 'w', encoding='utf-8') as f:
    df.to_csv(f, index=False)
with open('df_to_csv_latin1.csv', 'w', encoding='latin1') as f:
    df.to_csv(f, index=False)

with open('df_to_csv_latin1.csv', encoding='latin1') as f:
    print(f.read())

prints

0,1
a,é

as expected.
(Hint: if you are working with Excel, you may want to write as utf-8-sig so that the UTF-8 BOM is included and Excel actually knows it's UTF-8.)
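A quick check of that hint, in the same assert style as the report above (bom_demo.csv is a made-up name):

with open('bom_demo.csv', 'w', encoding='utf-8-sig') as f:
    f.write('a,b\n')
# the UTF-8 BOM is the first three bytes on disk
assert open('bom_demo.csv', 'rb').read().startswith(b'\xef\xbb\xbf')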

Hi @watercrossing
I cannot specify the encoding when opening the file because I receive an already opened file descriptor.
Otherwise, this would indeed be the solution.

My point is that you can specify an encoding in the to_csv method, but it is not taken into account (which was not the case with python 2).

There may be no solution but it is confusing: you can set an option that is (quietly) not used.
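One possible workaround when you are handed an already-open text handle (a sketch, assuming the handle exposes its binary buffer as f.buffer): re-wrap the buffer with the encoding you actually want.

import io
import pandas as pd

df = pd.DataFrame([['a', 'é']])
with open('df_to_csv_latin1.csv', 'w') as f:  # encoding fixed by the caller
    w = io.TextIOWrapper(f.buffer, encoding='latin1', newline='')
    df.to_csv(w, index=False)
    w.flush()
    w.detach()  # give the buffer back so closing f behaves as usual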

@anuraagmadhavRacherla please don't hijack unrelated issues. This question would be better suited for Stack Overflow. Also, you've stripped all the useful debugging information from your exception (the stack trace).

For my part, data_frame.to_excel needs the file opened with 'wb', while to_csv needs 'w'.

pandas==0.22.0

Hi folks, I wrote an article on my blog on how to Support Binary File Objects with pandas.DataFrame.to_csv. At the end of the article I added a monkey patch that I think can also be used as a workaround for this problem. Hope this helps until this is resolved in pandas.
