Pandas: to_csv and bytes on Python 3.

Created on 23 Mar 2015 · 12 Comments · Source: pandas-dev/pandas

Is this desired behavior and something I need to work around or a bug? Notice the byte type marker is written to disk so you can't round-trip the data. This works fine in Python 2 with unicode AFAICT.

In [1]: pd.__version__
Out[1]: '0.15.2-252-g0d35dd4'

In [2]: pd.DataFrame.from_dict({'a': ['a', 'b', 'c']}).a.str.encode("utf-8").to_csv("tmp.csv")

In [3]: !cat tmp.csv
0,b'a'
1,b'b'
2,b'c'
Bug IO CSV Unicode

Most helpful comment

@zhuoqiang What I think you meant is you have to do this:

df['Column'] = df['Column'].str.decode('ascii') # or utf-8 etc.

Simply doing astype(str) doesn't help--the to_csv() output still contains b'...' wrappers.

All 12 comments

I'd say this is not intended, but I haven't worked on this part of the code. Since it's being written to a file anyway, (Python 3) bytes written to CSV should come out identical to (Python 3) str.

FWIW I think that's actually the output I'd expect in 3.

I guess I would expect behavior similar to

with open('tmp.txt', 'wb') as f:
    f.write('abc'.encode('utf-8'))

which doesn't have the b prefix.

The caveat here is that you have to explicitly open the file in wb mode since you're writing bytes. That can't work for DataFrames (I don't think) since you could have a mix of bytes and strs across columns. Do we support wb mode in to_csv? I get an error when we try to open the file handle.

I think you just need to pass the encoding argument when writing it (otherwise it defaults to ascii on py2 and utf-8 on py3). This is from py2

In [2]: from pandas.compat import u

In [3]: df = DataFrame({u('c/\u03c3'): [1, 2, 3]})

In [4]: df
Out[4]: 
   c/?
0    1
1    2
2    3

In [5]: df.to_csv('tmp.csv', mode='w', encoding='utf-8')

In [6]: !cat tmp.csv
,c/σ
0,1
1,2
2,3

It'd be better if pandas had a configurable parameter in to_csv() so that people could control how bytes are rendered in the CSV file. Otherwise we have to manually convert bytes to strings before I/O output:

df['Column'] = df['Column'].astype(str)
df.to_csv('output.csv')

I have this problem also. Here's a trivial example that I think most regular users would expect to work differently:

>>> import pandas as pd
>>> import sys
>>> pd.Series([b'x',b'y']).to_csv(sys.stdout)
0,b'x'
1,b'y'

>>> pd.__version__
'0.18.1'

That is, the CSV is created with Python-specific b prefixes, which other programs don't know what to do with. CSV is not just a Python data interchange format, it's what a ton of people use to dump their data into other systems, and the above should "just work" the same as it does in Python 2:

0,x
1,y

@zhuoqiang What I think you meant is you have to do this:

df['Column'] = df['Column'].str.decode('ascii') # or utf-8 etc.

Simply doing astype(str) doesn't help--the to_csv() output still contains b'...' wrappers.

I totally agree with @jzwinck.
How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

When you use pd.read_csv() with an _array-protocol string_ dtype, round-tripping gets messed up:

>>> import pandas as pd
>>> fname = './blah.csv'
>>> pd.Series([b'x', b'y']).to_csv(fname)
>>> pd.read_csv(fname, dtype='S5')
     0     b'x'
0  b'1'  b"b'y'"

Using dtype=str or dtype='S' does work as expected, however:

>>> pd.read_csv(fname, dtype='S')
   0  b'x'
0  1  b'y'

I actually find even that unexpected, since it seems to be interpreting the values as Python string literals automatically.

If a user chooses to load CSV data as bytes, that should be specified explicitly, just like it works when you write out unicode, and not inferred from Python's encoding-specific markup:

>>> pd.Series(['x', 'y']).to_csv(fname)
>>> pd.read_csv(fname)
   0  x
0  1  y
>>> pd.read_csv(fname, dtype='S10')
   0  b'x'
0  1  b'y'
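For comparison, decoding before writing makes the round trip work cleanly; here's a sketch using an in-memory buffer:

```python
import io

import pandas as pd

# decode the bytes up front, then write and read back
s = pd.Series([b'x', b'y']).str.decode('ascii')
buf = io.StringIO()
s.to_csv(buf, header=False)

buf.seek(0)
out = pd.read_csv(buf, header=None, index_col=0)
print(out)  # the values come back as plain strings 'x' and 'y'
```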

How can you in any way justify leaking python's encoding system syntax into a generic data exchange format?

I think everyone agrees that writing out the b prefixes is a bug :) My question is whether we should either

  1. Attempt to decode all the bytes to text in to_csv before writing, using the provided encoding, or
  2. Raise an error, directing the user to perform the decoding before attempting to_csv.
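Option 1 could look something like this hypothetical pre-processing step (decode_bytes_columns is not a pandas API, just a sketch of the idea):

```python
import io

import pandas as pd

def decode_bytes_columns(df, encoding='utf-8'):
    """Hypothetical helper: return a copy with bytes columns decoded to str."""
    out = df.copy()
    for col in out.columns:
        # decode any column that actually holds bytes values
        if out[col].map(lambda v: isinstance(v, bytes)).any():
            out[col] = out[col].str.decode(encoding)
    return out

df = pd.DataFrame({'a': [b'x', b'y'], 'n': [1, 2]})
buf = io.StringIO()
decode_bytes_columns(df).to_csv(buf, index=False)
print(buf.getvalue())  # a,n / x,1 / y,2 -- no b'...' wrappers
```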

@TomAugspurger My vote's for 1.

Since the encoding kwarg determines the file's encoding, any mismatching text-like data should be appropriately encoded before writing.

I'm getting worried though (especially being new to py3) because apparently even print does this?

>>> print(b'doggy')
b'doggy'

So maybe @dsm054 was right?

@TomAugspurger: I prefer your number 1: just decode, because that's what most users would want.

@tgoodlet: It doesn't matter what print does. Print is sort of a hybrid between being "pretty" and showing you what you'd need to reconstruct the variable. CSV writing is somewhat orthogonal.

Proposal to fix this issue:

We introduce a new parameter to .to_csv, namely bytes_encoding, which decides the encoding scheme used to decode the bytes. This gives the user the flexibility to write to a file opened with one encoding while the bytes to be decoded use a different one (e.g. writing the file in UTF-16 while the data holds ASCII bytes).

If bytes_encoding is not provided, we fall back to the encoding argument passed to .to_csv to decode the bytes.

Do note that after the bytes are decoded using the bytes_encoding scheme, the result will still be transcoded to the encoding of the path/file object before being written to the file. If this transcoding results in an error, we should report it.
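The decode-then-transcode flow of the proposal could be sketched roughly like this (render_bytes_cell and both parameter names are illustrative, not pandas API):

```python
def render_bytes_cell(value, bytes_encoding='utf-8', file_encoding='utf-8'):
    """Sketch of the proposal: decode bytes with bytes_encoding, then check
    the result can be transcoded to the target file encoding."""
    if isinstance(value, bytes):
        # may raise UnicodeDecodeError if bytes_encoding doesn't match the data
        text = value.decode(bytes_encoding)
    else:
        text = str(value)
    # transcode to the file's encoding; errors here should be reported to the user
    text.encode(file_encoding)
    return text

print(render_bytes_cell(b'abc', bytes_encoding='ascii', file_encoding='utf-16'))
```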
