Pandas: read_csv and unicode characters in filename (Python 2.7, pandas 0.15.2)

Created on 6 Feb 2015 · 10 Comments · Source: pandas-dev/pandas

The code:

import pandas
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv")

does not work, and gives the output:

IOError: File C:/???Q309.ppt~Metadata.tsv does not exist

It seems similar in nature to this issue: https://github.com/pydata/pandas/issues/9315. However, #9315 was reportedly fixed in pandas 0.14.2 with Python 3.3.5. I am using pandas 0.15.1 and Python 2.7.7.

Here is the output of pd.show_versions():

commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.15.2
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.2
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

Thanks,
Justin

IO CSV Unicode

All 10 comments

For what it's worth, the following is a workaround which seems to be doing the trick:

f = open(u"C:/成功例Q309~Metadata.tsv")
df = pd.read_csv(f)
f.close()

see this issue here: https://github.com/pydata/pandas/issues/6770

This is already in 0.15.2 (i.e. it will decode with the system encoding). So I think you may need to set it.

My mistake, I am using 0.15.2 (not 15.1).

But I'm still not clear, what are you suggesting that I "set"? The system encoding? This is something that I would need to do before loading the file?

Thanks, Justin

I think the system encoding might be set to something odd.

You can try setting it to utf-8 and see if it works.

The filesystem encoding and default encoding are 'mbcs' and 'cp1252' respectively:

sys.getfilesystemencoding()
Out[12]: 'mbcs'

sys.getdefaultencoding()
Out[13]: 'cp1252'

These options all fail in a similar way though:

df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='utf-8')
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='mbcs')
df = pandas.read_csv(u"C:/成功例Q309~Metadata.tsv", encoding='cp1252')

Should I be setting the encoding in a different way?

Those options have to do with the encoding of the file itself, not the filename.
Try decoding the filename before passing it,

e.g.

the_filename.decode('utf-8'), then pass the decoded filename.

Using filename.decode('utf-8') gives this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 3-5: character maps to <undefined>
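
The 'charmap' error is consistent with calling .decode on a string that is already unicode: Python 2 first encodes such a string implicitly with the default encoding (reported above as cp1252), and that codec has no mapping for the non-Latin characters in the name. The "???" in the original IOError may arise the same way, although where exactly pandas performs that conversion is an assumption here. A minimal sketch of the two behaviours, using a hypothetical path with three CJK characters and hypothetical helper names (to_ansi, can_encode):

```python
# -*- coding: utf-8 -*-
# Hypothetical path: three CJK characters followed by ASCII text.
PATH = u"C:/\u6210\u529f\u4f8bQ309~Metadata.tsv"


def to_ansi(path, codec="cp1252"):
    """Round-trip a path through a legacy codec, replacing what it cannot map."""
    # "replace" substitutes "?" for every character the codec cannot encode.
    return path.encode(codec, "replace").decode(codec)


def can_encode(path, codec="cp1252"):
    """Return True if every character of the path exists in the codec."""
    try:
        path.encode(codec)  # strict mode raises on unmappable characters
    except UnicodeEncodeError:
        return False
    return True
```

With these definitions, to_ansi(PATH) gives C:/???Q309~Metadata.tsv and can_encode(PATH) is False, which matches the shape of both error messages in this thread.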

I think this is an issue with the filesystem / encoding. Let me know if it's still a problem, and if Python's builtin open(filename) works, but pandas read_csv does not.

This still happens for me. The worst part is that if I use the workaround using open, read_csv does not parse the utf-8 in the file correctly anymore. Any help?

Try using the built-in open() as below. It worked for me.
df = pd.read_csv(open(filename, 'r'))
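
The earlier report that UTF-8 content is no longer parsed correctly with the open() workaround may come from opening the file without an explicit encoding. A minimal sketch, assuming a current Python 3 and pandas, a tab-separated file stored as UTF-8, and hypothetical names (read_tsv_via_handle, demo, and the file name itself); it has not been checked against Python 2.7 with pandas 0.15.2:

```python
import io
import os
import tempfile

import pandas as pd


def read_tsv_via_handle(path, encoding="utf-8"):
    """Open the file ourselves, then hand the open handle to pandas."""
    # io.open decodes the file contents with the given encoding; the
    # context manager closes the handle even if parsing fails.
    with io.open(path, mode="r", encoding=encoding, newline="") as handle:
        return pd.read_csv(handle, sep="\t")


def demo():
    """Round-trip a small UTF-8 TSV stored under a non-ASCII file name."""
    folder = tempfile.mkdtemp()
    # Hypothetical file name containing three CJK characters.
    path = os.path.join(folder, u"\u6210\u529f\u4f8bQ309~Metadata.tsv")
    with io.open(path, mode="w", encoding="utf-8", newline="") as handle:
        handle.write(u"name\tvalue\ncaf\u00e9\t1\n")
    try:
        return read_tsv_via_handle(path)
    finally:
        # Clean up the temporary file and folder.
        os.remove(path)
        os.rmdir(folder)
```

Because the handle is already decoded text, the encoding is chosen in io.open rather than through read_csv's encoding argument.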
