Pandas: read_csv does not parse the header of a UTF-8 file with BOM

Created on 10 Sep 2013 · 27 comments · Source: pandas-dev/pandas

I am using Pandas version 0.12.0 on a Mac.

I noticed that when a file has a UTF-8 BOM and the header row is on the first line, the read_csv() method leaves a leading quotation mark in the first column's name. However, if the header row is further down the file and I use the "header=" option, then the whole header row gets parsed correctly.

Here is an example:

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', header=9, thousands=',', encoding='utf-8')

Parses the header correctly.

bing_kw = pd.read_csv('../../data/sem/Bing-Keyword_daily.csv', thousands=',', encoding='utf-8')

Parses the first header column name incorrectly, leaving the leading quotation mark in place.
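As a minimal sketch of what is going on (not from the original report): the three BOM bytes survive a plain 'utf-8' decode as the character U+FEFF and stick to the first header name, while 'utf-8-sig' strips them.

```python
# The first three bytes of a UTF-8-with-BOM file decode to U+FEFF under
# plain 'utf-8', so the character ends up glued to the first header name;
# the 'utf-8-sig' codec strips the BOM before decoding.
raw = b'\xef\xbb\xbfcol1,col2\n1,2\n'

print(repr(raw.decode('utf-8').split(',')[0]))      # prints '\ufeffcol1'
print(repr(raw.decode('utf-8-sig').split(',')[0]))  # prints 'col1'
```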

Bug IO CSV


All 27 comments

which version of Python are you using? (python --version)

Python 2.7.5 :: Anaconda 1.6.1 (x86_64)

And the code is written in a python notebook using ipython 1.0.0.

@john-orange-aa can you provide a reproducible example (link to a file if you need to)

@jreback, here is the link to the folder:
https://drive.google.com/folderview?id=0BwxOyJG828PySFFQVlBSUEdlcEk&usp=sharing

BOM-temp.csv is the offending file. BOM-temp2.csv is the same file with headers removed. The "pandas BOM utf-8 bug.ipynb" is the ipython notebook that illustrates the bug.

This isn't exactly the same issue, but I'm also having trouble with BOMs. File with a utf-8 BOM here.

    [~/work/]
    In [1]: pd.version.version
    Out[1]: '0.12.0-1149-g141e93a'

    [~/work/]
    In [2]: dta = pd.read_csv("bom_file.csv", encoding='utf-8')

    [~/work/]
    In [3]: dta.columns[0]
    Out[3]: u'\ufeffDate (MMM-YY)'

It looks like you should use 'utf-8-sig' as the encoding for utf-8 files with a BOM, so my comment is likely invalid.

Is it possible for Pandas to infer the encoding from the BOM automatically - or is it really required to pass this information through in the encoding? I think the purpose of the BOM was to provide this kind of capability?
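Such inference could be sketched with the standard library's BOM constants (this helper is hypothetical, not a pandas API; the UTF-32 BOMs must be checked before UTF-16 because the UTF-32-LE BOM begins with the same two bytes as the UTF-16-LE one):

```python
import codecs

# Hypothetical helper: infer a codec from a leading BOM, falling back
# to a default when no BOM is present.
def sniff_bom_encoding(raw_head, default='utf-8'):
    boms = [(codecs.BOM_UTF32_LE, 'utf-32-le'),
            (codecs.BOM_UTF32_BE, 'utf-32-be'),
            (codecs.BOM_UTF8, 'utf-8-sig'),
            (codecs.BOM_UTF16_LE, 'utf-16-le'),
            (codecs.BOM_UTF16_BE, 'utf-16-be')]
    for bom, enc in boms:
        if raw_head.startswith(bom):
            return enc
    return default

print(sniff_bom_encoding(b'\xef\xbb\xbf"Date",1'))  # prints utf-8-sig
```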

'utf-8-sig' does not resolve the issue. I faced the same issue, but using 'utf-8-sig' just got me another decoding problem:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xae in position 14: invalid start byte
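That traceback actually points at a different root cause: 0xae cannot start a UTF-8 sequence at all, so the file is probably not UTF-8 in the first place (0xae is "®" in cp1252/latin-1). A quick check, as a sketch:

```python
# 0xae is invalid as a UTF-8 start byte, but decodes fine under cp1252,
# which suggests the file was written with a Windows codepage, not UTF-8.
chunk = b'Trademark\xae symbol'
try:
    chunk.decode('utf-8')
except UnicodeDecodeError as exc:
    print('not utf-8:', exc.reason)
print(chunk.decode('cp1252'))  # prints: Trademark® symbol
```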

I ran into this problem as well (version 0.15.2). I tried 'utf-8-sig' encoding, and though I didn't see an error, the result was not quite right as the first key is quoted and none of the other keys are, though all column headers/values are quoted throughout the file.

import codecs
import pandas as pd

f = open('data/bomex.csv')
header = f.read(10)
f.close()
header

out: '\xef\xbb\xbf"Name",'

f = codecs.open('data/bomex.csv', encoding='utf-8-sig')
header = f.read(10)
f.close()
header

out: u'"Name","Team","G"'

df = pd.read_csv('data/bomex.csv', encoding='utf-8-sig')
df.keys()[0]

out: u'"Name"'

df.keys()[1]

out: u'Team'

Note the extra set of quotes on the first key
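Until the parser handles this, one possible workaround (my own sketch, not an official fix) is to read with 'utf-8-sig' and then strip any stray surrounding quotes left on the column names:

```python
import io
import pandas as pd

# Sketch of a workaround: read with 'utf-8-sig', then strip stray
# surrounding quotes that affected versions leave on the first header.
raw = b'\xef\xbb\xbf"Name","Team"\n"foo","bar"\n'
df = pd.read_csv(io.BytesIO(raw), encoding='utf-8-sig')
df.columns = [c.strip('"') for c in df.columns]
print(list(df.columns))  # prints ['Name', 'Team']
```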

With 0.15.2, I am able to use encoding="utf-8-sig" and the BOM disappears from the first column header.

In 0.15.2, I find that the BOM disappears, but the quotes around the first column header are erroneously preserved, while quotes around all other column headers (and all other values) are stripped.

So the problem with utf-8-sig seems to only affect quoted column headers. Here's an example file to try

https://dl.dropboxusercontent.com/u/27287953/bom.csv

...

Hi,

I'm using pandas 0.16 in python 3.4 (anaconda distro).
I was having a problem with some files, and it seems to be related with the BOM of the file.
I have this file: https://www.dropbox.com/s/nced7whmt2rr0c8/sample_file2.txt

I'm reading the file with:

import pandas as pd
test = pd.read_table("sample_file2.txt", decimal = ",", 
                     parse_dates = True)

When I print the column names:

>>> print(test.columns)
Index(['TimeStamp', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14',
 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29',
 'V30', 'V31'], dtype='object')

So, there seems to be a 'TimeStamp' column. Let me check what is in there:

>>> test["TimeStamp"] 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 1787, in __getitem__
[truncated]
File "pandas\hashtable.pyx", line 705, in pandas.hashtable.PyObjectHashTable.get_item
(pandas\hashtable.c:12300)
KeyError: 'TimeStamp'

And I don't seem to be able to find out what the real name of the TimeStamp column is =/ I had this in a longer script and it took me forever to understand where the problem came from. Adding encoding = "utf-8-sig" to the read_table call solves the issue. Can pandas figure out the encoding on its own, to avoid such problems?

@zelite In this example, your column was actually spelled 'TimeStamp' (yes, with the preceding space). I don't know if it's possible to detect encodings like this automatically... in general, that can be a hard problem.

@shoyer I don't think that was a space thing; it was definitely the BOM. If you do something like df.columns[0] you will see the utf-8 characters.
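Printing the repr of every column name makes such invisible characters visible; a sketch below (on versions where this was fixed the BOM is already stripped, so nothing odd may show):

```python
import io
import pandas as pd

# Print the repr of each column name; invisible characters such as the
# BOM (U+FEFF) show up as escapes. On pandas 0.16 the first name here
# came out as '\ufeffTimeStamp'; later versions strip the BOM.
raw = b'\xef\xbb\xbfTimeStamp\tV1\n2015-01-01\t1.0\n'
df = pd.read_csv(io.BytesIO(raw), sep='\t', encoding='utf-8')
for c in df.columns:
    print(repr(c))
```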

OK, I'm clearly lost. Just looked up exactly what "BOM" is :).

I can reproduce the issue with Pandas 0.16.0. I get the same error on the first column name with read_csv, or an erroneous first cell of the first column when the file has no header on row 1 and the names= argument is passed to read_csv.

It sounds like read_csv interprets encoding='utf-8-sig' correctly by skipping the first 3 bytes of the file and then interpreting the rest as UTF-8. However, the bug makes me think that pandas "forgets" to skip those first 3 bytes when it starts to parse the file and create the dataframe: the offset of the beginning of the effective data did not get +len(UTF8_BOM), so the BOM ends up included in the first column name or in the first cell of the dataframe. The most misleading part is that the characters do not print naturally when the dataframe or the column names are displayed in ipython, but the BOM is clearly kept behind that cell string, as pointed out in a previous comment above.

HTH
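The hypothesis above — the parser failing to advance past the three BOM bytes — can be sketched with a plain byte stream (illustrative only, not pandas internals):

```python
import codecs
import io

# Illustrative sketch: advance the stream past a UTF-8 BOM before
# handing it to a parser; rewind if no BOM is present.
stream = io.BytesIO(codecs.BOM_UTF8 + b'"name","team"\n')
head = stream.read(len(codecs.BOM_UTF8))
if head != codecs.BOM_UTF8:
    stream.seek(0)  # no BOM: start parsing from byte 0
rest = stream.read()
print(rest)  # prints b'"name","team"\n' -- parsing starts after the BOM
```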

With encoding 'utf-8-sig', the BOM is correctly skipped rather than prepending it to the first column label. However, as described by others, the quotes around the first label remain.

The zip archive at
https://www.dropbox.com/s/kcbh7fbsj9fwh13/sample.zip?dl=0

contains a script for demonstration as well as 2 csv files that differ only by having / not having a BOM. The one without BOM is parsed correctly, with BOM the first label remains quoted.

Python 3.5.1_x86, PD 18.1, Win7x64

A minimal example for future reference:

>>> from pandas.compat import BytesIO
>>> from pandas import read_csv
>>> import codecs
>>>
>>> BOM = codecs.BOM_UTF8
>>> data = '"name"\n"foo"'.encode('utf-8')
>>>
>>> read_csv(BytesIO(data), encoding='utf-8', engine='c')
# same result if engine='python'
  name
0  foo
>>>
>>> read_csv(BytesIO(BOM + data), encoding='utf-8', engine='c')
# same result if engine='python'
  "name"
0     foo

While I agree that there is a bug in the C engine, I don't believe the same can be said of the Python engine, as csv.reader (the foundation of the Python engine) cannot parse the BOM correctly (Python 3 shown; csv.reader takes decoded text there):

>>> from io import StringIO
>>> from csv import reader
>>> for row in reader(StringIO((BOM + data).decode('utf-8'))): print(row)
['\ufeff"name"']
['foo']

Since the Python engine failure is beyond our control, the question is then can this issue be closed if we can patch the C engine?

This is mainly an issue on Windows, where these BOM markers can easily end up in files, so a patch would be good if possible.

Sorry for my ignorance, not sure why this issue is closed.

I have a UTF-8 file with BOM. The (quoted) content is:

'"node_id";"name_en"\n'
'"";"English name"\n'
'"39252";"TEST USE CASE 1"\n'

x = pd.read_csv('UTF8withBOM.csv', encoding='utf-8-sig', delimiter=';', quotechar='"')
print(x)
  "node_id"          name_en
0       NaN     English name
1   39252.0  TEST USE CASE 1






x=pd.read_csv('UTF8withoutBOM.csv',encoding='utf-8',delimiter=";",quotechar='"')
print(x)
   node_id                                    name_en
0      NaN                           English name
1  39252.0  TEST USE CASE 1






pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None

The issue was closed in 0.19.0.

Try using a more recent version.

I'm using pandas.read_fwf in 0.19.0 and seeing a similar issue.

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.25.2
numpy: 1.11.2
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.3.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.7
blosc: None
bottleneck: 1.1.0
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: 1.1.5
pymysql: 0.7.9.None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: 2.39.0
pandas_datareader: None

Try with pandas 0.20.3.


Use sep='\t':
df = pd.read_csv('titanic.csv', sep='\t')

I am still experiencing this problem on Windows with Pandas 0.24.0.

(False alarm -- I was opening the file with argparse.)

I ran into this problem as well. I exported a table as a .csv file from DBeaver. There is an option to "Insert BOM" that was selected. The header was not read in properly by read_csv() (only the first column name was read). When I exported with "Insert BOM" un-selected read_csv() worked correctly.

I am running Python 3.7.2 with Pandas 0.23.4 on Linux Mint 19.1.

The bug still persists in 1.0.3 as of today. Replicated with all combinations of the Python UTF-8 encoding string, with or without hyphens, underscores, and the "sig" suffix. Stepping through some of the code showed that the encoding was stuck at code page 1252 for a very long time before the data was read as UTF-8. Is it not being set early enough?

  • OS: Windows 10 x64
  • Python: 3.7.4
  • Version: pandas 1.0.3, installed via pip 20.1.1

ETA: Nothing online appears to catch this but it appears this can be replicated (and solved) as follows:

import pandas as pd
...
with open(filename, encoding="xxx") as f_handle:
    data = pd.read_csv(f_handle, encoding="yyy")

It was not clear from the documentation that encoding xxx overrides encoding yyy when reading the column names, but not when reading the rows. This is important because not setting encoding xxx leads to the default value overwriting yyy. I might have missed it in the docs; if I haven't, this should be added as a note somewhere.
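The layering can be shown with an in-memory stream (a sketch; the handle's 'utf-8-sig' plays the role of encoding xxx):

```python
import io
import pandas as pd

# When read_csv receives an already-open *text* handle, the handle's own
# encoding has already decoded the bytes; at that point read_csv's
# encoding= argument can no longer affect how the BOM was handled.
raw = b'\xef\xbb\xbfa,b\n1,2\n'
handle = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig')
df = pd.read_csv(handle)  # BOM already stripped by the wrapper
print(list(df.columns))  # prints ['a', 'b']
```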
