Pandas: read_csv() & EOF character in string cause parsing issue

Created on 12 Nov 2013 · 20 Comments · Source: pandas-dev/pandas

While importing large text files using read_csv, we occasionally get an EOF (End of File) character within a string, which causes an exception: "Error tokenizing data. C error: EOF inside string starting at line 844863". This occurs even with "error_bad_lines = False".

Further, the line stated in the error message is not the line containing the EOF character. In this particular case the actual row was approx. 230 rows before the one stated, which hinders exception handling. (I now see this difference was caused by other "bad_lines" that were being skipped: the quoted error line is correct, but fewer rows were imported.)

I feel it would be appropriate if "error_bad_lines = False" handled this exception and allowed such rows to be skipped.

I note that when importing this text file into Excel, the "premature" EOF is simply ignored.

We are running on Windows 8, with Python version 2.7 and pandas version 0.12.

Bug IO CSV

stephenjshaw

All 20 comments

Further investigation using a hex editor has revealed what is going on:

I added 0x1A ("EOF") to a different file and it did not cause any problems. Pandas read_csv imported it without error.
I parsed every line of the problematic CSV individually, until I isolated the one causing the problem. It was over 3000 rows after the stated row number in the error message.
The row in question had a column with a double quote mark following the delimiter. There were not supposed to be any quote marks in the file, and there was no second double quote in the column or anywhere else on the row.
I think the quote mark caused the import to look for a second, terminating double quote, ignoring column delimiters and end-of-line markers until it reached the end of the file. When it didn't find one before the end of the file, I'm speculating that this triggered the "EOF inside string" error message.
I was able to work around the problem by setting the quotechar to be the same as the delimiter, which tells read_csv to ignore all quotes. It now imports the file perfectly.
I still think "error_bad_lines" should catch this by checking whether any row contains a column with a missing terminating quote.
One reason I think this is important is that by adding a second such double quote, many lines apart, I was able to "fool" the system into skipping all the intervening lines, even though only two rows had an error.

stephenjshaw on 15 Nov 2013

👍10
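
The quote-swallowing effect described above can be reproduced in a few lines. This is a minimal sketch on made-up data, assuming Python 3 and a reasonably recent pandas (the thread itself is about Python 2.7 and pandas 0.12). It uses quoting=csv.QUOTE_NONE, which is suggested further down the thread, rather than the quotechar trick.

```python
import csv
from io import StringIO

import pandas as pd

# Made-up data: no quoting is intended, but the 3rd and 5th lines
# each contain one stray double quote.
data = 'a,b\n1,2\n3,"4\n5,6\n7,"8\n9,10\n'

# Default quoting: the two stray quotes pair up, so everything between
# them (including the newlines) is folded into a single field.
default = pd.read_csv(StringIO(data))

# QUOTE_NONE: quote characters are treated as ordinary data.
literal = pd.read_csv(StringIO(data), quoting=csv.QUOTE_NONE)

print(len(default), len(literal))
```

With default quoting no error is raised, yet fewer rows come back than the file has data lines. With QUOTE_NONE every line becomes its own row and the stray quotes remain in the values.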

@stephenjshaw are you able to try this on the master branch?

guyrt on 19 Dec 2013

On pandas 0.13.1, I had the exact same problem and solution.

rcompton on 20 Mar 2014

I am having the same issue and cannot find any offending characters in the lines near the line number given. Is there some way to search for weird characters given I have no clue where the issue is?

DataJunkie on 16 Apr 2014
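
One rough way to hunt for the offending lines, given the diagnosis above, is to scan for physical lines that contain an odd number of quote characters. The helper below is hypothetical (it is not part of pandas), and it will also flag legitimate quoted fields that span several lines, so treat its output as a list of candidates only.

```python
def find_odd_quote_lines(lines, quotechar='"'):
    """Return the 1-based numbers of lines with an odd count of quotechar."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if line.count(quotechar) % 2 == 1
    ]


# Made-up sample: only the second line has an unbalanced quote.
sample = ['a\tb\tc\n', '1\t"2\t3\n', '4\t5\t6\n', '7\t"8"\t9\n']
print(find_odd_quote_lines(sample))  # [2]
```

For a real file, an open file object can be passed instead of a list, since the helper only iterates over lines.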

Another user seeing this (or something similar): http://stackoverflow.com/q/24005761/1240268

hayd on 3 Jun 2014

Note that, according to the documentation at http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html , error_bad_lines=False only means that lines with too many fields will be skipped. If there are other problems with your data, .read_csv() will fail rather than skip the problematic lines, cf. https://github.com/pydata/pandas/issues/6478 and https://stackoverflow.com/questions/22026181/pandas-warn-bad-lines-false-and-error-bad-lines-false-is-still-trying-to-parse-b

rcompton on 3 Jun 2014
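
The distinction above can be sketched as follows. This assumes pandas 1.3 or later, where on_bad_lines='skip' replaced error_bad_lines=False; the data is made up for illustration.

```python
from io import StringIO

import pandas as pd

# Made-up data: the third line has one field too many.
too_many_fields = 'a,b\n1,2\n3,4,5\n6,7\n'
skipped = pd.read_csv(StringIO(too_many_fields), on_bad_lines='skip')
print(len(skipped))  # 2

# Made-up data: the third line opens a quote that is never closed.
unterminated = 'a,b\n1,2\n3,"4\n5,6\n'
try:
    pd.read_csv(StringIO(unterminated), on_bad_lines='skip')
    raised = False
except pd.errors.ParserError as exc:
    # Skipping bad lines does not help here: the tokenizer fails at EOF.
    raised = True
    print(exc)
```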

I don't think this is that hard to fix. Essentially the low-level reader returns on EOF, but it is simple enough to check whether that's actually the end of the file by reading again; if not, we can just ignore it, I think (or remove that line).

Does anyone have a couple of test cases (e.g. we need EOF inside a quote and outside)? I can generate them, but maybe @stephenjshaw already has a bit of code to do this?

jreback on 3 Jun 2014

Met the same problem here and solved it with @stephenjshaw's solution.

yyl on 11 Jun 2014

With no examples to really draw from, I created my own here for future reference, but I get no errors:

>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n1\x1a,2'  # note the EOF in the middle of the last line
>>> read_csv(StringIO(data), engine='c')
   a  b
0  1  2
>>> read_csv(StringIO(data), engine='python')
   a  b
0  1  2

gfyoung on 2 Aug 2016

@jreback : In light of my examples above, IMO this is no longer an issue. Perhaps some tests?

gfyoung on 22 Aug 2016

Yes, I think your EOF PR closed this.
Can you add that issue number here?

jreback on 22 Aug 2016

IIRC I didn't do a PR for the EOF (it was the NULL char and BOM). I can add tests though for this.

gfyoung on 22 Aug 2016

Oh right, OK then.

jreback on 22 Aug 2016

I know I'm 4 years late to this issue, but I just encountered this bug again.
I don't think this bug is actually caused by an EOF character inside a row of the CSV.
To reproduce the bug:

In [1]: import pandas as pd

In [2]: from StringIO import StringIO

In [3]: test_csv = '"a\tb\tc\n1\t2\t3'

In [4]: pd.read_csv(StringIO(test_csv), delimiter='\t')

ParserError: Error tokenizing data. C error: EOF inside string starting at line 0

In [6]: test_csv = test_csv.translate(None, '"')

In [7]: pd.read_csv(StringIO(test_csv), delimiter='\t')
Out[7]:
   a  b  c
0  1  2  3

When I was loading a large CSV using pandas, the error message told me to look at line 853, which is a perfectly valid line.
The actual problem is thousands of lines away.

I'm on macOS 10.12.6, with the Python 2.7 Anaconda build and pandas version 0.21. This bug also exists in pandas version 0.20. I'm not sure about earlier versions, but it probably exists in all of them.

patrickwang96 on 23 Nov 2017

@patrickwang96 : Look at your CSV string. It's malformed with that unbalanced quotation mark. The error is to be expected.

gfyoung on 23 Nov 2017

👍1

It's not always possible to have a perfect CSV file. Where loading the file matters more than getting every row of data, it would be good if error_bad_lines did what's expected. I found that adding quoting=csv.QUOTE_NONE fixed my issue (as mentioned here: https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5)

morganics on 23 Jan 2018

@morganics : It would, except that pandas doesn't know where the line ends and begins in this case. You can't handle a bad line if you can't deduce where it begins or ends unfortunately.

gfyoung on 23 Jan 2018
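
To illustrate why the line boundaries are lost, here is a small sketch using Python's standard-library csv reader instead of pandas (a deliberate substitution: it is a different parser, and in its default non-strict mode it does not raise at end of input). The data is made up.

```python
import csv
from io import StringIO

# Made-up data: the second line opens a quote that is never closed.
data = 'a,b\n1,"2\n3,4\n'

rows = list(csv.reader(StringIO(data)))
print(rows)  # [['a', 'b'], ['1', '2\n3,4\n']]
```

Everything after the opening quote, including the following line, ends up inside one field, so there is no longer a separate "bad line" that a parser could identify and skip.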

I processed the exact same CSV file twice. One time it failed and the next time it did not. There must be some sort of race / memory condition causing this? Fun ;-)

edrossy on 27 Sep 2018

👍2

Of the two, probably a memory condition. We don't have any concurrency for CSV parsing. Fun, indeed 😉

gfyoung on 29 Sep 2018

For the reason I pointed out in my answer to this question:
https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5/53173373#53173373
I would suggest making quoting=csv.QUOTE_NONE the default instead of csv.QUOTE_MINIMAL.
It's easier to realise what's going on when your strings are unexpectedly parsed with the quotechars left in than to get an error when there's an odd number of quotechars, or no error but unexpected parsing when there's an even number of quotechars.