While importing large text files using read_csv we occasionally get an EOF (End of File) character within a string, which causes an exception: "Error tokenizing data. C error: EOF inside string starting at line 844863". This occurs even with "error_bad_lines = False".
Further, the line stated in the error message is not the line containing the EOF character. In this particular case the actual row was approx. 230 rows before the one stated, which hinders exception handling. (I now see this difference was caused by other "bad_lines" that were being skipped: the quoted error line is correct, but fewer rows had been imported.)
I feel it would be appropriate if "error_bad_lines = False" handled this exception and allowed such rows to be skipped.
I note that when importing this text file into Excel, the "premature" EOF is simply ignored.
We are running on Windows 8, with Python 2.7 and pandas 0.12.
Further investigation using a hex editor has revealed what is going on:
@stephenjshaw are you able to try this on the master branch?
On pandas 0.13.1, I had the exact same problem and solution.
I am having the same issue and cannot find any offending characters in the lines near the line number given. Is there some way to search for weird characters given I have no clue where the issue is?
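One way to hunt for such characters is to scan the raw bytes yourself. The following is only a minimal sketch, not anything pandas provides: `find_suspect_bytes` and `SUSPECT_BYTES` are hypothetical names, and the choice of which bytes count as "weird" (NUL and 0x1A, the DOS Ctrl-Z end-of-file marker) is an assumption you may want to widen.

```python
import io

# Assumption: these are the bytes worth flagging; extend the mapping as needed.
SUSPECT_BYTES = {0x00: "NUL", 0x1A: "SUB (Ctrl-Z, DOS EOF)"}


def find_suspect_bytes(stream, suspects=SUSPECT_BYTES):
    """Yield (line_number, column, label) for each suspect byte in a binary stream.

    Line numbers are 1-based and count physical newlines; columns are
    0-based byte offsets within the line.
    """
    for line_number, line in enumerate(stream, start=1):
        for column, byte in enumerate(bytearray(line)):
            if byte in suspects:
                yield line_number, column, suspects[byte]


# In-memory stand-in for a real file with a Ctrl-Z in the second line.
sample = io.BytesIO(b"a,b\n1\x1a,2\n3,4\n")
hits = list(find_suspect_bytes(sample))
print(hits)
```

For a real file, pass `open(path, "rb")` instead of the `BytesIO` object, so that the bytes are read unmodified. The line numbers reported here count physical lines, so they can differ from the number in the pandas error message if earlier rows were skipped or quoted fields span several lines.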
another user seeing this/similar: http://stackoverflow.com/q/24005761/1240268
Note that, according to the documentation at http://pandas.pydata.org/pandas-docs/version/0.13.1/generated/pandas.io.parsers.read_csv.html , error_bad_lines=False only means that lines with too many fields will be skipped. If there are other problems with your data, .read_csv() will fail rather than skip the problematic lines, cf. https://github.com/pydata/pandas/issues/6478 and https://stackoverflow.com/questions/22026181/pandas-warn-bad-lines-false-and-error-bad-lines-false-is-still-trying-to-parse-b
I don't think this is that hard to fix (essentially the low-level reader returns on EOF, but simple enough to check if that's actually the end of the file by reading again, if not, then can just ignore I think / remove that line).
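Until something like that exists in the reader, the same idea can be approximated on the user side by cleaning the bytes before parsing. This is a rough sketch under my own assumptions, not the pandas implementation: `drop_embedded_sub` is a hypothetical helper, and it drops the stray character rather than the whole line.

```python
def drop_embedded_sub(data):
    """Remove 0x1A (Ctrl-Z) bytes, except one that terminates the data.

    A trailing Ctrl-Z is kept because it may be a genuine DOS end-of-file
    marker; any other occurrence is treated as stray and removed.
    """
    if data.endswith(b"\x1a"):
        body, tail = data[:-1], data[-1:]
    else:
        body, tail = data, b""
    return body.replace(b"\x1a", b"") + tail


cleaned = drop_embedded_sub(b"a,b\n1\x1a,2\n")
print(cleaned)
```

The cleaned bytes can then be wrapped in `io.BytesIO` and handed to `read_csv`. Note that this loads the whole file into memory, which may not suit very large inputs.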
Does anyone have a couple of test cases (e.g. we need EOF inside a quote and outside)? I can generate them, but maybe @stephenjshaw already has a bit of code to do this?
Met the same problem here and solved it with @stephenjshaw's solution.
With no examples to really draw from, I created my own here for future reference, but I get no errors:
>>> from pandas import read_csv
>>> from pandas.compat import StringIO
>>>
>>> data = 'a,b\n1\x1a,2' # note the EOF in the middle of the last line
>>> read_csv(StringIO(data), engine='c')
a b
0 1 2
>>> read_csv(StringIO(data), engine='python')
a b
0 1 2
@jreback : In light of my examples above, IMO this is no longer an issue. Perhaps some tests?
yes I think your EOF PR closed this
can u add that issue number here
IIRC I didn't do a PR for the EOF (it was the NULL char and BOM). I can add tests though for this.
oh right ok then
I know I'm four years late to this issue, but I just encountered this bug again.
I don't think this bug is actually caused by an EOF character inside a row of the CSV.
To reproduce the bug:
In [1]: import pandas as pd
In [2]: from StringIO import StringIO
In [3]: test_csv = '"a\tb\tc\n1\t2\t3'
In [4]: pd.read_csv(StringIO(test_csv), delimiter='\t')
ParserError: Error tokenizing data. C error: EOF inside string starting at line 0
In [6]: test_csv = test_csv.translate(None, '"')
In [7]: pd.read_csv(StringIO(test_csv), delimiter='\t')
Out[7]:
a b c
0 1 2 3
In [8]:
When I was loading a large CSV using pandas, the error message told me to look at line 853, which is a perfectly valid line.
The offending line is actually thousands of lines behind.
I'm on macOS 10.12.6, with the Anaconda build of Python 2.7 and pandas 0.21. This bug also exists in pandas 0.20. I'm not sure about earlier versions, but it probably exists in all of them.
@patrickwang96 : Look at your CSV string. It's malformed with that unbalanced quotation mark. The error is to be expected.
It's not always possible to have a perfect CSV file, so where it's more important to get the file loaded than to get all of the data, it would be good if error_bad_lines did what's expected. I found that adding quoting=csv.QUOTE_NONE fixed my issue (as mentioned here: https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5)
@morganics : It would, except that pandas doesn't know where the line ends and begins in this case. You can't handle a bad line if you can't deduce where it begins or ends, unfortunately.
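Since the parser cannot say where the bad line starts, one option is to look for candidate lines yourself before parsing. The following is only a heuristic sketch: `lines_with_odd_quotes` is a hypothetical helper, and it assumes that quoted fields do not legitimately span several lines (a valid multi-line field would be flagged on both its first and last line).

```python
import io


def lines_with_odd_quotes(lines, quotechar='"'):
    """Return 1-based numbers of lines containing an odd number of quotechar.

    Doubled quotes ("") used as an escape keep the count even, so they
    are not flagged.
    """
    return [number
            for number, line in enumerate(lines, start=1)
            if line.count(quotechar) % 2 == 1]


# Line 1 has an unbalanced quote; line 3 has a properly quoted field.
sample = io.StringIO(u'"a\tb\tc\n1\t2\t3\n4\t"x"\t6\n')
suspects = lines_with_odd_quotes(sample)
print(suspects)
```

Any iterable of text lines works here, including an open file object. The first flagged line is usually the place to start looking, even when the error message points somewhere else.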
I processed the exact same CSV file twice. One time it failed and the next time it did not. There must be some sort of race / memory condition causing this? Fun ;-)
Of the two, probably memory condition. We don't have any concurrency for CSV parsing. Fun, indeed 😉
For the reason I pointed out in my answer to this question:
https://stackoverflow.com/questions/18016037/pandas-parsererror-eof-character-when-reading-multiple-csv-files-to-hdf5/53173373#53173373
I would suggest making quoting=csv.QUOTE_NONE the default instead of csv.QUOTE_MINIMAL.
It's easier to realise what's going on when your strings are unexpectedly parsed with quotechars than to get an error when there's an odd number of quotechars, or no error but unexpected parsing when there's an even number of quotechars.
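The difference between the two quoting modes can be illustrated with Python's standard `csv` module, which uses the same constants. This is a sketch for Python 3 and a stand-in only: the stdlib reader is not the pandas C tokenizer, and in its default non-strict mode it silently accepts the unterminated quote where pandas raises "EOF inside string". The `parse` helper is a hypothetical name.

```python
import csv
import io

# Same malformed input as in the reproduction above: an opening quote
# that is never closed.
data = u'"a\tb\tc\n1\t2\t3'


def parse(text, quoting):
    """Parse tab-separated text with the given csv quoting mode."""
    reader = csv.reader(io.StringIO(text), delimiter='\t', quoting=quoting)
    return list(reader)


# QUOTE_MINIMAL: the quote opens a field that swallows the rest of the input.
minimal_rows = parse(data, csv.QUOTE_MINIMAL)

# QUOTE_NONE: the quote is an ordinary character, so rows split as written.
none_rows = parse(data, csv.QUOTE_NONE)

print(len(minimal_rows), minimal_rows)
print(len(none_rows), none_rows)
```

With `QUOTE_NONE` the stray quote stays visible in the first cell, which makes the problem easy to spot. The trade-off is that genuinely quoted fields containing the delimiter will then be split incorrectly.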