Pandas: BUG: read_csv skipfooter fails with invalid quoted line

Created on 5 Apr 2017 · 13 Comments · Source: pandas-dev/pandas

Code Sample, a copy-pastable example if possible

import pandas as pd
from pandas.compat import StringIO

pd.read_csv(StringIO('''Date,Value
1/1/2012,100.00
1/2/2012,102.00
"a quoted junk row"morejunk'''),  skipfooter=1)

Out[21]
ERROR:root:An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 20))

---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
<ipython-input-34-d8dff6b9f4a7> in <module>()
      2 1/1/2012,100.00
      3 1/2/2012,102.00
----> 4 "a quoted junk row" '''),  skipfooter=1)

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, skip_footer, doublequote, delim_whitespace, as_recarray, compact_ints, use_unsigned, low_memory, buffer_lines, memory_map, float_precision)
    651                     skip_blank_lines=skip_blank_lines)
    652 
--> 653         return _read(filepath_or_buffer, kwds)
    654 
    655     parser_f.__name__ = name

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in _read(filepath_or_buffer, kwds)
    404 
    405     try:
--> 406         data = parser.read()
    407     finally:
    408         parser.close()

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in read(self, nrows)
    977                 raise ValueError('skipfooter not supported for iteration')
    978 
--> 979         ret = self._engine.read(nrows)
    980 
    981         if self.options.get('as_recarray'):

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in read(self, rows)
   2066     def read(self, rows=None):
   2067         try:
-> 2068             content = self._get_lines(rows)
   2069         except StopIteration:
   2070             if self._first_chunk:

C:\Users\chris.bartak\Documents\python-dev\pandas\pandas\io\parsers.py in _get_lines(self, rows)
   2717                         while True:
   2718                             try:
-> 2719                                 new_rows.append(next(source))
   2720                                 rows += 1
   2721                             except csv.Error as inst:

Error: ',' expected after '"'

Problem description

This error only happens if the last row is quoted and invalid. For example, delete the morejunk above and it no longer errors.
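
Until this is fixed, one possible workaround is to trim the footer from the raw text before any parser sees it. The sketch below is only an illustration: `drop_footer` is a hypothetical helper, not a pandas function, and it assumes the footer is exactly the last `n` physical lines. It uses the standard library's `csv.reader` with `strict=True` as a stand-in for a tokenizer that rejects the junk row.

```python
import csv
from io import StringIO


def drop_footer(text, n):
    """Return `text` without its last `n` physical lines (hypothetical helper)."""
    if n <= 0:
        return text
    lines = text.splitlines(keepends=True)
    return "".join(lines[:-n])


raw = '''Date,Value
1/1/2012,100.00
1/2/2012,102.00
"a quoted junk row"morejunk'''

# Remove the junk footer before tokenizing, so it is never parsed at all.
trimmed = drop_footer(raw, 1)

# strict=True is used here so that the junk row would raise if it were still present.
rows = list(csv.reader(StringIO(trimmed), strict=True))
print(rows)
```

The trimmed string could then be wrapped in `StringIO` and passed to `pd.read_csv` without `skipfooter`.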

Expected Output

successful parse

pandas 0.19.2

Bug Error Reporting IO CSV

chris-b1

All 13 comments

Hmm, I guess this is the same as #13879, although the PR to improve the error message doesn't seem to have caught this case. cc @gfyoung

chris-b1 on 5 Apr 2017

@chris-b1 : Could you post the full stack trace? I presume that the error message is coming from Python's csv library, but I would like to double check (no access to a computer ATM).

gfyoung on 5 Apr 2017

yep, edited in the top comment

chris-b1 on 5 Apr 2017

Awesome. Yep, I think your diagnosis is correct. I can quickly patch that.

gfyoung on 5 Apr 2017

Deeper analysis indicates that you can successfully parse this with the C engine on master:
~~~python
pd.read_csv(StringIO('''Date,Value
1/1/2012,100.00
1/2/2012,102.00
"a quoted junk row"morejunk'''))

                        Date  Value
0                   1/1/2012  100.0
1                   1/2/2012  102.0
2  a quoted junk rowmorejunk    NaN
~~~
However, the Python engine cannot read this (with or without the skipfooter argument). I'm not sure why the Python engine would complain about this, since the C engine's parse seems correct.

@chris-b1 : What do you think?

gfyoung on 5 Apr 2017

Here's a simpler example that we can use:
~~~python
>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='c')
    a
0   1
1  ab
>>>
>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: ',' expected after '"'
>>>
>>> read_csv(StringIO(data), engine='python', skipfooter=1)
...
_csv.Error: ',' expected after '"'
~~~

gfyoung on 5 Apr 2017

This inconsistency notwithstanding, it would still be worthwhile to properly catch errors there at that try-except block. A PR can go up for that at the very least.
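
As a rough illustration of catching the tokenizer error with more context, here is a minimal standard-library sketch. It is not the pandas implementation: `read_rows` and the wording of its message are hypothetical, and it only assumes that the underlying failure is a `csv.Error` raised while advancing a `csv.reader`.

```python
import csv
from io import StringIO


def read_rows(buffer, **reader_kwargs):
    """Yield rows from csv.reader, re-raising tokenizer errors with context.

    Hypothetical helper for illustration only; not pandas code.
    """
    reader = csv.reader(buffer, **reader_kwargs)
    while True:
        try:
            row = next(reader)
        except StopIteration:
            # Input exhausted: end the generator cleanly.
            return
        except csv.Error as err:
            # reader.line_num is the number of source lines read so far.
            raise ValueError(
                "Error tokenizing data near line {}: {}".format(
                    reader.line_num, err
                )
            ) from err
        yield row


data = 'a\n1\n"a"b'
parsed = []
message = None
try:
    for row in read_rows(StringIO(data), strict=True):
        parsed.append(row)
except ValueError as err:
    message = str(err)

print(parsed)
print(message)
```

The valid rows are still yielded before the failure, and the caller gets a single exception type that names roughly where parsing stopped.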

gfyoung on 5 Apr 2017

Yeah, it does seem like that should parse. The built-in csv reader doesn't complain:

import csv
data = 'a\n1\n"a"b'
list(csv.reader(StringIO(data)))

Out[16]: [['a'], ['1'], ['ab']]

chris-b1 on 5 Apr 2017

Oh, interesting... does your original example work with csv.reader(StringIO(...))? Maybe try passing strict=True to csv.reader as well?

gfyoung on 5 Apr 2017

It does using defaults, but not with strict=True

chris-b1 on 5 Apr 2017

Ah, that's the reason then. Hmmm... seems like we wouldn't consider that malformed, though. Well, since we can't "fix" the Python parser, I think we can at least add a test.
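
The difference discussed above can be reproduced with the standard library alone. This is a small sketch using the simplified data from earlier in the thread; it shows only `csv.reader` behavior and makes no claim about how pandas configures its reader internally.

```python
import csv
from io import StringIO

data = 'a\n1\n"a"b'

# Default dialect: text after the closing quote is appended to the field.
lenient = list(csv.reader(StringIO(data)))
print(lenient)  # [['a'], ['1'], ['ab']]

# strict=True: the same input raises csv.Error instead.
strict_error = None
try:
    list(csv.reader(StringIO(data), strict=True))
except csv.Error as err:
    strict_error = str(err)
print(strict_error)
```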

gfyoung on 5 Apr 2017

Actually, here's a "fix" (it just goes to show how broken regex splitting in the Python engine is):
~~~python
>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='python', sep='pandas')
    a
0   1
1  ab
~~~

gfyoung on 6 Apr 2017

Here's a simpler example that we can use:

>>> data = 'a\n1\n"a"b'
>>> read_csv(StringIO(data), engine='c')
    a
0   1
1  ab
>>>
>>> read_csv(StringIO(data), engine='python')
...
_csv.Error: ',' expected after '"'
>>>
>>> read_csv(StringIO(data), engine='python', skipfooter=1)
...
_csv.Error: ',' expected after '"'

engine='c' does the job for me. Finally got my task working after a huge but simple hurdle.

Thank you!