header 1, header 2
data 1, data 2
data 3, data 4, data 5
ParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
Workarounds such as usecols or setting error_bad_lines to False are inadequate because they drop data. Proposal: in read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
@joshjacobson : Thanks for your report! This is an interesting proposal, but I'm wary about adding this because I have very rarely seen such a use case + the signature for read_csv is pretty bloated as it is.
I should add that you can read the CSV in yourself with Python's open command and patch each incorrect line by adding additional delimiters to indicate missing fields.
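For illustration, a minimal sketch of what that patching could look like (assumptions: a comma delimiter, no quoted fields that themselves contain commas, and the placeholder filename some.csv):
# Read the raw lines, find the widest one, and pad the others with extra commas
with open('some.csv') as f:
    lines = [line.rstrip('\n') for line in f]

max_fields = max(line.count(',') + 1 for line in lines)

with open('some.csv', 'w') as f:
    for line in lines:
        missing = max_fields - (line.count(',') + 1)
        f.write(line + ',' * missing + '\n')
After this, every line has the same number of fields and pd.read_csv('some.csv') parses without error.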
There is another workaround that does preserve all data: specify header names manually that are long enough:
In [90]: data = """a,b
...: 1,2
...: 1,2,3"""
In [92]: pd.read_csv(StringIO(data))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
In [93]: pd.read_csv(StringIO(data), names=['a', 'b', 'c'])
Out[93]:
   a  b    c
0  a  b  NaN
1  1  2  NaN
2  1  2  3.0
@jorisvandenbossche : Ah, yes, that's a good point.
I suppose we could add this to the cookbook, e.g. how to handle CSV special cases. I agree no new options are needed; there are already way too many.
See also https://github.com/pandas-dev/pandas/issues/15122 for a general issue on improving the handling of 'bad' lines.
This exact case is not included in that list, though. Part of the reason is also that this feature would be very expensive: the parser would first have to check the length of every row, and then parse the file again with the appropriate number of columns.
But I think it would be a nice addition to this section of the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines (it already shows how to drop the extra values of a too-long row; we could re-use the same example to show how to keep those values instead).
Would you like to do a PR for that?
The comments here by @gfyoung and the workaround by @jorisvandenbossche seem to rely on this being a rare case. I'd like to push back on that belief, since this is an issue I run into quite frequently in my work (under circumstances that aren't rare and that make this issue especially likely). I estimate that over the past year I would have used the proposed expand_header about once every two weeks, on roughly a third of my data imports to pandas.
I believe there is a set of conditions that makes this especially difficult to resolve, and that can co-occur with some frequency:
Candidate source data in which I experience 'bad lines' of unpredictable width:
I run into these types of files from the following:
Updated status:
In the initial post, I specified why I find workarounds involving Excel, usecols, and error_bad_lines to be inadequate (encoding issues, the desire to preserve all data).
@gfyoung suggested editing the data after reading it in with Python's open command.
Under the circumstances I initially described (wanting to preserve all data, many data sets of uncertain width, and non-standard encoding), I believe this workflow would look something like this:
import csv
import pandas as pd

# First pass: find the widest row in the file
max_width = 0
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > max_width:
            max_width = len(row)

# Second pass: pad every shorter row with empty fields
amended_rows = []
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        row = row + [''] * (max_width - len(row))
        amended_rows.append(row)

# Rewrite the file with the padded rows, then read it with pandas
with open('some.csv', 'w', newline='', encoding='latin-1') as f:
    writer = csv.writer(f)
    writer.writerows(amended_rows)

pd.read_csv('some.csv', encoding='latin-1')
@jorisvandenbossche identified an additional workaround: specifying header names manually that are long enough. This workaround seems best suited for circumstances involving a single dataset in which the maximum width is known.
For the circumstances I describe, I believe the workflow would be:
import csv
import pandas as pd

# First pass: find the widest row in the file
max_width = 0
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > max_width:
            max_width = len(row)

# Generate one label per column so the names list covers the widest row
col_labels = ['column_' + str(col_num) for col_num in range(max_width)]

# With explicit names, no data is dropped; note that the file's own header
# line (if any) will show up as the first data row, as in the example above
pd.read_csv('some.csv', names=col_labels, encoding='latin-1')
Given that this issue may come up more often than initially communicated, and given the workflows above that provide solutions, it may be worth considering a few ways forward:
1. Accept that, while maybe not incredibly rare, this is still a rare enough event that no changes to pandas are warranted, and move forward as previously determined by adding this case to the docs.
2. Include in pandas some of the functionality needed to make the workarounds successful. For example, add a function that detects the maximum width of a row in a data file without importing it as a DataFrame (see the sketch after this list).
3. Eliminate the need for some of the steps in the workarounds. For example, allow the number of columns to be specified on import without providing names (avoiding the label generation for names).
4. [Initial proposal] In read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
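To make option 2 concrete, here is a minimal sketch of what such a helper could look like (the name max_row_width and its csv-based implementation are assumptions for illustration, not an existing pandas API):
import csv

def max_row_width(path, encoding=None, delimiter=','):
    # Largest number of fields found on any row, without building a DataFrame
    with open(path, newline='', encoding=encoding) as f:
        return max((len(row) for row in csv.reader(f, delimiter=delimiter)), default=0)
Such a helper could feed directly into the names workaround above, e.g. pd.read_csv('some.csv', names=['column_' + str(i) for i in range(max_row_width('some.csv'))]).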
@joshjacobson : Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward.
Good to know that you have encountered this issue with your datasets, but it's important to distinguish between data issues and read_csv being too strict. In this case, I think it falls more on the former than the latter, given your description of the data sources.
Also, I understand that you believe that this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up this issue.
Option 2 is not generalizable because we don't want to loosen our requirements on files being correctly formed. The responsibility should be on the user to make sure that their file is formatted correctly to be read into pandas.
Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.
Again, if more people bring up this issue, we can always revisit, but at this point, I would stick with option 1 as your way to go.