header 1, header 2
data 1, data 2
data 3, data 4, data 5
ParserError: Error tokenizing data. C error: Expected 2 fields in line 2, saw 3
Workarounds such as usecols or setting error_bad_lines to False are inadequate because they drop data. Proposal: in read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
@joshjacobson : Thanks for your report! This is an interesting proposal, but I'm wary about adding this because I have very rarely seen such a use case + the signature for read_csv is pretty bloated as it is.
I should add that you can read the CSV in yourself with Python's open command and patch each incorrect line by adding additional delimiters to indicate missing fields.
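For illustration, a minimal sketch of what that patching could look like (assumptions: a comma delimiter, no quoted fields that themselves contain commas, and the placeholder filename some.csv):
# Read the raw lines, find the widest one, and pad the others with extra commas
with open('some.csv') as f:
    lines = [line.rstrip('\n') for line in f]

max_fields = max(line.count(',') + 1 for line in lines)

with open('some.csv', 'w') as f:
    for line in lines:
        missing = max_fields - (line.count(',') + 1)
        f.write(line + ',' * missing + '\n')
After this, every line has the same number of fields and pd.read_csv('some.csv') parses without error.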
There is another workaround that does preserve all data: specify header names manually that are long enough:
In [90]: data = """a,b
...: 1,2
...: 1,2,3"""
In [92]: pd.read_csv(StringIO(data))
...
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
In [93]: pd.read_csv(StringIO(data), names=['a', 'b', 'c'])
Out[93]:
   a  b    c
0  a  b  NaN
1  1  2  NaN
2  1  2  3.0
@jorisvandenbossche : Ah, yes, that's a good point.
I suppose we could add this to the cookbook, e.g. how to handle CSV special cases. I agree no new options are needed; there are already way too many.
See also https://github.com/pandas-dev/pandas/issues/15122 for a general issue on improving the handling of 'bad' lines.
This exact case is not included in that list, though. Part of the reason is also that this feature would be very expensive: the parser would first have to check the length of every row, and then parse the file again with the appropriate number of columns.
But I think it would be a nice addition to this section of the docs: http://pandas.pydata.org/pandas-docs/stable/io.html#handling-bad-lines (it already shows how to drop the extra values of a too-long row; we could re-use the same example to show how to keep those values instead).
Would you like to do a PR for that?
The comments here by @gfyoung and the workaround by @jorisvandenbossche seem to rely on this being a rare case. I'd like to push back on that belief, since this is an issue I run into quite frequently in my work (under circumstances that aren't rare and that make this issue especially likely). I estimate that over the past year I would have used the proposed expand_header about once every two weeks, on roughly a third of my data imports to pandas.
I believe there is a set of conditions that makes this especially difficult to resolve, and that can co-occur with some frequency:
Candidate source data in which I experience 'bad lines' of unpredictable width:
I run into these types of files from the following:
Updated status:
In the initial post, I specified why I find workarounds involving Excel, usecols, and error_bad_lines to be inadequate (encoding issues, the desire to preserve all data).
@gfyoung suggested editing the data after reading it in with Python's open command.
Under the circumstances I initially described (wanting to preserve all data, many data sets of uncertain width, and non-standard encoding), I believe this workflow would look something like this:
import csv
import pandas as pd

# First pass: find the widest row in the file
max_width = 0
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > max_width:
            max_width = len(row)

# Second pass: pad every shorter row with empty fields
amended_rows = []
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        row = row + [''] * (max_width - len(row))
        amended_rows.append(row)

# Rewrite the file with the padded rows, then read it with pandas
with open('some.csv', 'w', newline='', encoding='latin-1') as f:
    writer = csv.writer(f)
    writer.writerows(amended_rows)

pd.read_csv('some.csv', encoding='latin-1')
@jorisvandenbossche identified an additional workaround: specifying header names manually that are long enough. This workaround seems best suited for circumstances involving a single dataset in which the maximum width is known.
For the circumstances I describe, I believe the workflow would be:
import csv
import pandas as pd

# First pass: find the widest row in the file
max_width = 0
with open('some.csv', newline='', encoding='latin-1') as f:
    reader = csv.reader(f)
    for row in reader:
        if len(row) > max_width:
            max_width = len(row)

# Generate one label per column so the names list covers the widest row
col_labels = ['column_' + str(col_num) for col_num in range(max_width)]

# With explicit names, no data is dropped; note that the file's own header
# line (if any) will show up as the first data row, as in the example above
pd.read_csv('some.csv', names=col_labels, encoding='latin-1')
Given that this issue may come up more often than initially communicated, and given the workflows above that provide solutions, it may be worth considering a few ways forward:
1. Accept that, while maybe not incredibly rare, this is still a rare enough event that no changes to pandas are warranted, and move forward as previously determined by adding this case to the docs.
2. Include in pandas some of the functionality needed to make the workarounds successful. For example, add a function that detects the maximum width of a row in a data file without importing it as a DataFrame (see the sketch after this list).
3. Eliminate the need for some of the steps in the workarounds. For example, allow the number of columns to be specified on import without providing names (avoiding the label generation for names).
4. [Initial proposal] In read_csv and read_table, add an expand_header option, which when set to True will set the width of the DataFrame to that of the widest row. Cells left without data due to this would be blank or NaN.
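To make option 2 concrete, here is a minimal sketch of what such a helper could look like (the name max_row_width and its csv-based implementation are assumptions for illustration, not an existing pandas API):
import csv

def max_row_width(path, encoding=None, delimiter=','):
    # Largest number of fields found on any row, without building a DataFrame
    with open(path, newline='', encoding=encoding) as f:
        return max((len(row) for row in csv.reader(f, delimiter=delimiter)), default=0)
Such a helper could feed directly into the names workaround above, e.g. pd.read_csv('some.csv', names=['column_' + str(i) for i in range(max_row_width('some.csv'))]).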
@joshjacobson : Thanks for your reply. I think I would agree with @jorisvandenbossche that your first option is the best way to move forward.
Good to know that you have encountered this issue with your datasets, but it's important to distinguish between data issues and read_csv being too strict. In this case, I think it falls more on the former than the latter, given your description of the data sources.
Also, I understand that you believe that this problem is prevalent given that you encounter it on so many occasions, but for us as maintainers to agree with that assessment, we generally need to see a good number of users independently bring up this issue.
Option 2 is not generalizable because we don't want to loosen our requirements on files being correctly formed. The responsibility should be on the user to make sure that their file is formatted correctly to be read into pandas.
Option 3 would likely be best integrated as another parameter, but we have too many parameters already. The same goes for Option 4.
Again, if more people bring up this issue, we can always revisit, but at this point, I would stick with option 1 as your way to go.