Pandas: skip rows after header leads to shifted header

Created on 24 Aug 2015 · 5Comments · Source: pandas-dev/pandas

Here's my file:

 ;;;;
 ;;;;
 ;;;;
 ;;;;
 ;;;;
 ;;;;
;;;;
 ;;;;
 ;;;;
;;;;
date;Param1;Param2;Param4;Param5
    ;m²;°C;m²;A
;;;;
01.01.1990 00:00;1;1;2;3
01.01.1990 01:00;5;3;4;5
01.01.1990 02:00;9;5;6;7
01.01.1990 03:00;13;7;8;9
01.01.1990 04:00;17;9;10;11
01.01.1990 05:00;21;11;12;13

If I read it with the following line:

 pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=0)

It gets the headers but then takes the two rows between header and data values.

date Param1 Param2 Param4 Param5
0                       m²     °C     m²      A
1               NaN    NaN    NaN    NaN    NaN
2  01.01.1990 00:00      1      1      2      3
3  01.01.1990 01:00      5      3      4      5
4  01.01.1990 02:00      9      5      6      7

If I use

 pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=2)

It simply moves the header down, i.e. same effect as saying

pd.read_csv(file, sep=';', header=12, parse_dates=True, skip_blank_lines=True, skiprows=0)

I think the skiprows needs to skip rows either before or after header.

But how do I skip after the header, only?

IO CSV Usage Question

Source

dacoex

Most helpful comment

No easy way to skip after the header. So either can drop those rows (though you lose type inference if you do that), or just read in the header then skip it the next time. Only read the rows you need so this will be performant.

In [53]: pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
Out[53]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')

In [54]: columns = pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns

In [55]: pd.read_csv(StringIO(data), sep=';', header=12, parse_dates=True, names=columns)
Out[55]: 
               date  Param1  Param2  Param4  Param5
0  01.01.1990 00:00       1       1       2       3
1  01.01.1990 01:00       5       3       4       5
2  01.01.1990 02:00       9       5       6       7
3  01.01.1990 03:00      13       7       8       9
4  01.01.1990 04:00      17       9      10      11
5  01.01.1990 05:00      21      11      12      13

jreback on 26 Aug 2015

👍2

All 5 comments

In [53]: pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
Out[53]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')

In [54]: columns = pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns

In [55]: pd.read_csv(StringIO(data), sep=';', header=12, parse_dates=True, names=columns)
Out[55]: 
               date  Param1  Param2  Param4  Param5
0  01.01.1990 00:00       1       1       2       3
1  01.01.1990 01:00       5       3       4       5
2  01.01.1990 02:00       9       5       6       7
3  01.01.1990 03:00      13       7       8       9
4  01.01.1990 04:00      17       9      10      11
5  01.01.1990 05:00      21      11      12      13

jreback on 26 Aug 2015

👍2

@jreback Thanks a lot. Really appreciated!

OK, so header is essentially like skiprows with the difference that the last skipped row is read as column headers as it says in the docs:
http://pandas-docs.github.io/pandas-docs-travis/io.html#io-read-csv-table
_header: row number(s) to use as the column names, and the start of the data._

Would it be worthwhile to write

_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows but does apply for skipping rows between header and data ._

instead

_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows_

dacoex on 26 Aug 2015

sorry u actually do this with skip rows as well it takes a list of rows to skip (which could include the row after the header)

jreback on 26 Aug 2015

For the sake of completeness, the solution goes like:

pd.read_csv(file, sep=';', skiprows=[11,12], index_col=0, parse_dates=True, header=10)

Shall I add this to the cookbook?

dacoex on 26 Aug 2015

actually that would be great - maybe showing both methods (and be sure to make it an inline example that one can copy paste)

jreback on 26 Aug 2015

Was this page helpful?

0 / 5 - 0 ratings

Related issues

AttributeError: Cannot use pandas from a script file

songololo · 3Comments

Cannot use apply on Series with Timestamp values

nathanielatom · 3Comments

Suffixes ignored on second merge

MatzeB · 3Comments

frame _apply_standard error when operating on 0 or NaN values

venuktan · 3Comments

Can't read csv using python pandas

Ashutosh-Srivastav · 3Comments