Pandas: skip rows after header leads to shifted header

Created on 24 Aug 2015  路  5Comments  路  Source: pandas-dev/pandas

Here's my file:

 ;;;;
 ;;;;
 ;;;;
 ;;;;
 ;;;;
 ;;;;
;;;;
 ;;;;
 ;;;;
;;;;
date;Param1;Param2;Param4;Param5
    ;m虏;掳C;m虏;A
;;;;
01.01.1990 00:00;1;1;2;3
01.01.1990 01:00;5;3;4;5
01.01.1990 02:00;9;5;6;7
01.01.1990 03:00;13;7;8;9
01.01.1990 04:00;17;9;10;11
01.01.1990 05:00;21;11;12;13

If I read it with the following line:

 pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=0)

It gets the headers but then takes the two rows between header and data values.

date Param1 Param2 Param4 Param5
0                       m虏     掳C     m虏      A
1               NaN    NaN    NaN    NaN    NaN
2  01.01.1990 00:00      1      1      2      3
3  01.01.1990 01:00      5      3      4      5
4  01.01.1990 02:00      9      5      6      7

If I use

 pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=2)

It simply moves the header down, i.e. same effect as saying

pd.read_csv(file, sep=';', header=12, parse_dates=True, skip_blank_lines=True, skiprows=0)

I think the skiprows needs to skip rows either before or after header.

But how do I skip after the header, only?

IO CSV Usage Question

Most helpful comment

No easy way to skip after the header. So either can drop those rows (though you lose type inference if you do that), or just read in the header then skip it the next time. Only read the rows you need so this will be performant.

In [53]: pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
Out[53]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')

In [54]: columns = pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns

In [55]: pd.read_csv(StringIO(data), sep=';', header=12, parse_dates=True, names=columns)
Out[55]: 
               date  Param1  Param2  Param4  Param5
0  01.01.1990 00:00       1       1       2       3
1  01.01.1990 01:00       5       3       4       5
2  01.01.1990 02:00       9       5       6       7
3  01.01.1990 03:00      13       7       8       9
4  01.01.1990 04:00      17       9      10      11
5  01.01.1990 05:00      21      11      12      13

All 5 comments

No easy way to skip after the header. So either can drop those rows (though you lose type inference if you do that), or just read in the header then skip it the next time. Only read the rows you need so this will be performant.

In [53]: pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
Out[53]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')

In [54]: columns = pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns

In [55]: pd.read_csv(StringIO(data), sep=';', header=12, parse_dates=True, names=columns)
Out[55]: 
               date  Param1  Param2  Param4  Param5
0  01.01.1990 00:00       1       1       2       3
1  01.01.1990 01:00       5       3       4       5
2  01.01.1990 02:00       9       5       6       7
3  01.01.1990 03:00      13       7       8       9
4  01.01.1990 04:00      17       9      10      11
5  01.01.1990 05:00      21      11      12      13

@jreback Thanks a lot. Really appreciated!

OK, so header is essentially like skiprows with the difference that the last skipped row is read as column headers as it says in the docs:
http://pandas-docs.github.io/pandas-docs-travis/io.html#io-read-csv-table
_header: row number(s) to use as the column names, and the start of the data._

Would it be worthwhile to write

_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows but does apply for skipping rows between header and data ._

instead

_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows_

sorry u actually do this with skip rows as well it takes a list of rows to skip (which could include the row after the header)

For the sake of completeness, the solution goes like:

pd.read_csv(file, sep=';', skiprows=[11,12], index_col=0, parse_dates=True, header=10)

Shall I add this to the cookbook?

actually that would be great - maybe showing both methods (and be sure to make it an inline example that one can copy paste)

Was this page helpful?
0 / 5 - 0 ratings

Related issues

songololo picture songololo  路  3Comments

nathanielatom picture nathanielatom  路  3Comments

MatzeB picture MatzeB  路  3Comments

venuktan picture venuktan  路  3Comments

Ashutosh-Srivastav picture Ashutosh-Srivastav  路  3Comments