Here's my file:
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
;;;;
date;Param1;Param2;Param4;Param5
;m虏;掳C;m虏;A
;;;;
01.01.1990 00:00;1;1;2;3
01.01.1990 01:00;5;3;4;5
01.01.1990 02:00;9;5;6;7
01.01.1990 03:00;13;7;8;9
01.01.1990 04:00;17;9;10;11
01.01.1990 05:00;21;11;12;13
If I read it with the following line:
pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=0)
It gets the headers but then takes the two rows between header and data values.
date Param1 Param2 Param4 Param5
0 m虏 掳C m虏 A
1 NaN NaN NaN NaN NaN
2 01.01.1990 00:00 1 1 2 3
3 01.01.1990 01:00 5 3 4 5
4 01.01.1990 02:00 9 5 6 7
If I use
pd.read_csv(file, sep=';', header=10, parse_dates=True, skip_blank_lines=True, skiprows=2)
It simply moves the header down, i.e. same effect as saying
pd.read_csv(file, sep=';', header=12, parse_dates=True, skip_blank_lines=True, skiprows=0)
I think the skiprows needs to skip rows either before or after header.
But how do I skip after the header, only?
No easy way to skip after the header. So either can drop those rows (though you lose type inference if you do that), or just read in the header then skip it the next time. Only read the rows you need so this will be performant.
In [53]: pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
Out[53]: Index([u'date', u'Param1', u'Param2', u'Param4', u'Param5'], dtype='object')
In [54]: columns = pd.read_csv(StringIO(data), sep=';', header=10, parse_dates=True, nrows=10).columns
In [55]: pd.read_csv(StringIO(data), sep=';', header=12, parse_dates=True, names=columns)
Out[55]:
date Param1 Param2 Param4 Param5
0 01.01.1990 00:00 1 1 2 3
1 01.01.1990 01:00 5 3 4 5
2 01.01.1990 02:00 9 5 6 7
3 01.01.1990 03:00 13 7 8 9
4 01.01.1990 04:00 17 9 10 11
5 01.01.1990 05:00 21 11 12 13
@jreback Thanks a lot. Really appreciated!
OK, so header is essentially like skiprows with the difference that the last skipped row is read as column headers as it says in the docs:
http://pandas-docs.github.io/pandas-docs-travis/io.html#io-read-csv-table
_header: row number(s) to use as the column names, and the start of the data._
Would it be worthwhile to write
_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows but does apply for skipping rows between header and data ._
instead
_skiprows: A collection of numbers for rows in the file to skip. Can also be an integer to skip the first n rows_
sorry u actually do this with skip rows as well it takes a list of rows to skip (which could include the row after the header)
For the sake of completeness, the solution goes like:
pd.read_csv(file, sep=';', skiprows=[11,12], index_col=0, parse_dates=True, header=10)
Shall I add this to the cookbook?
actually that would be great - maybe showing both methods (and be sure to make it an inline example that one can copy paste)
Most helpful comment
No easy way to skip after the header. So either can drop those rows (though you lose type inference if you do that), or just read in the header then skip it the next time. Only read the rows you need so this will be performant.