Data.table: fread skip doesn't get the header names.

Created on 26 Mar 2017 · 7 Comments · Source: Rdatatable/data.table

When you use fread with the _skip=_ option, the file is read skipping the first lines...
That's OK, but there is a small problem: the first line contains the header, so you end up with no column names in your data.table.
I work around the problem by calling fread twice: first reading only the first line and saving its content, then reading the file again, skipping as desired, and finally renaming the columns.
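The two-pass workaround described above might look roughly like the sketch below. The temporary file and its contents are made up for illustration; `nrows = 0L` is used to fetch only the column names, and `col.names` applies them in the second pass.

```r
library(data.table)

# Hypothetical sample file: header on line 1, followed by 10 data rows.
path <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:10, value = (1:10) * 10), path)

# Pass 1: read zero data rows, just to pick up the column names.
col_names <- names(fread(path, nrows = 0L))

# Pass 2: skip the header line plus the first 5 data rows,
# then attach the names saved in pass 1.
dt <- fread(path, skip = 6L, header = FALSE, col.names = col_names)
print(dt)
```

Passing `header = FALSE` in the second call keeps fread from guessing whether the first unskipped line holds names.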

I think it would be a good idea if fread could always read the first line and use it for the column names when you ask it to, regardless of the skip value.

enhancement fread

All 7 comments

This is consistent with other readers (e.g. read.csv/readLines) & in my experience typically the desired behavior -- most of the time I prefer to supply my own column names anyway, and usually when I need to skip in a file, the header is _not_ on the first line, but pushed down several lines.

I suppose an option along the lines of header = 1L might work (but could also break code relying on 1L evaluating to TRUE?).

Thanks @skanskan. Yes, I know what you mean and agree. A recent change in dev is that the skip= control determines which line the data starts on. Whether there are column names is now correctly determined by header=TRUE|FALSE|"auto", with default "auto". And the "auto" is more advanced now too.
Please check dev 1.10.5 as of now and raise a new issue if it's not as you want, including a small reproducible example too please.
@MichaelChirico If you could check the recent change to skip= too please, since you commented here.

I think current behavior is good. I notice header = 'auto' may not accomplish what was intended in this thread?

```r
# very nice
fread('# some metadata
# created by
# created date
# column types/YAML
X1,X2,X3,X4
1,2,3,4
5,6,7,8', skip = 4L)
#    X1 X2 X3 X4
# 1:  1  2  3  4
# 2:  5  6  7  8

# not as intended?
fread('X1,X2,X3,X4
gobbledygook
lorum ipsum
spaz typing
1,2,3,4
5,6,7,8', skip = 4L)
#    V1 V2 V3 V4  <- should be X1:4, no?
# 1:  1  2  3  4
# 2:  5  6  7  8
```

That output looks as intended; i.e., no attempt made (not now nor in future) to remove junk lines between the column names and the first data row. If column names are present, they must be on the line immediately before where the data rows start (well, other than blank lines, depending on blank.lines.skip). I'm only aware of files having banners above the column names.

If you view skipped rows as junk lines, this makes sense. However, if you use skipping rows as a way to save reading time, this does not make sense.

I have million-row data saved as CSV, with a timestamp in the first column serving as a sorted index. I read the first column first to locate the rows I need, and then I read the full data between row a and row b using skip and nrows.

I'm not saying which way is right or wrong, just presenting a valid use case for keeping the first row as the header row.
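For concreteness, the chunked-read use case above could be sketched as follows. The file, the column names `ts` and `x`, and the timestamp range are all hypothetical; the point is only how `select`, `skip` and `nrows` combine, and why the names have to be fetched separately.

```r
library(data.table)

# Hypothetical file: a sorted integer timestamp in column 1, a value in column 2.
path <- tempfile(fileext = ".csv")
fwrite(data.table(ts = seq(100L, 1000L, by = 100L), x = 1:10), path)

# Step 1: read only the first column to locate the wanted rows.
idx  <- fread(path, select = 1L)
rows <- which(idx$ts >= 300L & idx$ts <= 600L)
a <- rows[1L]              # first wanted data row
b <- rows[length(rows)]    # last wanted data row

# Step 2: read only data rows a..b. Line 1 is the header, so data row `a`
# sits on file line a + 1; skip = a skips the header and the a - 1 rows before it.
chunk <- fread(path, skip = a, nrows = b - a + 1L, header = FALSE,
               col.names = names(fread(path, nrows = 0L)))
print(chunk)
```

Without the `col.names` step the chunk would come back with default `V1`, `V2` names, which is the inconvenience being described.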

@jflycn To implement that reliably there should be two different arguments: one for skipping junk rows, and another for chunking. I recall @st-pasha recently explained why that matters.
I'm not sure whether there is a feature request for that already, but you can always create a new one if you cannot find an existing one.

Currently the header argument could be TRUE, FALSE, or "auto". Conceivably, we could extend it to also accept integers, so that for example header=1 would mean "the header is on the 1st line of the file". Similarly, header=3 would mean that the header is on the 3rd line, etc.

This would be independent of skipping, so that you can say header=1 and skip=1000000 to skip the first 1M lines, while still taking the column names from the first line.

This could be taken even further: header=1:2 could be used for files with multi-line headers. I seem to recall there was a request for this not too long ago.
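Until something like an integer `header=` exists, the behaviour proposed above can be approximated in user code. The helper below, `fread_header_at`, is hypothetical and not part of data.table; it also assumes a comma-separated header line with no quoted fields, since it splits that line with base R rather than with fread's own parser.

```r
library(data.table)

# Hypothetical helper (not a data.table function): take the column names from
# line `header_line`, and start reading data after `skip` lines.
fread_header_at <- function(path, header_line = 1L, skip = header_line, ...) {
  stopifnot(skip >= header_line)
  # Assumption: the header is plain comma-separated text without quoting.
  hdr <- readLines(path, n = header_line)[header_line]
  nms <- strsplit(hdr, ",", fixed = TRUE)[[1L]]
  fread(path, skip = skip, header = FALSE, col.names = nms, ...)
}

# Same layout as the "not as intended?" example earlier in the thread.
path <- tempfile(fileext = ".csv")
writeLines(c("X1,X2,X3,X4", "gobbledygook", "lorum ipsum", "spaz typing",
             "1,2,3,4", "5,6,7,8"), path)

dt <- fread_header_at(path, header_line = 1L, skip = 4L)
print(dt)
```

Keeping `header_line` and `skip` as separate arguments mirrors the idea that the header position should be independent of how many lines are skipped.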
