Data.table: Implement comment.char argument in fread

Created on 3 Oct 2014 · 27 Comments · Source: Rdatatable/data.table

Similar to read.table.

feature request fread top request

All 27 comments

Need to ignore whole lines (starting with comment) as well as trailing comments after valid lines.

Bump. Needing this right now.

+1

Actually, fread seems to assume it already has this implemented, since it mentions comment.char in its warnings. This is a warning I recently saw (using the dev 1.10.5 version):

Warning in fread(file_x, skip = startind - 1L, header = TRUE, fill = TRUE) :
Stopped early on line 2. Expected 168 fields but found 235. Consider fill=TRUE and comment.char=.

Also, I'm not sure why this isn't an error. That makes it harder to catch with tryCatch.
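For what it's worth, tryCatch can intercept warnings as well as errors, so the "Stopped early" warning can still be caught. A minimal sketch in base R, where the helper name warnings_to_errors is hypothetical and not part of data.table:

```r
# Hypothetical helper: evaluate an expression and promote the first
# warning it raises to an error, so a single error handler covers both.
warnings_to_errors <- function(expr) {
  tryCatch(
    expr,  # lazily evaluated here, inside the tryCatch
    warning = function(w) stop(conditionMessage(w), call. = FALSE)
  )
}

# Usage with fread (file_x as in the call above):
# dt <- warnings_to_errors(data.table::fread(file_x, header = TRUE, fill = TRUE))
```

Note that evaluation stops at the first warning, so no partial result is returned.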

Interesting. Can you chase the comment and find an author and commit per git blame?
I am mostly using the CRAN version so I work around the issue (when I have to, which is not that often).

Thanks. One can click on git blame for that line, which yields ...

Better skip= and nrow= (#2623)

by Matt just one day ago (!!)

Bump

Bump, needing this in package development. Working with very large data with an unusual format. comment selection would be great.

@Berghopper consider using fread(cmd="grep pattern file.csv") until this feature is available.

@jangorecki I already made my own custom function for parsing and loading the file. Thanks anyway, though :).

Bump; would be super useful.

Also bumping. Even though fread(cmd='grep -v "#" table.csv') works fine in general, it's not cross-platform and makes your code a bit harder to read.

Bumping too. And if you want to skip only lines with a hash at the beginning, use cmd='grep -v "^#" table.csv'.
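A cross-platform alternative to the grep workaround is to filter the lines in R and hand them to fread through its text argument. This is only a sketch: the helper name fread_skip_comments and its comment_char parameter are hypothetical, not part of data.table, and it drops whole comment lines only (trailing comments after valid data are left untouched).

```r
library(data.table)

# Hypothetical helper, not part of data.table: read all lines, drop the
# ones whose first non-blank character is the comment character, then
# let fread parse the remaining lines.
fread_skip_comments <- function(path, comment_char = "#", ...) {
  lines <- readLines(path, warn = FALSE)
  keep <- !startsWith(trimws(lines, which = "left"), comment_char)
  fread(text = lines[keep], ...)
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("# header comment", "a,b", "1,2", "# mid comment", "3,4"), tmp)
dt <- fread_skip_comments(tmp)
```

Since readLines loads the whole file into memory first, this will be slower than fread reading the file directly, so it suits moderately sized files. readLines should also open gzip-compressed files transparently, which the grep command does not.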

Yeah bumping... Would be great to have that feature!

Hello, I have a problem with some comment lines too. I try to import my file with the command fread("Proteome_spodo.as.pfam31", skip='#'); an extract of the file is here: Proteome_spodo.as.pfam31.txt

fread returns this:
_ # This is free software; you can redistribute it and/or modify it under V14 V15
1: # the terms of the GNU General Public License as published by the Free Software
2: # Foundation; either version 2 of the License, or (at your option) any later version.
3: # This program is distributed in the hope that it will be useful, but WITHOUT
Warning messages:
1: In fread("Proteome_spodo.as.pfam31", :
Detected 13 column names but the data has 15 columns (i.e. invalid file). Added 2 extra default column names at the end.
2: In fread("Proteome_spodo.as.pfam31", :
Stopped early on line 14. Expected 15 fields but found 12. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS>>

I tried to use comment.char, but there is no such argument. Also, when I try without skip, the result is the same.

This is my session info:
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8
[6] LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.2 edgeR_3.26.0 limma_3.40.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.10 magrittr_1.5 tidyselect_0.2.5 munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38 R6_2.4.0 rlang_0.3.4
[10] plyr_1.8.4 dplyr_0.8.0.1 tools_3.6.0 grid_3.6.0 gtable_0.3.0 yaml_2.2.0 lazyeval_0.2.2 assertthat_0.2.1 tibble_2.1.1
[19] crayon_1.3.4 purrr_0.3.2 ggplot2_3.1.1 glue_1.3.1 compiler_3.6.0 pillar_1.3.1 scales_1.0.0 locfit_1.5-9.1 pkgconfig_2.0.2

Thank you

@ArthurPERE, from ?fread: skip="string" searches for "string" in the file (e.g. a substring of the column names row) and starts on that line, so I think you get the expected result.
You should consider using the following command:

fread("https://github.com/Rdatatable/data.table/files/3162861/Proteome_spodo.as.pfam31.txt", skip = 29, fill = TRUE)

with verbose = TRUE for more details, or the grep method mentioned above.
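To illustrate the skip="string" behaviour described above, here is a small sketch on a made-up file (the file contents and column names are invented for the example):

```r
library(data.table)

# A made-up file: two comment lines, then a header row and two data rows
tmp <- tempfile(fileext = ".txt")
writeLines(c("# licence line 1", "# licence line 2", "id,score", "a,1", "b,2"), tmp)

# skip = "id,score" searches for that substring and starts reading on the
# line where it is found; it does not drop lines containing the string.
dt <- fread(tmp, skip = "id,score")

# Equivalent here: skip the first two lines by count
dt2 <- fread(tmp, skip = 2)
```

Both calls start at the header row, which is why skip='#' in the earlier command simply started reading at the first line containing "#".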

@Atrebas I didn't see it like that; I thought it skipped every line containing the string. Now I understand why you would like to use a comment.char parameter.

Is the comment.char parameter being developed? Is that why we can't use it yet, even though it appears in the warning message?

Thank you for your reply.

@ArthurPERE it is best to use the documentation as the reference for defined behaviour. There you can also see that there is no such thing as a comment.char parameter. You can find the fread manual at https://rdatatable.gitlab.io/data.table/library/data.table/html/fread.html

AFAIR the status of this FR and any work on it are well reflected in the comments. Be sure to upvote this FR, which will likely speed up its implementation, or at least raise its priority. You are also welcome to submit a patch introducing this feature.

+1 for implementing this without the grep workaround (would be very useful!)

Another bump. This should absolutely be a standard feature without the grep workaround (which doesn't work on gzipped files, BTW). It seems silly to have to fall back on the much slower read.table ...

Bumping too. Would be nice to turn OFF the default skipping of lines beginning with "#", as with setting comment.char="".

This feature would also be useful to me for replacing usages of read.csv. I don't have the knowledge to comment on the above PR, though.

The lack of this option is quite surprising. I also got caught out by the "Stopped early... Consider fill=TRUE and comment.char=." message that very much suggested it was there somewhere. Would be great to have it.

Following up on this, I'm open to helping develop and/or test this functionality in a future update.

@mjsteinbaugh you are very welcome, please submit a PR.
