Data.table: fread doesn't convert quoted numbers with colClasses

Created on 5 Jan 2016 · 8 Comments · Source: Rdatatable/data.table

Hi.

fread() doesn't recognize quoted numeric data. Code to reproduce:

df <- data.frame(A = as.character(1:5), B = letters[1:5], stringsAsFactors = FALSE)
str(df)
#> 'data.frame':    5 obs. of  2 variables:
#>  $ A: chr  "1" "2" "3" "4" ...
#>  $ B: chr  "a" "b" "c" "d" ...
tmp <- tempfile()
write.csv2(df, tmp, row.names = FALSE)
cat(readLines(tmp), sep = "\n")
#> "A";"B"
#> "1";"a"
#> "2";"b"
#> "3";"c"
#> "4";"d"
#> "5";"e"
str(fread(tmp))
#> Classes ‘data.table’ and 'data.frame':   5 obs. of  2 variables:
#>  $ A: chr  "1" "2" "3" "4" ...
#>  $ B: chr  "a" "b" "c" "d" ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
str(fread(tmp, colClasses = c("integer", "character")))
#> Classes ‘data.table’ and 'data.frame':   5 obs. of  2 variables:
#>  $ A: chr  "1" "2" "3" "4" ...
#>  $ B: chr  "a" "b" "c" "d" ...
#>  - attr(*, ".internal.selfref")=<externalptr> 
fread(tmp, colClasses = c("integer", "character"), verbose = TRUE)
#> Input contains no \n. Taking this to be a filename to open
#> File opened, filesize is 0.000000 GB.
#> Memory mapping ... ok
#> Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
#> Positioned on line 1 after skip or autostart
#> This line is the autostart and not blank so searching up for the last non-blank ... line 1
#> Detecting sep ... ';'
#> Detected 2 columns. Longest stretch was from line 1 to line 6
#> Starting data input on line 1 (either column names or first row of data). First 10 characters: "A";"B"
#> All the fields on line 1 are character fields. Treating as the column names.
#> Count of eol: 6 (including 1 at the end)
#> Count of sep: 5
#> nrow = MIN( nsep [5] / ncol [2] -1, neol [6] - nblank [1] ) = 5
#> Type codes (   first 5 rows): 44
#> Type codes: 44 (after applying colClasses and integer64)
#> Type codes: 44 (after applying drop or select (if supplied)
#> Allocating 2 column slots (2 - 0 dropped)
#> Read 5 rows. Exactly what was estimated and allocated up front
#>    0.000s ( 13%) Memory map (rerun may be quicker)
#>    0.000s (  7%) sep and header detection
#>    0.000s (  3%) Count rows (wc -l)
#>    0.000s ( 72%) Column type detection (first, middle and last 5 rows)
#>    0.000s (  2%) Allocation of 5x2 result (xMB) in RAM
#>    0.000s (  2%) Reading data
#>    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#>    0.000s (  0%) Coercing data already read in type bumps (if any)
#>    0.000s (  1%) Changing na.strings to NA
#>    0.000s        Total
#>    A B
#> 1: 1 a
#> 2: 2 b
#> 3: 3 c
#> 4: 4 d
#> 5: 5 e
#> Warning in 'fread(tmp, colClasses = c("integer", "character"), verbose = TRUE)':
#>   Column 1 ('A') has been detected as type 'character'. Ignoring request from colClasses to read as 'integer' (a lower type) since NAs (or loss of precision) may result.
feature request fread

All 8 comments

I don't see a reason for it to be default behavior to ignore quotation marks, even if being told by the user to treat it as an integer.

If anything, this is a feature request for something like a force option.

From ?fread::colClasses:

It won't downgrade a column to a lower type since NAs would result. You have to coerce such columns afterwards yourself, if you really require data loss.

I think if the user specifies the colClasses param, he knows what he is doing.

Then you are a more capable user than I. I for one have by mistake set colClasses incorrectly more times than I am happy to admit.

I'm much more satisfied getting a warning message than I would be finding later on that my data has been destroyed.

What's wrong with just fixing up ex-post?
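For reference, a minimal sketch of the ex-post fix, reusing the file layout from the original report. It reads both columns as character explicitly (so the result should not depend on how a given fread version detects quoted fields) and then converts column A by reference. This is one possible workaround, not something prescribed by the data.table documentation.

```r
library(data.table)

# Recreate the file from the original report: every field is quoted.
df <- data.frame(A = as.character(1:5), B = letters[1:5], stringsAsFactors = FALSE)
tmp <- tempfile()
write.csv2(df, tmp, row.names = FALSE)

# Read both columns as character, then coerce A to integer in place.
dt <- fread(tmp, sep = ";", colClasses = c("character", "character"))
dt[, A := as.integer(A)]
str(dt)
```

as.integer() returns NA (with a warning) for values it cannot parse, so it is worth checking anyNA(dt$A) afterwards if the column might contain non-numeric entries.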

dup of #572.

Argh.. wrong number in commit again!

I think if the user has set colClasses it should not be ignored - this is unintuitive behavior. It would be better to throw an error or warning if the function thinks the user is wrong. That said, I agree with artemklevtsov: the function should strip any quotes and attempt to read the column as an integer (as in his example).
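The "fail loudly instead of silently ignoring or producing NAs" behavior discussed here can be approximated after reading, outside of fread. The sketch below is base R only; as_integer_strict is a hypothetical helper name and is not part of data.table.

```r
# Hypothetical helper (not a data.table function): convert a character
# vector to integer only if every non-missing value is a whole number
# that fits in R's integer range; otherwise stop with an error.
as_integer_strict <- function(x) {
  stopifnot(is.character(x))
  out <- suppressWarnings(as.integer(x))
  # A value is bad if it is not missing and either does not look like a
  # plain integer (e.g. "1.5", "b") or overflowed to NA on conversion.
  bad <- !is.na(x) & (!grepl("^[+-]?[0-9]+$", x) | is.na(out))
  if (any(bad)) {
    stop("cannot convert to integer: ",
         paste(unique(x[bad]), collapse = ", "))
  }
  out
}

as_integer_strict(c("1", "2", "3"))
```

Calling as_integer_strict(c("1", "b")) raises an error rather than returning NA, which is the trade-off argued for above: the caller's request is honored when it is safe and rejected explicitly when it is not.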

Quoted integers can be read correctly as integers in the latest dev version.
