Data.table: na.strings="0" is not permitted in fread

Created on 11 Jun 2018 · 12 Comments · Source: Rdatatable/data.table

When I call fread with the following na.strings, I get this error: NAstring <<0>> is recognized as type boolean, this is not permitted.
na.strings = c("", "-", "_", "..", "...", "--", "*", "*" ,
"n/a", "n.a.", "#VALUE!", "0", "Inf", "-Inf", "NAN", "r", "e")

If I remove "0" from na.strings, fread runs without error.

However, cells that contain "r" or "e" are not converted to NA, and their columns are read as character.

Please advise.

fread

msgoussi

Most helpful comment

Agreed @shrektan. Nor na.strings = "1" though I guess that's a pretty obscure use case.

MichaelChirico on 13 Oct 2020

👍2

All 12 comments

Please include your data.table package version & system info, and the file itself, if you can.

MichaelChirico on 11 Jun 2018

data.table::fread("a,b,c\n0,1,2", na.strings = "0", verbose = TRUE)
#> Input contains a \n or is "". Taking this to be text input (not a filename)
#> [01] Check arguments
#>   Using 12 threads (omp_get_max_threads()=12, nth=12)
#> Error in data.table::fread("a,b,c\n0,1,2", na.strings = "0", verbose = TRUE):
#>   freadMain: NAstring <<0>> is recognized as type boolean, this is not permitted.

HughParsonage on 11 Jun 2018

👍2

Sys.info()
          sysname           release           version          nodename
        "Windows"        ">= 8 x64"      "build 9200" "IRT-310677-Z440"
          machine             login              user    effective_user
         "x86-64"          "310677"          "310677"          "310677"

data.table package: 1.11.4
The file is huge, almost 1.44 GB.

msgoussi on 11 Jun 2018

Thanks @HughParsonage, darn that looks no bueno.

I guess there should be some interaction with logical01 but there doesn't appear to be.

MichaelChirico on 11 Jun 2018

Is it a bug?

msgoussi on 11 Jun 2018

@msgoussi yes, you can work around it with something like

data.table::fread("a,b,c\n0,1,2")[a==0, a:=NA]

jangorecki on 11 Jun 2018

If I have 500 columns to clean and need to treat all of the following strings as NA, c("", "-", "_", "..", "...", "--", "**", "" ,
"n/a", "n.a.", "#VALUE!", "0", "Inf", "-Inf", "NAN", "r", "e"), the workaround will not look good. This is my opinion.

msgoussi on 11 Jun 2018

you're correct, but until the bug is fixed, you can at least get rid of all
the other NA strings with fread, then only replace 0 in the code.

MichaelChirico on 12 Jun 2018
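
The two-step workaround suggested above (let fread handle every NA string except "0", then replace zeros in code) could be sketched for many columns roughly as follows. This is only an illustration: the inline text input, the shortened na_tokens vector, and the loop over all columns are assumptions made for the example, not something posted in the thread.

```r
library(data.table)

# Illustrative subset of NA tokens; "0" is deliberately left out
# because fread rejects it in the version discussed in this thread.
na_tokens <- c("", "-", "n/a")

# Small inline input standing in for the real file.
dt <- fread(text = "a,b,c\n0,1,n/a\n3,0,5\n-,2,0", na.strings = na_tokens)

# Replace the remaining zeros with NA, column by column, by reference.
for (j in seq_along(dt)) {
  zero_rows <- which(dt[[j]] == 0)  # which() drops rows that are already NA
  if (length(zero_rows) > 0L) set(dt, i = zero_rows, j = j, value = NA)
}
```

Because set() modifies the table in place, the loop avoids copying the data for each column, which matters for a file of the size mentioned earlier. Note that this treats every zero in every column as missing, so it only fits data where 0 never carries a real value.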

I have an R package on GitHub, ribailey/gghybrid, which includes a function to read genomic data files with potentially millions of columns. One of the main programs for producing these input files, PLINK, always codes missing data as zero. I use fread within my function to read in the data and declare missing values, so the bug described here means I can't read the most common file type people might want to use. Has there been any progress on this? Many thanks, Richard.

ribailey on 3 Jul 2020

Removing || strcmp(ch,"1")==0 || strcmp(ch,"0")==0 from fread.c seems like an option to "fix" this.

Only breaks 1 test case which is explicitly testing for na.strings = '1'.

The interaction with logical01 also seems legit.

data.table::fread("a,b,c\n0,1,2\n1,0,2", na.strings = "0", logical01 = TRUE)

        a      b     c
   <lgcl> <lgcl> <int>
1:     NA   TRUE     2
2:   TRUE     NA     2

ben-schwen on 12 Oct 2020

I think the expected behavior is that we should not allow logical01=TRUE and na.strings = "0" at the same time. If this is agreed, I'm happy to file a PR for this.
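
For illustration only, the kind of argument check proposed here might look like the sketch below. The helper check_na_logical01 is hypothetical and is not part of data.table; including "1" alongside "0" is an assumption based on the na.strings = "1" remark quoted earlier, not an agreed design.

```r
# Hypothetical helper, NOT part of data.table: rejects the combination
# of logical01 = TRUE with "0" or "1" among the NA strings.
check_na_logical01 <- function(na.strings, logical01) {
  clash <- intersect(na.strings, c("0", "1"))
  if (isTRUE(logical01) && length(clash) > 0L) {
    stop("na.strings contains ", paste0('"', clash, '"', collapse = ", "),
         " while logical01 = TRUE; these settings conflict.", call. = FALSE)
  }
  invisible(TRUE)
}

# Passes silently: no "0" or "1" among the NA strings.
check_na_logical01(c("", "n/a"), logical01 = TRUE)

# Also passes: "0" is acceptable here because logical01 is FALSE.
check_na_logical01("0", logical01 = FALSE)
```

With a check of this shape, na.strings = "0" would remain usable for integer data such as the PLINK files mentioned above, and only the ambiguous combination would raise an error.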