I have to work with data that is inconveniently separated by a "¬". I've been using data.table successfully so far, but this call:
data <- fread("data/RCI_MR_20200510.txt", sep = "¬")
works perfectly when run directly in RStudio, but when executed from a .bat file (for automation purposes) that looks like this
"C:\Program Files\R\R-3.6.1\bin\x64\R.exe" CMD BATCH "C:\Users\myuser\mydir\project_simple.R"
it yields the following error in the .Rout
Error in fread("data/RCI_MR_20200510.txt", sep = "¬") :
nchar(sep) == 1L is not TRUE
Calls: %>% -> eval -> eval -> fread -> stopifnot
Execution halted
I successfully reproduced the error using any data with this separator. I guess it has something to do with the encoding. Any insight is appreciated!
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] lubridate_1.7.4 readxl_1.3.1 fuzzyjoin_0.1.5
[4] data.table_1.12.8 DBI_1.1.0 forcats_0.5.0
[7] stringr_1.4.0 dplyr_0.8.5 purrr_0.3.3
[10] readr_1.3.1 tidyr_1.0.2 tibble_2.1.3
[13] ggplot2_3.3.0 tidyverse_1.3.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 cellranger_1.1.0 pillar_1.4.3
[4] compiler_3.6.1 dbplyr_1.4.2 odbc_1.2.2
[7] tools_3.6.1 bit_1.1-15.2 jsonlite_1.6.1
[10] lifecycle_0.2.0 nlme_3.1-140 gtable_0.3.0
[13] lattice_0.20-38 pkgconfig_2.0.3 rlang_0.4.5
[16] reprex_0.3.0 cli_2.0.2 rstudioapi_0.11
[19] yaml_2.2.1 haven_2.2.0 withr_2.1.2
[22] xml2_1.2.5 httr_1.4.1 fs_1.3.2
[25] generics_0.0.2 vctrs_0.2.4 hms_0.5.3
[28] bit64_0.9-7 grid_3.6.1 tidyselect_1.0.0
[31] glue_1.3.2 R6_2.4.1 fansi_0.4.1
[34] blob_1.2.1 modelr_0.1.6 magrittr_1.5
[37] backports_1.1.5 scales_1.1.0 rvest_0.3.5
[40] assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4
[43] stringi_1.4.6 munsell_0.5.0 broom_0.5.5
[46] crayon_1.3.4
"¬" is not an ASCII symbol. As such, its byte representation is encoding-dependent. For example, in UTF-8 it is the two-byte sequence \xC2\xAC, whereas in Latin-1 and Windows-1252 it is simply \xAC.
Based on your sessionInfo, your R runs under the Windows-1252 locale. In this locale sep is the single-byte character \xAC and everything works smoothly (at least if the files you're reading are also in the same encoding).
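To make the byte counts concrete, here is a minimal sketch in base R. The variable names are only illustrative, and the character is built from its raw UTF-8 bytes so that the example does not depend on how the script file itself is saved. Latin-1 is used as a stand-in for Windows-1252, since both encode this character as the same single byte.

```r
# U+00AC (NOT SIGN), built from its UTF-8 bytes and marked as UTF-8.
not_sign <- rawToChar(as.raw(c(0xC2, 0xAC)))
Encoding(not_sign) <- "UTF-8"

# Bytes of the same character in the two encodings.
utf8_bytes   <- charToRaw(not_sign)
latin1_bytes <- iconv(not_sign, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

print(utf8_bytes)    # [1] c2 ac
print(latin1_bytes)  # [1] ac
```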
However, when run from the .bat file, your R may be using a different locale. If you run sessionInfo() from your batch file it should tell you what the locale is, but most likely it's either UTF-8 or UTF-16, which causes sep="¬" to become a multi-byte symbol.
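If the full sessionInfo() output is more than you need, a shorter check can be placed at the top of the script. This is only a sketch using base R functions; under R CMD BATCH the printed lines end up in the .Rout file.

```r
# Report the character-type locale and whether R considers it UTF-8 or Latin-1.
ctype <- Sys.getlocale("LC_CTYPE")
info  <- l10n_info()

cat("LC_CTYPE:       ", ctype, "\n")
cat("UTF-8 locale:   ", info[["UTF-8"]], "\n")
cat("Latin-1 locale: ", info[["Latin-1"]], "\n")
```

Comparing this output between the RStudio session and the batch run shows whether the two really differ in locale.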
I bet you have saved the script file in UTF-8 encoding, but the batch run only supports the native encoding. What you should do, in my opinion, is use Rscript to call source with the encoding set to UTF-8 explicitly.
That is: Rscript -e "source('xxx.r', encoding='UTF-8')"
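The following sketch illustrates why a UTF-8 script read in a single-byte native encoding would trip the nchar(sep) == 1L check. It assumes the mismatch described above is what happened; Latin-1 stands in for Windows-1252 here (the two agree on these bytes), and the variable names are hypothetical.

```r
# The two bytes that a UTF-8 script file stores for the separator.
sep_bytes <- rawToChar(as.raw(c(0xC2, 0xAC)))

# Interpreted as UTF-8: one character.
as_utf8 <- sep_bytes
Encoding(as_utf8) <- "UTF-8"

# Interpreted as a single-byte encoding: each byte becomes its own character.
as_single_byte <- iconv(sep_bytes, from = "latin1", to = "UTF-8")

print(nchar(as_utf8, type = "chars"))         # [1] 1
print(nchar(as_single_byte, type = "chars"))  # [1] 2
```

With two characters in sep, the stopifnot() in fread fails exactly as shown in the .Rout above.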
Thank you! That was very clear. Changing "¬" to \xAC in the script solved the issue (at least with data.table; I now have problems with other characters like Ñ in other packages). My batch session was running this locale:
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.1
This is not a data.table question, but if you have any advice on setting a locale in a script run from a .bat file (so that I don't have to replace every character with its byte representation), it would be really helpful.
Thanks again!
I have added my 5 cents already... As a Windows user myself, I suggest saving your R file in UTF-8 encoding and running the script with Rscript from the command line, with the encoding argument set explicitly.
It doesn't seem there is anything we should do on the fread side. Glad the issue is resolved. Thanks for providing solutions. Closing then.
Thank you all guys, this was rather quick. @shrektan, your solution worked perfectly!! For anyone wondering, the syntax I used to solve this issue in the .bat file was
"C:\Program Files\R\R-3.6.1\bin\x64\Rscript.exe" -e "source('C:/my_script.r', encoding = 'UTF-8')"
Note that the argument inside source has to be inside ' ' and with / as a folder separator instead of \ .