I found a strange behavior of data.table. I found a situation where lapply applied on .SD changes the order of the column, messing then when one intend to assign the result.
An example:
Here the normal behavior
library(data.table)
plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]
All these lines give :
z y x
1: 3 2 1
which I need for example to reassign to c("y","x"). But if I do:
plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]
z x y
1: 3 1 2
Here the order of x and y changed without reason, when it should yield the same result as the last "working" example. If then assign the wrong values to c("y","x") if I assign the output of lapply to new vector of columns. It seems that the use of get in the i part of .SD triggers this bug.
Example of the effect of this on assignment:
plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
.SDcols = cols ,by = z][]
# x y z
# 1: 2 1 3
# Output of sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=French_Switzerland.1252 LC_CTYPE=French_Switzerland.1252
[3] LC_MONETARY=French_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=French_Switzerland.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.2.1 lmerTest_3.1-0 lme4_1.1-21 Matrix_1.2-17
[5] plyr_1.8.4 lubridate_1.7.4 readxl_1.3.1 data.table_1.12.2
[9] stringr_1.4.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 compiler_3.6.1
[4] pillar_1.4.2 nloptr_1.2.1 tools_3.6.1
[7] digest_0.6.21 boot_1.3-22 tibble_2.1.3
[10] gtable_0.3.0 nlme_3.1-140 lattice_0.20-38
[13] pkgconfig_2.0.3 rlang_0.4.0 rstudioapi_0.10
[16] withr_2.1.2 dplyr_0.8.3 grid_3.6.1
[19] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
[22] minqa_1.2.4 reshape2_1.4.3 purrr_0.3.2
[25] magrittr_1.5 scales_1.0.0 MASS_7.3-51.4
[28] splines_3.6.1 assertthat_0.2.1 colorspace_1.4-1
[31] numDeriv_2016.8-1.1 labeling_0.3 stringi_1.4.3
[34] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4
@dmongin thank you for detailed report. I can reproduce it on recent devel. As a workaround you can use now substitute rather than get
var = "x"
expr = substitute(
plouf[, c(cols) := lapply(.SD[.var == 1],function(x){x}), .SDcols = cols, by = z][],
list(.var=as.name(var))
)
print(expr)
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
# x
#})), .SDcols = cols, by = z][]
eval(expr)
# x y z
#1: 2 1 3
Personally I would use it regularly, not as a workaround, I find R metaprogramming features superior.
Also be aware that some day instead of get(var) we should be able to use ..var, #2816, #3199
R metaprogramming always worked and, I assume, will always work, thanks to the conservative backward compatible R code development.
Note this is sort of covered by the verbose message:
options(datatable.verbose=TRUE)
plouf[...]
'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
Old: y,x
New: x,y
The issue is the message is a bit misleading/slightly confusing. Actually you've used get in a nested [ call, whereas this behavior is designed for get alongside .SD in the same [ call.
I'm not sure how useful it is to create fully robust tree parsing for edge cases like this that are better suited by metaprogramming as Jan suggested, especially given that the order of columns shouldn't really matter all that much.
Will leave this open for now to keep the discussion going a bit.
order does matter if you provide columns to assign on LHS of :=, then you end up with incorrectly labelled columns
@MichaelChirico as @jangorecki suggested, I discovered the problem in wrongly assigned column in my code (that I assigned on LHS with :=), and it took me a while to traceback the error. There could be cases were the user does not notice it. I find it quite dangerous, so I would vote for something robust.
I did not know about the metaprogramming, and I am not sure to see the advantage compared to get() (apart from avoiding this bug, but I find the code more hard to read and to use). If use of ..var avoid the bug, maybe the use of get() should be depreciated ?
The stack post, where I asked for workaround. I copied @jangorecki suggestion there:
https://stackoverflow.com/questions/59176071/data-table-bug-lapply-on-sd-reorder-columns-when-using-get-possible-workaro/59184106#59184106
@dmongin best place to get proper information on metaprogramming is official R language manual, there is a dedicated chapter to this feature: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Computing-on-the-language
Advantage is that you can construct any query you need to. If you are coding parametrized interface to some API (like data.table) you can construct parametrized queries as you would type the values of parameters by hand, and evaluate those. There are multiple packages that meant to help user to do metaprogramming in R introducing various ways to handle substitution ~ var, !!var, !!!var, {{ var }}}. I haven't use them really because base R metaprogramming works well, it is also the most reliable way, backward compatible, future proof, least prone to bugs.
Looking into it, changing this:
To:
if (any(c("get", "mget") %chin% av) && !(all(names_x %in% c(ansvars, allbyvars)) && is.null(names_i))) {
Would address this issue. Test 717 would need to be changed but all other tests check out for me when making the change.
The logic is that if all of the names of the data.table are already accounted for in our .SD and by and there are no names in i (meaning it's not a join), we wouldn't reorder the columns. Joins could be accounted for as well but it seems unlikely to have a join, :=, get(), and cols to update.
I am working on a PR to refactor the code to move the m(get) statement up a little bit, per the code comment:
What is the desired behaviour? Should there be a warning to users who use := and (m)get() or should we prevent the column switching from happening when all variables are already accounted for?
Most helpful comment
order does matter if you provide columns to assign on LHS of
:=, then you end up with incorrectly labelled columns