Data.table: asynchronous code in j using grouping variable

Created on 14 May 2019 · 5 Comments · Source: Rdatatable/data.table

In an expression using by, data.table makes the grouping variable available in j. While iterating over the groups, this value is modified by reference. This causes problems when j performs asynchronous operations that use the grouping variable:

> library(future)
> library(promises)
> library(data.table)
> data.table(letter=letters[1:3])[, {future({Sys.sleep(0.1);letter}) %...>% print; NULL}, by=letter]
Empty data.table (0 rows and 1 cols): letter
[1] "c"
[1] "c"
[1] "c"

Executing the equivalent loop in base R yields the expected result.

> for(letter in letters[1:3]){ future({Sys.sleep(0.1);letter}) %...>% print; NULL}
[1] "a"
[1] "b"
[1] "c"

In the real world, this can lead to wrong ggtitles in plots printed in j when using rmarkdown.

Is the current behavior desired in any way, or is this a bug?

I could not find a related issue, or NEWS entry.
````

sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.12.3 promises_1.0.1 future_1.12.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 codetools_0.2-16 listenv_0.7.0 later_0.8.0 packrat_0.5.0 digest_0.6.18 R6_2.4.0 magrittr_1.5 evaluate_0.13 rlang_0.3.4 rmarkdown_1.12 tools_3.5.3 xfun_0.6 parallel_3.5.3 rsconnect_0.8.13 compiler_3.5.3 globals_0.12.4
[18] htmltools_0.3.6 knitr_1.22
````

duplicate

jan-glx

Most helpful comment

Oh yes, apparently related issues are hard to find but very abundant:
On SO:
https://stackoverflow.com/questions/51057600/rmarkdown-data-table-plots-dont-match-after-compilation
https://stackoverflow.com/questions/42449257/saving-plots-in-a-data-table-list-column
https://stackoverflow.com/questions/43536583/data-table-not-returning-correct-plots
https://stackoverflow.com/questions/28400446/plotting-by-group-in-data-table
https://stackoverflow.com/questions/36413514/strange-error-plotting-by-group
https://stackoverflow.com/questions/27505794/values-of-the-wrong-group-are-used-when-using-plot-within-a-data-table-in-rs
and here:
https://github.com/Rdatatable/data.table/issues/3120
https://github.com/Rdatatable/data.table/issues/2050
https://github.com/Rdatatable/data.table/issues/1524

Below is a very simple instance of this problem (it doesn't even need to be asynchronous):

library(data.table)
x <- list()
data.table(i=1:3,letter=letters[1:3])[, {x[[i]] <<- letter; NULL}, by=letter]
#> Empty data.table (0 rows and 1 cols): letter
x
#> [[1]]
#> [1] "c"
#> 
#> [[2]]
#> [1] "c"
#> 
#> [[3]]
#> [1] "c"

I understand that this might be hard to fix efficiently until reference counting becomes default for R, but I favor correctness over efficiency. 😇 (.GRP, .N, .SD and .BY are equally affected btw)

jan-glx on 17 May 2019

👍3

All 5 comments

This is probably the same problem as #3120.

jangorecki on 14 May 2019

See also https://stackoverflow.com/questions/51057600/rmarkdown-data-table-plots-dont-match-after-compilation.
Using the copy() trick works on the example above.
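
For readers who have not seen it, here is a minimal sketch of the copy() trick. It is applied to the simplified synchronous reproduction from this thread, not to the futures example, and the names DT, shared and copied are made up for illustration. The idea is to take a deep copy of the grouping value inside j, so that whatever is stored no longer refers to the buffer that data.table reuses between groups.

```r
library(data.table)

# Hypothetical toy table, mirroring the reproduction in this thread.
DT <- data.table(i = 1:3, letter = letters[1:3])

# Without copy(): each stored element refers to the grouping value that
# data.table updates in place, so (as reported in this thread) all
# elements can end up showing the last group.
shared <- list()
invisible(DT[, {shared[[i]] <<- letter; NULL}, by = letter])

# With copy(): each group's value is duplicated before it is stored,
# so later groups cannot change it.
copied <- list()
invisible(DT[, {copied[[i]] <<- copy(letter); NULL}, by = letter])

print(copied)
```

The cost is one small allocation per group, which is usually negligible next to plotting or launching a future. The same pattern should apply to anything else captured from j for later use, for example a plot title built from the grouping variable.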

Atrebas on 14 May 2019

👍1

I wonder whether it would be worth at least warning the user somewhere in the documentation that computations can produce erroneous results if they depend on a variable that is implicitly modified by reference between groups. It isn't obvious that computations such as these can interact across levels of the grouping (by) column. This can be remedied by using the copy() trick, but the documentation (reference semantics vignette and copy() help page) really only warns about unexpected modifications due to :=, set*, and changing names(DT). The phenomena here and in the other examples in https://github.com/Rdatatable/data.table/issues/3563#issuecomment-493537899 seem conceptually different from those mentioned in the documentation, as they appear to result from implementation-level modifications by reference that the user may not know about.