I'm getting some weird crashes exclusively on the i386 version of R on Windows, both on win-builder and locally. All the crashes seem to have memory-related error messages, but they occur inconsistently (not every time I run the code), and the outcome seems to alternate randomly between "cannot allocate vector of length", "cannot allocate a vector of size", and an actual crash ("R encountered a fatal error" in RStudio).
Here's a minimal reproducible example:
library(data.table)
y <- data.table(addr_id=c(1L,2L,2L,3L,5L),
ppt_id=c(1L,1L,1L,2L,2L),
addr_start=c(1L,10L,12L,1L,1L),
addr_end=c(9L,11L,14L,17L,10L))
x <- data.table(addr_id=rep(1L:4L,each=3L),
exposure_start=rep(c(1L,8L,15L),times=4L),
exposure_end=rep(c(7L,14L,21L),times=4L),
exposure_value=c(rnorm(12))
)
setkey(x,addr_id, exposure_start,exposure_end)
setkey(y,addr_id, addr_start,addr_end)
for(i in 1:1000){
foverlaps(x,y,nomatch=NULL)
gc()
}
As you can see, the datasets are small, so despite i386's memory limitations this shouldn't cause problems. The crash happens even with the gc() in the loop.
I'm pretty sure this happens whether nomatch=NULL or the default is used.
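Because the failure alternates between different error messages, it may help to tabulate which messages appear and how often. The following is a minimal sketch, not part of the original report: `tally_errors` is a hypothetical helper name, and it can only record R-level errors. A hard crash of the R process cannot be caught by `tryCatch`.

```r
library(data.table)

y <- data.table(addr_id = c(1L, 2L, 2L, 3L, 5L),
                ppt_id = c(1L, 1L, 1L, 2L, 2L),
                addr_start = c(1L, 10L, 12L, 1L, 1L),
                addr_end = c(9L, 11L, 14L, 17L, 10L))
x <- data.table(addr_id = rep(1L:4L, each = 3L),
                exposure_start = rep(c(1L, 8L, 15L), times = 4L),
                exposure_end = rep(c(7L, 14L, 21L), times = 4L))
setkey(x, addr_id, exposure_start, exposure_end)
setkey(y, addr_id, addr_start, addr_end)

# Hypothetical helper: call `expr_fun` n times and count each distinct
# error message. Iterations that succeed are not recorded.
tally_errors <- function(expr_fun, n = 1000L) {
  msgs <- character(0)
  for (i in seq_len(n)) {
    msg <- tryCatch({
      expr_fun()
      NA_character_
    }, error = function(e) conditionMessage(e))
    if (!is.na(msg)) msgs <- c(msgs, msg)
  }
  table(msgs)
}

tally_errors(function() foverlaps(x, y, nomatch = NULL), n = 1000L)
```

On a platform where the problem does not occur, the resulting table should simply be empty.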
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2 yaml_2.2.1 intervalaverage_0.1.0
[5] data.table_1.12.8
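Since the problem appears only on the 32-bit build, it can be worth confirming which build a given session is actually running before trying the example. A small base-R sketch (the variable names are illustrative only):

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build.
ptr_bytes <- .Machine$sizeof.pointer

arch_info <- c(arch = R.version$arch,
               pointer_bits = as.character(ptr_bytes * 8L))
print(arch_info)

# Report the installed data.table version, if the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(packageVersion("data.table"))
}
```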
Thanks for the report.
Is the problem reproducible on the recent development version?
Did you try replacing foverlaps with a non-equi join to see if the problem is still there?
The above issue is reproducible in the current development version of data.table.
I tried an equivalent non-equi join as you suggested and I'm not seeing crashes or errors with the following code on either development or release data.table on i386 R.
for(i in 1:1000){
x[y,on=c("addr_id","exposure_start<=addr_end","exposure_end>=addr_start"),nomatch=NULL]
gc()
}
for(i in 1:1000){
x[y,on=c("addr_id","exposure_start<=addr_end","exposure_end>=addr_start"),nomatch=NULL]
}
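To check that the non-equi join really is a like-for-like substitute for foverlaps on these tables, the matched pairs from both approaches can be compared. This is a sketch under the assumption that foverlaps runs without error (for example on 64-bit R); the names `fo` and `ne` are illustrative, and the `x.`/`i.` prefixes select the original interval columns from each table.

```r
library(data.table)

y <- data.table(addr_id = c(1L, 2L, 2L, 3L, 5L),
                ppt_id = c(1L, 1L, 1L, 2L, 2L),
                addr_start = c(1L, 10L, 12L, 1L, 1L),
                addr_end = c(9L, 11L, 14L, 17L, 10L))
x <- data.table(addr_id = rep(1L:4L, each = 3L),
                exposure_start = rep(c(1L, 8L, 15L), times = 4L),
                exposure_end = rep(c(7L, 14L, 21L), times = 4L))
setkey(x, addr_id, exposure_start, exposure_end)
setkey(y, addr_id, addr_start, addr_end)

# Overlapping pairs according to foverlaps.
fo <- foverlaps(x, y, nomatch = NULL)[
  , .(addr_id, exposure_start, exposure_end, addr_start, addr_end)]

# Overlapping pairs according to the non-equi join, keeping the original
# interval endpoints from both tables.
ne <- x[y,
        .(addr_id,
          exposure_start = x.exposure_start,
          exposure_end = x.exposure_end,
          addr_start = i.addr_start,
          addr_end = i.addr_end),
        on = c("addr_id", "exposure_start<=addr_end", "exposure_end>=addr_start"),
        nomatch = NULL]

# TRUE when both approaches return the same set of rows, ignoring order.
fsetequal(fo, ne)
```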
I've been able to reproduce this issue with foverlaps on a different Windows machine using the current dev version of data.table (again, only on the i386 version of R).
"data.table 1.12.9 IN DEVELOPMENT built 2020-06-26 22:21:26 UTC; root using 2 threads (see ?getDTthreads). Latest news: r-datatable.com"
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.9
loaded via a namespace (and not attached):
[1] compiler_4.0.2
Thank you for providing extra info.
I connected to a Windows machine and tried reproducing the problem, but couldn't. I am not getting any crash.
R version 4.0.2 (2020-06-22)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.12.9
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2
I tried a few times; should that be enough?
Your example calls rnorm without set.seed, which makes it less portable, though this is a minor thing of course.
I have no idea how to proceed then; I don't have any other Windows machine to try this on.
Could you provide the flags that are used during compilation?
install.packages("data.table", type = "source", repos = "https://Rdatatable.gitlab.io/data.table")
#*** arch - i386
#"C:/rtools40/mingw32/bin/"gcc -I"C:/R-40~1.2/include" -DNDEBUG -fopenmp -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 -mstackrealign -c assign.c -o assign.o
and
cat(readLines(system.file(package="data.table", "cc")), sep="\n")
#CC=gcc -std=gnu99
#CFLAGS=-g -O2 -fdebug-prefix-map=/build/r-base-aGvNeb/r-base-4.0.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
*** arch - i386
"C:/rtools40/mingw32/bin/"gcc -I"C:/PROGRA~1/R/R-40~1.2/include" -DNDEBUG -fopenmp -O2 -Wall -std=gnu99 -mfpmath=sse -msse2 -mstackrealign -c assign.c -o assign.o
CC=gcc -std=gnu99
CFLAGS=-g -O2 -fdebug-prefix-map=/build/r-base-aGvNeb/r-base-4.0.0=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g
Sorry, I pulled that from an existing test or something and forgot the seed, but the seed doesn't matter, since I've been able to reproduce the error without that column even existing. Here's a slightly more minimal reprex:
library(data.table)
y <- data.table(addr_id=c(1L,2L,2L,3L,5L),
ppt_id=c(1L,1L,1L,2L,2L),
addr_start=c(1L,10L,12L,1L,1L),
addr_end=c(9L,11L,14L,17L,10L))
x <- data.table(addr_id=rep(1L:4L,each=3L),
exposure_start=rep(c(1L,8L,15L),times=4L),
exposure_end=rep(c(7L,14L,21L),times=4L)
)
setkey(x,addr_id, exposure_start,exposure_end)
setkey(y,addr_id, addr_start,addr_end)
foverlaps(x,y)
Something is definitely weird though since there are sessions where I can't get it to happen at all but other sessions where it happens the first time I run foverlaps on these datasets.
Actually, it seems like this error will not occur in the session in which the package is installed.
try this:
remove.packages("data.table")
# from source:
install.packages("data.table", type="source", repos="https://Rdatatable.gitlab.io/data.table")
# or from binary:
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
I've replicated the above on both machines: the error doesn't occur in the session in which data.table was installed. Also, it doesn't matter whether it was installed from source or binary.
@myoung3 Thanks, now I am able to reproduce the issue.
To summarize:
The error/crash happens on 32-bit Windows only. To reproduce, one needs to install data.table, close the session, open a new session, then run the provided example. Running the example in the session where the package was installed does not reliably reproduce it. The equivalent non-equi join does not crash or error in any case.
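Since the failure depends on running in a fresh session, one way to automate the "open new session, run example" step is to launch the reprex in child R processes and collect their exit statuses. This is a rough sketch rather than a tested procedure for this bug: it assumes data.table is already installed, that `Rscript` from the current installation matches the architecture under test, and the number of runs is arbitrary.

```r
# Reprex from the thread, written to a temporary script file.
reprex <- c(
  "library(data.table)",
  "y <- data.table(addr_id=c(1L,2L,2L,3L,5L),",
  "                ppt_id=c(1L,1L,1L,2L,2L),",
  "                addr_start=c(1L,10L,12L,1L,1L),",
  "                addr_end=c(9L,11L,14L,17L,10L))",
  "x <- data.table(addr_id=rep(1L:4L,each=3L),",
  "                exposure_start=rep(c(1L,8L,15L),times=4L),",
  "                exposure_end=rep(c(7L,14L,21L),times=4L))",
  "setkey(x, addr_id, exposure_start, exposure_end)",
  "setkey(y, addr_id, addr_start, addr_end)",
  "invisible(foverlaps(x, y))"
)
script <- tempfile(fileext = ".R")
writeLines(reprex, script)

# Rscript from the same R installation (and architecture) as this session.
rscript <- file.path(R.home("bin"), "Rscript")

n_runs <- 5L  # arbitrary number of fresh sessions to try
status <- vapply(seq_len(n_runs), function(i) {
  as.integer(system2(rscript, shQuote(script), stdout = FALSE, stderr = FALSE))
}, integer(1))

# 0 means the child session finished normally; any other value indicates
# an R error or a crash in that session.
table(status)
```

Each child process starts from scratch, so every run corresponds to a new session after installation.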
@jangorecki Great, glad to hear my boxes aren't haunted.
I've found that it's even a bit weirder than I described above. I can reproduce the issue in the session in which data.table was installed, but it doesn't happen immediately.
In the session where data.table was installed I can get an error with the following, but this seems more stochastic (and might not even occur within 1000 reps):
library(data.table)
y <- data.table(addr_id=c(1L,2L,2L,3L,5L),
ppt_id=c(1L,1L,1L,2L,2L),
addr_start=c(1L,10L,12L,1L,1L),
addr_end=c(9L,11L,14L,17L,10L))
x <- data.table(addr_id=rep(1L:4L,each=3L),
exposure_start=rep(c(1L,8L,15L),times=4L),
exposure_end=rep(c(7L,14L,21L),times=4L)
)
setkey(x,addr_id, exposure_start,exposure_end)
setkey(y,addr_id, addr_start,addr_end)
replicate(1000,foverlaps(x,y))
which returns a "cannot allocate vector of size <> Gb" error.
Whereas in a new session after installation, for me at least, the following causes an instant crash (not an error) on just a single foverlaps call:
library(data.table)
y <- data.table(addr_id=c(1L,2L,2L,3L,5L),
ppt_id=c(1L,1L,1L,2L,2L),
addr_start=c(1L,10L,12L,1L,1L),
addr_end=c(9L,11L,14L,17L,10L))
x <- data.table(addr_id=rep(1L:4L,each=3L),
exposure_start=rep(c(1L,8L,15L),times=4L),
exposure_end=rep(c(7L,14L,21L),times=4L)
)
setkey(x,addr_id, exposure_start,exposure_end)
setkey(y,addr_id, addr_start,addr_end)
foverlaps(x,y)
I'll add that I tried replacing the rep() calls with the same vectors built manually with c(), but was still able to get the error, so it's not some sort of lazy-evaluation issue around rep (not even sure that makes sense, but I tried it, so I figured I'd mention it).
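For reference, the manually written vectors that correspond to the rep() calls in the reprex can be checked with identical(). This is only an illustration of the substitution described above, using base R; the variable names are made up for the example.

```r
# rep() versions used in the reprex.
addr_id_rep <- rep(1L:4L, each = 3L)
start_rep   <- rep(c(1L, 8L, 15L), times = 4L)
end_rep     <- rep(c(7L, 14L, 21L), times = 4L)

# The same vectors written out by hand with c().
addr_id_manual <- c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L)
start_manual   <- c(1L, 8L, 15L, 1L, 8L, 15L, 1L, 8L, 15L, 1L, 8L, 15L)
end_manual     <- c(7L, 14L, 21L, 7L, 14L, 21L, 7L, 14L, 21L, 7L, 14L, 21L)

# Each comparison returns TRUE: same type, length, and values.
identical(addr_id_rep, addr_id_manual)
identical(start_rep, start_manual)
identical(end_rep, end_manual)
```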
@jangorecki yep that summary seems correct