Hi,
I think I'm encountering a weird bug where R crashes when I try to do a non-equi join. Apologies for not being able to create a minimal reproducible example. I have attached two data.tables, both with 10,000 rows and up to four columns.
Here is the code to (hopefully) reproduce the error:

```
library(data.table)
DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')

# This does not work; R crashes on my system.
DT1[
  DT2, on = .(Month <= MonthFuture, Month >= MonthPast, FlightDetails == FlightCode)
]

# This works after downsampling both tables to 1,000 rows.
set.seed(1)
n <- 1e3
DT1 <- DT1[sample(.N, n)]
DT2 <- DT2[sample(.N, n)]
DT1[
  DT2, on = .(Month <= MonthFuture, Month >= MonthPast, FlightDetails == FlightCode)
]
```
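In case it helps, here is a purely illustrative sketch of tables with the same shape. The column names come from the join above, but the values are invented, so this alone may not trigger the crash; the attached files are needed for that.

```
library(data.table)
set.seed(42)
# Hypothetical stand-ins for the attached files (values invented)
DT1 <- data.table(
  Month         = sample(1:24, 1e4, replace = TRUE),
  FlightDetails = sample(sprintf("FL%03d", 1:50), 1e4, replace = TRUE)
)
DT2 <- data.table(
  MonthPast   = sample(1:12,  1e4, replace = TRUE),
  MonthFuture = sample(13:24, 1e4, replace = TRUE),
  FlightCode  = sample(sprintf("FL%03d", 1:50), 1e4, replace = TRUE)
)
```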
Output of sessionInfo() ----

```
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252    LC_MONETARY=English_New Zealand.1252
[4] LC_NUMERIC=C                         LC_TIME=English_New Zealand.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0
```
Thank you. Let me know if there is anything I can provide to aid in debugging.
Also tested on the latest dev version; the same error persists.
data.table 1.12.1 IN DEVELOPMENT built 2019-02-14 09:56:44 UTC;
Might be unrelated, but you should run setDT(DT1) after readRDS().
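A minimal sketch of what I mean; my understanding is that serialization drops data.table's internal over-allocation, and setDT() restores it in place:

```
library(data.table)
DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')
# Restore data.table internals after deserialization
setDT(DT1)
setDT(DT2)
```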
Thanks Michael. I ran setDT() on both DT1 and DT2 and reran the commands, but R still crashed.
Were you able to reproduce this?
Works well up until 1.11.8. Seg faults from 1.12.0.
Seems like this is the commit that breaks it: https://github.com/Rdatatable/data.table/commit/e59ba149f21bb35098ed0b527505d5ed6c5cb0db#diff-3f6e5ca10e702fb2c499a882aa3447e0
Thanks @arunsrinivasan! That helps. I'm using 1.11.8 for now. Will upgrade to the latest dev once your fix is merged.
Also, I wanted to thank you all for your great work on data.table. It has been a pleasure to use, and its RAM efficiency has let us put off buying a new computer with more than 64GB of RAM for as long as possible.
I just ran into this issue. Super happy to find it has already been logged and patched. Tested on my end against the dev version and can confirm it solves my issue as well. You folks rock!!!
And it will be landing on CRAN within days/hours
I was able to work around this in my scenario by adding a pre-filter on x. I haven't fully thought it through, but this might be a possible generic optimization to reduce the working set. If not, just ignore ;)
My original code was something like:

```
d1[d2,
   .(.N, res = sum(Cont.Low + Cont.High)),
   on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   by = .EACHI, allow.cartesian = TRUE]
```
By adding a filter on d1, `d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)]`, I was able to get this to work:
```
d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)][
  d2,
  .(.N, res = sum(Cont.Low + Cont.High)),
  on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
  by = .EACHI, allow.cartesian = TRUE]
```
This filter on x is logically implied by the non-equi join conditions and doesn't actually affect the result, but in my scenario it seems to sidestep the failing memory allocation.
Obviously there are all kinds of considerations, like whether computing the min and max and applying the filter is worth the cost. As I said, feel free to ignore this if it isn't useful.
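To illustrate on toy data (invented here, not my real tables) that the pre-filter leaves the join result unchanged:

```
library(data.table)
set.seed(1)
# Invented stand-ins for d1 and d2, reusing the column names from above
d1 <- data.table(rowid     = 1:100,
                 Cont.Low  = runif(100),
                 Cont.High = runif(100) + 1)
d2 <- data.table(start.rowid = sample(1:50, 10),
                 end.rowid   = sample(51:100, 10),
                 Bottom      = runif(10),
                 Top         = runif(10))

res_full <- d1[d2,
               .(.N, res = sum(Cont.Low + Cont.High)),
               on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
               by = .EACHI, allow.cartesian = TRUE]

# Same join after dropping x rows that can never satisfy the rowid conditions
res_filtered <- d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)][
  d2,
  .(.N, res = sum(Cont.Low + Cont.High)),
  on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
  by = .EACHI, allow.cartesian = TRUE]

all.equal(res_full, res_filtered)  # TRUE
```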