Data.table: R crashes on non-equi join

Created on 14 Feb 2019 · 9 comments · Source: Rdatatable/data.table

Hi,

I think I'm encountering a weird bug where R crashes when I try to do a non-equi join. Apologies for not being able to create a minimal reproducible example. I have attached two data.tables, both with 10,000 rows and up to four columns.

Here is the code to (hopefully) reproduce the error:

library(data.table)
DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')

# This does not work, R crashes on my system.
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]
# This works after downsampling both tables to 1,000 rows
set.seed(1)
n <- 1e3
DT1 <- DT1[sample(.N, n)]
DT2 <- DT2[sample(.N, n)]
DT1[
  DT2, on = .(Month<=MonthFuture, Month>=MonthPast, FlightDetails==FlightCode)
]
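
For readers without the attached RDS files, here is a minimal sketch that fabricates two tables with the same shape. The column names come from the issue, but the types, value ranges, and match density are guesses, so it may or may not trigger the crash:

```r
library(data.table)
set.seed(42)

# Hypothetical stand-ins for the attached DT1.rds / DT2.rds
# (column names from the issue; types and value ranges are assumptions)
DT1 <- data.table(
  Month         = sample(1:24, 1e4, replace = TRUE),
  FlightDetails = sample(sprintf("FL%04d", 1:1000), 1e4, replace = TRUE)
)
DT2 <- data.table(
  MonthPast  = sample(1:22, 1e4, replace = TRUE),
  FlightCode = sample(sprintf("FL%04d", 1:1000), 1e4, replace = TRUE)
)
DT2[, MonthFuture := MonthPast + sample(0:2, .N, replace = TRUE)]

# Same non-equi join as in the report; allow.cartesian = TRUE guards against
# the fabricated data matching more densely than the real data would
DT1[DT2,
    on = .(Month <= MonthFuture, Month >= MonthPast, FlightDetails == FlightCode),
    allow.cartesian = TRUE]
```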

Output of sessionInfo() ----

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_New Zealand.1252  LC_CTYPE=English_New Zealand.1252        LC_MONETARY=English_New Zealand.1252
[4] LC_NUMERIC=C                         LC_TIME=English_New Zealand.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.0

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2    yaml_2.2.0

Thank you. Let me know if there is anything I can provide to aid in debugging.

Labels: bug, non-equi joins, regression

All 9 comments

Also tested on the latest dev version; the same error persists.
data.table 1.12.1 IN DEVELOPMENT built 2019-02-14 09:56:44 UTC

Might be unrelated, but you should run setDT(DT1) after readRDS.
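
For context: a data.table deserialized with readRDS loses the internal self-reference and column over-allocation that data.table relies on for by-reference operations, and setDT() restores them. A minimal sketch of the suggested step:

```r
library(data.table)

DT1 <- readRDS('DT1.rds')
DT2 <- readRDS('DT2.rds')

# Serialization drops data.table's internal self-reference / over-allocation,
# so re-establish it before doing anything by reference
setDT(DT1)
setDT(DT2)
```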

Thanks Michael. I ran setDT on both DT1 and DT2 and reran the commands, but R still crashed.

Were you able to reproduce this?

Works fine up to 1.11.8; seg-faults from 1.12.0.

This seems to be the commit that breaks it: https://github.com/Rdatatable/data.table/commit/e59ba149f21bb35098ed0b527505d5ed6c5cb0db#diff-3f6e5ca10e702fb2c499a882aa3447e0

Thanks @arunsrinivasan! That helps. I'm using 1.11.8 for now. Will upgrade to the latest dev once your fix is merged.
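
For anyone else pinning versions while waiting for the fix, a sketch using the remotes package (an assumption on my part; any equivalent installer works):

```r
# Roll back to the last known-good release
remotes::install_version("data.table", version = "1.11.8")

# Or, once the fix is merged, install the development version from GitHub
remotes::install_github("Rdatatable/data.table")
```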

Also, I wanted to thank you all for your great work on data.table. It has been a pleasure to use, and its RAM efficiency has let us put off buying a machine that supports more than 64 GB of RAM for as long as possible.

I just ran into this issue. Super happy to find it has already been logged and patched. Tested on my end against the dev version and can confirm it solves my issue as well. You folks rock!

And it will be landing on CRAN within days, if not hours.

I was able to work around this in my scenario by adding a pre-filter on x. I haven't fully thought it through, but this might be a possible generic optimization to reduce the working set. If not, just ignore ;)

My original code was something like:

d1[d2,
   .(.N, res = sum(Cont.Low + Cont.High)),
   on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   by = .EACHI, allow.cartesian = TRUE]

By adding a filter, d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)], I was able to get this to work:

d1[rowid >= min(d2$start.rowid) & rowid <= max(d2$end.rowid)][d2,
   .(.N, res = sum(Cont.Low + Cont.High)),
   on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   by = .EACHI, allow.cartesian = TRUE]

This filter on x is logically implied by the non-equi join conditions and doesn't actually affect the result, but it seems (in my scenario) to bypass the problematic memory allocation.

Obviously there are all kinds of considerations, like whether computing the min and max and applying the filter is worth the cost. As I said, feel free to ignore this if it isn't useful.
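
On that cost question, one small refinement (a sketch, reusing the same hypothetical d1/d2 names as above) is to compute the bounds once, so the extra work is explicit: a single min/max scan of d2 plus one vectorized filter of d1.

```r
# Compute the implied bounds once, then pre-filter d1 before the join
lo <- min(d2$start.rowid)
hi <- max(d2$end.rowid)

d1[rowid >= lo & rowid <= hi][d2,
   .(.N, res = sum(Cont.Low + Cont.High)),
   on = .(rowid >= start.rowid, rowid < end.rowid, Cont.Low <= Top, Cont.High >= Bottom),
   by = .EACHI, allow.cartesian = TRUE]
```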
