Using on= as on=.(x == y, a <= b) -- as simple as that..
>=, >, <= and <.mult="first" and mult="last"?data.tableThen extend https://github.com/Rdatatable/data.table/issues/1068
This would be awesome if implemented !
Any plans to support syntax like on = .(x == y, a < b +3)?
Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Perhaps as a new issue would be great.
Done, see #1639
@arunsrinivasan can you give a quick example of .EACHI failing in combination with a non-equi join?
Here's an example where it'll fail:
require(data.table)
dt = data.table(id="x", a=as.integer(c(3,8,8,15,15,15,16,22,22,25,25)), b=as.integer(c(9,10,25,19,22,25,38,3,9,7,28)), c=as.integer(c(22,33,44,14,49,44,40,25,400,52,77)))
dt[.(a=c(12,20), b=20), sum(c), on=c("a>a", "b<=b"), by=.EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a", :
# by-joins are not yet implemented for multi-group non-equi-joins.
The idea for non-equi joins is to split data.table x (here, dt) into groups where rows of all columns, independently, have values always in increasing order. Here, there'll be more than one such "group". In the examples you've worked so far, there must have been only one such group. If this is hard to follow, please wait a while. I'll have better explanations (with visualisations) for UseR'16 soon.
Ok, thanks, so failure is actually catastrophic and not silent. That's good - I thought it just works incorrectly for some cases and was looking for an incorrect result.
Iirc the cases I've worked with all had a single column on LHS, which always satisfies the single group condition.
Updated all SO posts linked.
great work!!
It seems that for this to work, one has to specify j.
Here is a small example
> dt1 <- data.table(year=1991:2000, v=rnorm(10))
> dt1
year v
1: 1991 -0.4465306
2: 1992 0.3355444
3: 1993 1.7731734
4: 1994 0.5285609
5: 1995 -1.7025382
6: 1996 0.3752937
7: 1997 2.0677762
8: 1998 0.6509314
9: 1999 0.6291038
10: 2000 -0.9639413
> dt2 <- data.table(start=dt1$year-5, end=dt1$year)
> dt2
start end
1: 1986 1991
2: 1987 1992
3: 1988 1993
4: 1989 1994
5: 1990 1995
6: 1991 1996
7: 1992 1997
8: 1993 1998
9: 1994 1999
10: 1995 2000
what I want to get can be accomplished by foverlaps:
> dt1[, year_hlp:=year]
> setkey(dt1, year, year_hlp)
> setkey(dt2, start, end)
> ?foverlaps
> foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
start end year v year_hlp
1: 1986 1991 1991 -0.4465306 1991
2: 1987 1992 1991 -0.4465306 1991
3: 1987 1992 1992 0.3355444 1992
4: 1988 1993 1991 -0.4465306 1991
5: 1988 1993 1992 0.3355444 1992
6: 1988 1993 1993 1.7731734 1993
7: 1989 1994 1991 -0.4465306 1991
8: 1989 1994 1992 0.3355444 1992
9: 1989 1994 1993 1.7731734 1993
10: 1989 1994 1994 0.5285609 1994
11: 1990 1995 1991 -0.4465306 1991
12: 1990 1995 1992 0.3355444 1992
13: 1990 1995 1993 1.7731734 1993
14: 1990 1995 1994 0.5285609 1994
15: 1990 1995 1995 -1.7025382 1995
16: 1991 1996 1991 -0.4465306 1991
17: 1991 1996 1992 0.3355444 1992
18: 1991 1996 1993 1.7731734 1993
......
But I thought I could simply do this:
> dt1[dt2, on=.(year>=start, year<=end), by=.EACHI]
However, I got an error:
Error in `[.data.table`(dt1, dt2, on = .(year >= start, year <= end), :
'by' or 'keyby' is supplied but not j
Why do I have to specify j? All I wanted is to do a simply inequality join like in SQL.
@ywhuofu if you want inequality join like in SQL then why you use by=.EACHI? it adds SQL's _GROUP BY_ to your query. If you are grouping then it is reasonable to have an aggregate functions, in case of grouping on join it is mandatory to provide j, even when doing it for equi-join.
@jangorecki thanks for you response. So now I am a little confused. Based on my example, what should I do instead?
@ywhuofu Use set.seed when dealing with rnorm. I've changed year-5 to year-5L in your code to keep it as integer.
set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]
set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE
Note the year=x.year in j argument, the rationale for using x. explicitly is described in #1615, otherwise you would get year equal to the value to which it was joined in i data.table.
@jangorecki thanks a lot for the detailed observations. It seems the key is allow.cartesian. I think it could be useful to keep a cookbook for data.table strategies to replicate SQL queries.
One more interesting observation, the inequality join is faster than foverlaps.
Benchmark:
TODO: add MonetDBLite benchmark later.
# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query
subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject
system.time(
nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s
require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does
# some preprocessing and therefore should be included in the total run time)
# Thanks @jangorecki
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql))
# 53.3s
system.time({
setkey(subject, id, range1, range2)
folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps
Another non-equi joins comparison with RSQLite: (not an interval join)
system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s
sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql))
# 50.7s