Data.table: binary search extensions to `<`, `<=`, `>`, `>=`

Created on 30 Nov 2015 · 16Comments · Source: Rdatatable/data.table

Using on= as on=.(x == y, a <= b) -- as simple as that..

[x] handle operators >=, >, <= and <.
[x] support for mult="first" and mult="last"
[x] add by=.EACHI support
[x] update ?data.table
[x] add a benchmark

Then extend https://github.com/Rdatatable/data.table/issues/1068

High feature request

Source

arunsrinivasan

All 16 comments

Update posts:

http://stackoverflow.com/q/36284173/559784
http://stackoverflow.com/q/36207861/559784
http://stackoverflow.com/q/36190921/559784
http://stackoverflow.com/q/36533139/2490497
http://stackoverflow.com/q/37163070/559784

arunsrinivasan on 29 Mar 2016

This would be awesome if implemented !

ywhcuhk on 1 Apr 2016

Any plans to support syntax like on = .(x == y, a < b +3)?

MichaelChirico on 10 Apr 2016

Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Perhaps as a new issue would be great.

arunsrinivasan on 10 Apr 2016

Done, see #1639

MichaelChirico on 10 Apr 2016

👍1

@arunsrinivasan can you give a quick example of .EACHI failing in combination with a non-equi join?

eantonya on 18 May 2016

Here's an example where it'll fail:

require(data.table)
dt = data.table(id="x", a=as.integer(c(3,8,8,15,15,15,16,22,22,25,25)), b=as.integer(c(9,10,25,19,22,25,38,3,9,7,28)), c=as.integer(c(22,33,44,14,49,44,40,25,400,52,77)))

dt[.(a=c(12,20), b=20), sum(c), on=c("a>a", "b<=b"), by=.EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a",  : 
#   by-joins are not yet implemented for multi-group non-equi-joins.

The idea for non-equi joins is to split data.table x (here, dt) into groups where rows of all columns, independently, have values always in increasing order. Here, there'll be more than one such "group". In the examples you've worked so far, there must have been only one such group. If this is hard to follow, please wait a while. I'll have better explanations (with visualisations) for UseR'16 soon.

arunsrinivasan on 18 May 2016

Ok, thanks, so failure is actually catastrophic and not silent. That's good - I thought it just works incorrectly for some cases and was looking for an incorrect result.

Iirc the cases I've worked with all had a single column on LHS, which always satisfies the single group condition.

eantonya on 19 May 2016

Updated all SO posts linked.

arunsrinivasan on 25 Jun 2016

🎉1

great work!!

ywhcuhk on 25 Jun 2016

It seems that for this to work, one has to specify j.

Here is a small example

> dt1 <- data.table(year=1991:2000, v=rnorm(10))
> dt1
    year          v
 1: 1991 -0.4465306
 2: 1992  0.3355444
 3: 1993  1.7731734
 4: 1994  0.5285609
 5: 1995 -1.7025382
 6: 1996  0.3752937
 7: 1997  2.0677762
 8: 1998  0.6509314
 9: 1999  0.6291038
10: 2000 -0.9639413
> dt2 <- data.table(start=dt1$year-5, end=dt1$year)
> dt2
    start  end
 1:  1986 1991
 2:  1987 1992
 3:  1988 1993
 4:  1989 1994
 5:  1990 1995
 6:  1991 1996
 7:  1992 1997
 8:  1993 1998
 9:  1994 1999
10:  1995 2000

what I want to get can be accomplished by foverlaps:

> dt1[, year_hlp:=year]
> setkey(dt1, year, year_hlp)
> setkey(dt2, start, end)
> ?foverlaps
> foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
    start  end year          v year_hlp
 1:  1986 1991 1991 -0.4465306     1991
 2:  1987 1992 1991 -0.4465306     1991
 3:  1987 1992 1992  0.3355444     1992
 4:  1988 1993 1991 -0.4465306     1991
 5:  1988 1993 1992  0.3355444     1992
 6:  1988 1993 1993  1.7731734     1993
 7:  1989 1994 1991 -0.4465306     1991
 8:  1989 1994 1992  0.3355444     1992
 9:  1989 1994 1993  1.7731734     1993
10:  1989 1994 1994  0.5285609     1994
11:  1990 1995 1991 -0.4465306     1991
12:  1990 1995 1992  0.3355444     1992
13:  1990 1995 1993  1.7731734     1993
14:  1990 1995 1994  0.5285609     1994
15:  1990 1995 1995 -1.7025382     1995
16:  1991 1996 1991 -0.4465306     1991
17:  1991 1996 1992  0.3355444     1992
18:  1991 1996 1993  1.7731734     1993
......

But I thought I could simply do this:

> dt1[dt2, on=.(year>=start, year<=end), by=.EACHI]

However, I got an error:

Error in `[.data.table`(dt1, dt2, on = .(year >= start, year <= end),  : 
  'by' or 'keyby' is supplied but not j

Why do I have to specify j? All I wanted is to do a simply inequality join like in SQL.

ywhcuhk on 28 Jun 2016

@ywhuofu if you want inequality join like in SQL then why you use by=.EACHI? it adds SQL's _GROUP BY_ to your query. If you are grouping then it is reasonable to have an aggregate functions, in case of grouping on join it is mandatory to provide j, even when doing it for equi-join.

jangorecki on 28 Jun 2016

@jangorecki thanks for you response. So now I am a little confused. Based on my example, what should I do instead?

ywhcuhk on 28 Jun 2016

@ywhuofu Use set.seed when dealing with rnorm. I've changed year-5 to year-5L in your code to keep it as integer.

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]

set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE

Note the year=x.year in j argument, the rationale for using x. explicitly is described in #1615, otherwise you would get year equal to the value to which it was joined in i data.table.

jangorecki on 28 Jun 2016

@jangorecki thanks a lot for the detailed observations. It seems the key is allow.cartesian. I think it could be useful to keep a cookbook for data.table strategies to replicate SQL queries.

One more interesting observation, the inequality join is faster than foverlaps.

ywhcuhk on 28 Jun 2016

👍1

Benchmark:

TODO: add MonetDBLite benchmark later.

Data:

# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query

subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject

Non-equi joins:

system.time(
  nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s

findOverlaps

require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does 
# some preprocessing and therefore should be included in the total run time)

RSQLite

# Thanks @jangorecki 
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 53.3s

foverlaps

system.time({
  setkey(subject, id, range1, range2)
  folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps

Another non-equi joins comparison with RSQLite: (not an interval join)

non-equi joins

system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s

RSQLite

sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql)) 
# 50.7s

arunsrinivasan on 30 Jun 2016

Was this page helpful?

0 / 5 - 0 ratings