Data.table: Allow semi-open intervals in `between`

Created on 23 Aug 2017 · 11 Comments · Source: Rdatatable/data.table

When dealing with discrete data, grouping or subsetting by semi-open intervals is often done to prevent double counting. Is this a possible eventual low-priority enhancement?

enhancement feature request

aadler

All 11 comments

I would suggest adding a very basic example, including the expected output.

jangorecki on 23 Aug 2017

@jangorecki I think they mean: let incbounds have a length of 2, so...

library(data.table)
between(3:4, 3L, 4L, incbounds = c(TRUE, FALSE))

returns TRUE, FALSE (since left is closed, right is open).
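The proposed behaviour can be emulated in plain R. The following is a minimal sketch under stated assumptions: `between_semi` is a hypothetical helper name, not part of data.table, and it makes no attempt to match the performance of the C-backed `between`.

```r
# Hypothetical helper, not part of data.table: emulates a length-2 incbounds.
between_semi <- function(x, lower, upper, incbounds = c(TRUE, TRUE)) {
  # Recycle a scalar incbounds so that it applies to both bounds.
  incbounds <- rep_len(as.logical(incbounds), 2L)
  left  <- if (incbounds[1L]) x >= lower else x > lower
  right <- if (incbounds[2L]) x <= upper else x < upper
  left & right
}

between_semi(3:4, 3L, 4L, incbounds = c(TRUE, FALSE))
# [1]  TRUE FALSE
between_semi(3:4, 3L, 4L, incbounds = c(FALSE, TRUE))
# [1] FALSE  TRUE
```

Passing a single `TRUE` or `FALSE` keeps the current closed/open semantics, so a change along these lines would presumably be backward compatible.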

franknarf1 on 23 Aug 2017

👍4

@jangorecki, yes, similar to what @franknarf1 suggested: allow incbounds to be a vector of length two, so that each bound can be open or closed independently (the >=/<= cases). It is only of interest for doubles, as in the case of integers, one can simply increment or decrement the open bound to make it closed.

I don't know the mechanics of Cbetween, but in the R-code-based portion of between, it may look something like:
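To illustrate the integer case, here is a small sketch using the existing `between()` with its default closed bounds. The values are made up for illustration; the second half shows why the same shift does not carry over to doubles.

```r
library(data.table)

# Integers: the semi-open interval [3, 7) equals the closed interval [3, 6].
x <- 1:10
closed_trick <- between(x, 3L, 7L - 1L)   # closed on both sides
semi_open    <- x >= 3L & x < 7L          # what is actually wanted
identical(closed_trick, semi_open)
# [1] TRUE

# Doubles: 6.5 lies in [3, 7), but shifting the upper bound to 6 excludes it.
between(6.5, 3, 7 - 1)
# [1] FALSE
```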

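function (x, lower, upper, incbounds = c(TRUE, TRUE)) 
{
    # SOME CODE TO BE PASSED TO C would go here (omitted);
    # what follows is only the pure-R fallback branch.
    if (incbounds[[1]]) {
        if (incbounds[[2]]) {
            x >= lower & x <= upper   # [lower, upper]
        } else {
            x >= lower & x < upper    # [lower, upper)
        }
    } else {
        if (incbounds[[2]]) {
            x > lower & x <= upper    # (lower, upper]
        } else {
            x > lower & x < upper     # (lower, upper)
        }
    }
}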

aadler on 24 Aug 2017

pretty sure this request is a duplicate of one that was closed

MichaelChirico on 24 Aug 2017

#1489

MichaelChirico on 24 Aug 2017

Yes, it's similar, but this request doesn't ask for infix operators, which should make life easier. The +1/-1 adjustment to the bounds doesn't work for continuous data, and if one wants to use data.table to calculate aggregate statistics on a partition of the data, where the partition is based on a numeric variable, having semi-open intervals would be of value.
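As a sketch of the partitioning use case, base R's `findInterval()` already bins by left-closed, right-open intervals by default, so each value lands in exactly one group. The column names `x`, `y`, `bin` and the break points below are assumptions made up for illustration; this is a workaround, not the requested `between` feature.

```r
library(data.table)

DT <- data.table(x = c(0, 0.25, 0.5, 0.75, 1), y = 1:5)
breaks <- c(0, 0.5, 1)

# findInterval() assigns bin i when breaks[i] <= x < breaks[i + 1],
# so a value sitting on a break point is counted once, in the upper bin.
DT[, bin := findInterval(x, breaks)]

agg <- DT[, .(total = sum(y)), by = bin]
agg
#    bin total
# 1:   1     3
# 2:   2     7
# 3:   3     5
```

Note that `x = 0.5` falls in bin 2 rather than bin 1, which is the "no double counting" property the request is after.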

aadler on 24 Aug 2017

I would like to support this feature request for performance reasons.

A workaround for between() with semi-open intervals is a non-equi join, but this seems to be an order of magnitude slower.

Inspired by this question on SO

set.seed(123L)
x = runif(1E6)
library(data.table)
DT = data.table(x)
lb <- 0.04
ub <- 0.5
microbenchmark::microbenchmark(
  DT[x < ub & x > lb], 
  x[x < ub & x > lb],
  between = DT[between(x, lb, ub, incbounds = FALSE)],
  inrange = DT[inrange(x, lb, ub, incbounds = FALSE)],
  nej1 = DT[.(lb, ub), on = .(x > V1, x < V2), .(x.x)],
  nej2 = DT[DT[.(lb, ub), on = .(x > V1, x < V2), which = TRUE]]
)

Unit: milliseconds
                expr        min         lq      mean     median        uq      max neval  cld
 DT[x < ub & x > lb]  15.558578  17.016286  23.45670  17.369076  19.85016 169.4464   100 ab  
  x[x < ub & x > lb]  15.189487  16.608143  28.77891  16.910587  19.88668 168.2302   100  b  
             between   8.419925   8.555183  14.91525   9.839388  10.52680 161.4220   100 a   
             inrange  80.817627  86.465971  99.10402  88.965187 104.58042 244.1886   100   c 
                nej1 144.998281 154.981905 171.04279 164.095535 170.83264 326.7286   100    d
                nej2 136.329569 147.461538 163.94825 156.019137 165.92941 324.2748   100    d
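The benchmark above uses fully open intervals. For reference, a semi-open variant of the non-equi join workaround might look like the sketch below, which mixes `>=` and `<` in `on`. The data are made up for illustration, and the result is compared against plain logical subsetting rather than assumed.

```r
library(data.table)

DT <- data.table(x = c(0.04, 0.1, 0.3, 0.5, 0.7))
lb <- 0.04
ub <- 0.5

# Non-equi join on [lb, ub): V1 and V2 are the auto-generated column names
# of the list passed as i; x.x refers to DT's own x column.
res_join <- DT[.(lb, ub), on = .(x >= V1, x < V2), .(x = x.x)]

# The same semi-open subset via vectorised comparisons.
res_vec <- DT[x >= lb & x < ub]

setequal(res_join$x, res_vec$x)
```

The lower bound 0.04 is kept and the upper bound 0.5 is dropped in both versions.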

setkey(DT, x)
microbenchmark::microbenchmark(
  DT[x < ub & x > lb], 
  x[x < ub & x > lb],
  psidom = DT[{ind <- DT[.(c(lb, ub)), which=TRUE, roll=TRUE, on=.(x)]; (ind[1]+1):ind[2]}],
  between = DT[between(x, lb, ub, incbounds = FALSE)],
  inrange = DT[inrange(x, lb, ub, incbounds = FALSE)],
  nej1 = DT[.(lb, ub), on = .(x > V1, x < V2), .(x.x)],
  nej2 = DT[DT[.(lb, ub), on = .(x > V1, x < V2), which = TRUE]]
)

Unit: milliseconds
                expr       min        lq      mean    median         uq      max neval cld
 DT[x < ub & x > lb]  9.371565 10.975596 38.700709 11.626800  14.169141 264.4175   100  b 
  x[x < ub & x > lb] 15.164193 15.912562 38.015451 17.107335  19.640628 274.5802   100  b 
              psidom  3.654016  3.913354  7.151814  4.023114   4.502573 156.2290   100 a  
             between  5.624871  5.885929 17.252579  7.216537   7.779761 201.8898   100 ab 
             inrange 10.537598 10.909364 32.000093 12.272452  13.224624 208.6633   100  b 
                nej1 45.113488 46.838673 79.150484 48.698738 111.394732 252.0148   100   c
                nej2 39.767544 40.737712 84.363850 42.301073 117.495852 339.6896   100   c

devtools::session_info()
Session info -------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.2 (2017-09-28)
 system   x86_64, linux-gnu           
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  C.UTF-8                     
 tz       Zulu                        
 date     2017-11-10                  

Packages -----------------------------------------------------------------------------------------------------------------
 data.table     * 1.10.4-3 2017-10-27 CRAN (R 3.4.2)
 microbenchmark   1.4-2.1  2015-11-25 CRAN (R 3.4.2)