When dealing with discrete data, grouping or subsetting by semi-open intervals is often used to prevent double counting. Would support for this be a possible low-priority enhancement?
I would suggest adding a very basic example, including the expected output.
@jangorecki I think they mean: let incbounds have a length of 2, so...
library(data.table)
between(3:4, 3L, 4L, incbounds = c(TRUE, FALSE))
returns TRUE, FALSE (since left is closed, right is open).
@jangorecki, yes, similar to what @franknarf1 suggested: allow incbounds to be a vector of length two to cover the mixed >=/< and >/<= cases. It is only of interest for doubles, since with integers one can simply increment or decrement the open bound to make it closed.
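To illustrate the integer-versus-double point above, here is a minimal base R sketch. The vectors and bounds are made up for illustration; nothing here uses or assumes any data.table internals.

```r
# Integers: the half-open interval [2, 5) equals the closed interval [2, 4],
# so shifting the open bound by one is enough.
x_int <- 1:6
half_open_int  <- x_int >= 2L & x_int < 5L
closed_adj_int <- x_int >= 2L & x_int <= 5L - 1L
identical(half_open_int, closed_adj_int)
# [1] TRUE

# Doubles: shifting the bound drops values strictly between 4 and 5.
x_dbl <- c(2, 4.5, 5)
half_open_dbl  <- x_dbl >= 2 & x_dbl < 5
closed_adj_dbl <- x_dbl >= 2 & x_dbl <= 5 - 1
half_open_dbl
# [1]  TRUE  TRUE FALSE
closed_adj_dbl
# [1]  TRUE FALSE FALSE
```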
I don't know the mechanics of Cbetween, but in the R-code based portion of between, it may look something like:
function (x, lower, upper, incbounds = c(TRUE, TRUE))
{
  # SOME CODE TO BE PASSED TO C (that branch is omitted here)
  # R-level branch: choose the comparison for each bound separately
  if (incbounds[[1]]) {
    if (incbounds[[2]]) {
      x >= lower & x <= upper   # closed on both sides
    } else {
      x >= lower & x < upper    # left closed, right open
    }
  } else {
    if (incbounds[[2]]) {
      x > lower & x <= upper    # left open, right closed
    } else {
      x > lower & x < upper     # open on both sides
    }
  }
}
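For anyone who wants to try the proposed semantics today, the idea can be sketched as a standalone base R function. `between2` is a hypothetical name, not part of data.table, and recycling a single flag to both bounds is an assumption made here to stay compatible with the current scalar `incbounds`.

```r
# Hypothetical helper (not a data.table function): between() with one
# inclusivity flag per bound.
between2 <- function(x, lower, upper, incbounds = c(TRUE, TRUE)) {
  # Assumption: a single TRUE/FALSE applies to both bounds, as it does today.
  incbounds <- rep_len(as.logical(incbounds), 2L)
  left  <- if (incbounds[[1L]]) x >= lower else x > lower
  right <- if (incbounds[[2L]]) x <= upper else x < upper
  left & right
}

res <- between2(3:4, 3L, 4L, incbounds = c(TRUE, FALSE))
res
# [1]  TRUE FALSE
```

Treating the two bounds independently avoids the four-way nested branch and gives the same result for each combination of flags.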
pretty sure this request is a duplicate of one that was closed
Yes, it's similar, but it doesn't ask for the infix operators which should make life easier. The +1/-1 to the bounds doesn't work for continuous cases, and if one wants to use data.table to calculate aggregate statistics on a partition of the data where the partition is based on a numeric variable, having semi-open intervals would be of value.
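As a rough illustration of the partition use case, the sketch below bins a numeric variable into left-closed, right-open intervals with base R's `cut(..., right = FALSE)` and aggregates per bin. The data and break points are invented for the example.

```r
# Toy data: values sitting exactly on a break must land in one bin only.
x <- c(0, 0.25, 0.5, 0.75, 1)
y <- c(10, 20, 30, 40, 50)
breaks <- c(0, 0.5, 1, Inf)

# right = FALSE gives the semi-open bins [0, 0.5), [0.5, 1), [1, Inf)
bin <- cut(x, breaks = breaks, right = FALSE)
totals <- tapply(y, bin, sum)
as.vector(totals)
# [1] 30 70 50
```

With data.table loaded, the same grouping could presumably be written as `DT[, .(total = sum(y)), by = .(bin = cut(x, breaks, right = FALSE))]`; the point is only that each boundary value is counted exactly once.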
I would like to support this feature request for performance reasons.
A workaround for between() with semi-open intervals is a non-equi join, but this seems to be an order of magnitude slower.
Inspired by this question on SO
set.seed(123L)
x = runif(1E6)
library(data.table)
DT = data.table(x)
lb <- 0.04
ub <- 0.5
microbenchmark::microbenchmark(
DT[x < ub & x > lb],
x[x < ub & x > lb],
between = DT[between(x, lb, ub, incbounds = FALSE)],
inrange = DT[inrange(x, lb, ub, incbounds = FALSE)],
nej1 = DT[.(lb, ub), on = .(x > V1, x < V2), .(x.x)],
nej2 = DT[DT[.(lb, ub), on = .(x > V1, x < V2), which = TRUE]]
)
Unit: milliseconds
                expr        min         lq      mean     median        uq      max neval cld
 DT[x < ub & x > lb]  15.558578  17.016286  23.45670  17.369076  19.85016 169.4464   100  ab
  x[x < ub & x > lb]  15.189487  16.608143  28.77891  16.910587  19.88668 168.2302   100   b
             between   8.419925   8.555183  14.91525   9.839388  10.52680 161.4220   100   a
             inrange  80.817627  86.465971  99.10402  88.965187 104.58042 244.1886   100   c
                nej1 144.998281 154.981905 171.04279 164.095535 170.83264 326.7286   100   d
                nej2 136.329569 147.461538 163.94825 156.019137 165.92941 324.2748   100   d
setkey(DT, x)
microbenchmark::microbenchmark(
DT[x < ub & x > lb],
x[x < ub & x > lb],
psidom = DT[{ind <- DT[.(c(lb, ub)), which=TRUE, roll=TRUE, on=.(x)]; (ind[1]+1):ind[2]}],
between = DT[between(x, lb, ub, incbounds = FALSE)],
inrange = DT[inrange(x, lb, ub, incbounds = FALSE)],
nej1 = DT[.(lb, ub), on = .(x > V1, x < V2), .(x.x)],
nej2 = DT[DT[.(lb, ub), on = .(x > V1, x < V2), which = TRUE]]
)
Unit: milliseconds
                expr       min        lq      mean    median         uq      max neval cld
 DT[x < ub & x > lb]  9.371565 10.975596 38.700709 11.626800  14.169141 264.4175   100   b
  x[x < ub & x > lb] 15.164193 15.912562 38.015451 17.107335  19.640628 274.5802   100   b
              psidom  3.654016  3.913354  7.151814  4.023114   4.502573 156.2290   100   a
             between  5.624871  5.885929 17.252579  7.216537   7.779761 201.8898   100  ab
             inrange 10.537598 10.909364 32.000093 12.272452  13.224624 208.6633   100   b
                nej1 45.113488 46.838673 79.150484 48.698738 111.394732 252.0148   100   c
                nej2 39.767544 40.737712 84.363850 42.301073 117.495852 339.6896   100   c
devtools::session_info()
Session info -------------------------------------------------------------------------------------------------------------
setting value
version R version 3.4.2 (2017-09-28)
system x86_64, linux-gnu
ui RStudio (1.1.383)
language (EN)
collate C.UTF-8
tz Zulu
date 2017-11-10
Packages -----------------------------------------------------------------------------------------------------------------
data.table * 1.10.4-3 2017-10-27 CRAN (R 3.4.2)
microbenchmark 1.4-2.1 2015-11-25 CRAN (R 3.4.2)
Confirming @UweBlock's benchmark holds on current devel.
Is there agreement on the API proposed by @franknarf1 at the beginning of this issue?
No issues here.
Yes, makes sense