can you elaborate on this?
My suggestion regarding this issue would be to isolate `[.data.table` and `bmerge` by adding a separate layer function that prepares the `i` data (key manipulation, attributes, etc.) from the `i` argument (and possibly some others).
At the moment the auto.index feature could fit into that new intermediate function, and there are multiple FRs which could be implemented there without increasing the complexity of `[` (a rough sketch of the idea follows).
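A minimal sketch of what such a layer could look like; the name `prepare_i` and the returned fields are hypothetical, chosen only for illustration, and real preparation for `bmerge` involves much more (key reuse, secondary indices, attribute handling):

```r
library(data.table)

# hypothetical `prepare_i`: normalize the raw `i` argument and collect what
# the join machinery would need, so `[.data.table` itself only orchestrates
prepare_i = function(x, i) {
  if (!is.data.table(i)) i = as.data.table(i)
  list(i     = i,
       on    = intersect(names(i), names(x)),  # join columns, simplified
       keyed = haskey(i))
}

x = data.table(a = 1:3, b = 4:6, key = "a")
str(prepare_i(x, data.table(a = 2:3)))
```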
Any ideas on this?
Currently `[.data.table` has huge chunks of code like `if (missing(i)) ...`, `if (missing(j))`, `# Grouping ...`, which contain quite a lot of nested conditions.
I think separating the evaluation of _control flow_ conditions from the computation on the dataset would make the code more readable. It also allows better isolation of the _scenarios_ to compute, resulting in more modular code.
A high-level view of the current `[.data.table` implementation, based on a few samples of the code:
```r
...
if (!missing(i)) {
  ...
  if (is.call(isub) && isub[[1L]] == as.name("order") && getOption("datatable.optimize") >= 1) {...}
  ...
  if (is.call(isub) && isub[[1L]] == quote(forder)) {...} else if (is.call(isub) && getOption("datatable.auto.index") &&
      as.character(isub[[1L]]) %chin% c("==","%in%") &&
      is.name(isub[[2L]]) &&
      (isub2 <- as.character(isub[[2L]])) %chin% names(x) &&
      is.null(attr(x, '.data.table.locked'))) {...}
} else {...}
...
if (missing(j)) {...} else {
  ...
  if (is.call(jsub) && jsub[[1L]] == as.name(":=")) {...}
  ...
}
...
```
The above code in the proposed structure:
```r
# control flow conditions evaluation
do_i = !missing(i)
do_j = !missing(j)
optimize_i = do_i && getOption("datatable.optimize") >= 1 && is.call(isub) && isub[[1L]] == as.name("order")
x_unlocked = is.null(attr(x, '.data.table.locked'))
create_index = do_i && getOption("datatable.auto.index") && x_unlocked && is.call(isub) &&
  as.character(isub[[1L]]) %chin% c("==","%in%") && is.name(isub[[2L]]) &&
  as.character(isub[[2L]]) %chin% names(x)
update_by_ref = do_j && is.call(jsub) && jsub[[1L]] == as.name(":=")

# computation on dataset
if (do_i) {
  if (optimize_i) {...}
  if (create_index) {...}
  ...
} else {...}
...
if (!do_j) {...} else {
  ...
  if (update_by_ref) {...}
  ...
}
...
```
In the above code, checks of the kind `is.call(x) && x[[1L]] == ...` could be replaced by some kind of _if_call_and_if_matching_to_ wrapper function, as sketched below.
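A minimal sketch of such a wrapper; the name `is_call_to` is illustrative, not an existing data.table helper:

```r
library(data.table)  # for %chin%, data.table's fast character match

# TRUE when `e` is a call whose head matches one of `names`,
# e.g. is_call_to(isub, "order") or is_call_to(isub, c("==", "%in%"))
is_call_to = function(e, names) {
  is.call(e) && as.character(e[[1L]]) %chin% names
}

# the control flow flags above would then shorten to, e.g.:
# optimize_i = do_i && getOption("datatable.optimize") >= 1 && is_call_to(isub, "order")
```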
Additionally, besides the _control flow_ and _computation on dataset_ sections, there could be a section to compute some variables like `leftcols`, `rightcols`, `byval`, `ansvars`, etc., or to prepare the auto.index (see the post above).
If we were able to use only a single return at the end of the function, that would also simplify handling the visibility/invisibility of the returned object.
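For instance, with a single exit point, visibility can be tracked in one flag and applied once at the end; a minimal sketch with illustrative names only:

```r
subset_like = function(x, update_by_ref = FALSE) {
  ans = x                    # ... all computation funnels into `ans` ...
  visible = !update_by_ref   # e.g. `:=` results should print invisibly
  if (visible) ans else invisible(ans)
}
```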
If you would consider switching to such a structure, I can prepare a PR after the 1.9.6 release. The PR would be only a first step, just wrapping the current structure into the new shape, yet enough to cover all unit tests, and it could later be updated to reduce the huge chunks to more isolated code blocks.
@jangorecki thanks. Yes it should be more functional, agreed. I'll be working on this for v1.9.8. Have spent some time on this already.
@arunsrinivasan you could push your dev version to a new branch, and maybe put a task list of TODO features here, so the _goals_ are defined. It could additionally bring some feedback from potential contributors.
Ideally we would isolate the processing into atomic functions, like `process.i(subi)` and `process.j(subj)`.
I am not sure how big the overhead of a solution like this would be, but it could probably be avoided by simplifying the code. This gist is a skeleton of how the isolation could be done, ready to run in the console: https://gist.github.com/jangorecki/1ef34c0ccdc947ae7a29c158330e7e92
The example `meta.data.table` there is like data.table but remembers the recent transformations (`i`, `j`, `by`) and the source data, together with call information and statistics.
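To illustrate the shape of such atomic functions, here is a stripped-down sketch (illustration only; real `[.data.table` handles joins, keys, `by`, and much more):

```r
library(data.table)

# `process.i`: resolve the `i` argument into integer row numbers
process.i = function(x, isub) {
  if (is.logical(isub)) which(isub) else as.integer(isub)
}

# `process.j`: evaluate the `j` expression within the selected rows
process.j = function(x, jsub, irows, enclos = parent.frame()) {
  eval(jsub, envir = x[irows], enclos = enclos)
}

dt = data.table(a = 1:5, b = letters[1:5])
irows = process.i(dt, dt$a > 2L)
process.j(dt, quote(sum(a)), irows)
#> [1] 12
```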
@jangorecki I have been working on code to make `.SDcols` more modular. Specifically, my aim is to let `j`, `by` and `.SDcols` reuse similar code through a new function. Users would be able to use the same selection code in any of the above and get the same returns - it is largely that way currently, except the same bits of code live in three separate places. The function now works for `.SDcols`; the other variables are more tangled and need to be checked more thoroughly.
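Roughly, the shared helper resolves any accepted column spec to positions once; a simplified sketch (the hypothetical `resolve_cols` below is not the branch's actual function):

```r
# resolve a column spec (character names, numeric positions, or a logical
# mask) to integer column positions, usable from j, by and .SDcols alike
resolve_cols = function(x, spec) {
  if (is.character(spec)) match(spec, names(x))
  else if (is.logical(spec)) which(spec)
  else as.integer(spec)
}

resolve_cols(data.frame(a = 1, b = 2, c = 3), c("c", "a"))  # 3 1
```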
I saw you assigned yourself some related items. Should I hold off?
I assigned it to myself because I am working on it too. I will try to push today, so we can work together.
I have created the modular_dt2 branch. This is a tangled web, so this attempt was more of a light refactoring than full-fledged modularity. Most of the work was copying and pasting code elsewhere. In one case, we have `if (notjoin) i = !i` with a note to update that to allow for `which_`. I have incorporated that in this branch.
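For context, my reading of that note (an assumption about the intent, not the branch's exact code): `i = !i` materializes a second full-length logical vector before the row numbers are extracted, which a single pass could avoid:

```r
ind = sample(c(TRUE, FALSE), 1e7, TRUE)
irows = which(!ind)  # `!ind` allocates ~40MB for 1e7 logicals, then which() scans it
# a single C pass emitting row numbers where ind is FALSE would skip that copy
```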
Help / feedback welcome. Most of my comments include `!!!` to distinguish comments I added from those that already existed.
| Branch | Expression | Minimum time (microseconds) |
|--------|------------|-----------------------------|
| 1.12.9 master | `CsubsetDT` | 102 |
| 1.12.9 modular_dt2 | `dt[1L]` | 130 |
| 1.12.9 with2 | `dt[1L, with = c(i = FALSE)]` | 173 |
| Branch | Expression | Minimum time (microseconds) |
|--------|------------|-----------------------------|
| 1.12.9 master | `dt[]` | 15 |
| 1.12.9 modular_dt2 | `dt[, .SD]` | 67* |
| 1.12.9 master | `dt[, .SD]` | 2190 |

\* Note: shallow is used when `is.null(irows) && jsub == quote(.SD)`, so this might not be the fairest comparison. It is mainly a proof of concept of how we can exit `j` early (a sketch of this follows below).
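A sketch of what that early exit could look like (my illustration of the note above, not the branch's exact code; `shallow` is data.table's internal, unexported copy-the-header helper):

```r
# inside a hypothetical j dispatcher: when no row subset was computed and j
# is literally `.SD`, return a shallow copy early instead of running the
# general j evaluation machinery
early_exit_j = function(x, irows, jsub) {
  if (is.null(irows) && identical(jsub, quote(.SD)))
    return(data.table:::shallow(x))  # internal; shown for illustration only
  NULL  # signal: fall through to the general path
}
```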
```r
## 1.12.9 master
##    user  system elapsed
##   18.60    4.53   24.18
## 1.12.9 with2
##    user  system elapsed
##   12.69    4.61   17.84
## 1.12.9 modular_dt2
##    user  system elapsed
##   10.71    3.20   14.39
```
```r
library(data.table)
n = 1e7 ; grps = 1e2
set.seed(123L)
setDTthreads(1L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n), z = runif(n))
ind = sample(c(TRUE, FALSE), n, TRUE)
bench::mark(dt[!ind])
```
```r
## 1.12.9 master
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      133ms    191ms      5.82     192MB     11.6

## 1.12.9 modular_dt2
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      115ms    138ms      7.16     154MB     12.5
```
@jangorecki, how does this actually get implemented? I was trying to do a smaller PR for the `dt[!logical_ind]` case, but the git diff looks like I changed everything.
FWIW, recompiling with OpenMP removed (for all tests) gives the following timings:
```r
library(data.table)
## setDTthreads(1L) ## doesn't matter because this isn't compiled.
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
allIterations <- data.table(v1 = runif(1e5), v2 = runif(1e5))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with = c(i = FALSE)]))
```
| Version | Expression | Time (seconds) |
|---------|------------|----------------|
| master | `dt[r]` | 17.8 |
| with2 | `dt[r, with = c(i = FALSE)]` | 6.6 |
| modular_dt2 | `dt[r]` | 6.4 |
| modular_dt2 | `dt[1L]` | 5.9 |
| data.frame | `df[r, ]` | 5.0 |
| CsubsetDT | `fx(r)` | 2.3 |
So in summary, modular_dt2 is a 3x speed-up for a simple row subset, without any new arguments. I am not saying that this is what you were envisioning (I was hoping for something more modular / impressive, too), but I do not know a path to a modular dt that doesn't hurt the open PRs or leave reviewers wondering what was changed versus merely rearranged.
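For reference, the `fx(r)` row in the table was presumably a thin wrapper straight over the internal C subset routine, bypassing all of `[.data.table`'s argument handling; something like this (an assumption about what `fx` was; `CsubsetDT` is internal and unexported, with no argument checking, shown only to explain the baseline):

```r
library(data.table)
dt = data.table(v1 = runif(1e5), v2 = runif(1e5))

# subset rows `r` across all columns via the registered C routine directly
fx = function(r) .Call(data.table:::CsubsetDT, dt, as.integer(r), seq_along(dt))
fx(1L)
```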