can you elaborate on this?
My suggestion regarding this issue would be to isolate `[.data.table` and `bmerge` by adding a separate layer function that prepares the `i` data (key manipulation, attributes, etc.) from the `i` argument (and possibly some others).
At the moment the auto.index feature could fit into that new intermediate function, and there are multiple FRs which could be implemented there without increasing the complexity of `[` (a rough sketch of the idea follows).
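A minimal sketch of what such a layer could look like; the name `prepare_i` and the returned fields are hypothetical, chosen only for illustration, and real preparation for `bmerge` involves much more (key reuse, secondary indices, attribute handling):

```r
library(data.table)

# hypothetical `prepare_i`: normalize the raw `i` argument and collect what
# the join machinery would need, so `[.data.table` itself only orchestrates
prepare_i = function(x, i) {
  if (!is.data.table(i)) i = as.data.table(i)
  list(i     = i,
       on    = intersect(names(i), names(x)),  # join columns, simplified
       keyed = haskey(i))
}

x = data.table(a = 1:3, b = 4:6, key = "a")
str(prepare_i(x, data.table(a = 2:3)))
```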
Any ideas on this?
Currently `[.data.table` has huge chunks of code like `if (missing(i)) ...`, `if (missing(j))`, `# Grouping ...`, which contain quite a lot of nested conditions.
I think separating the evaluation of _control flow_ conditions from the computation on the dataset would make the code more readable. It also allows better isolation of the _scenarios_ to compute, resulting in more modular code.
A high-level view of the current `[.data.table` implementation, based on a few samples of the code:
```r
...
if (!missing(i)) {
  ...
  if (is.call(isub) && isub[[1L]] == as.name("order") && getOption("datatable.optimize") >= 1) {...}
  ...
  if (is.call(isub) && isub[[1L]] == quote(forder)) {...} else if (is.call(isub) && getOption("datatable.auto.index") &&
      as.character(isub[[1L]]) %chin% c("==","%in%") &&
      is.name(isub[[2L]]) &&
      (isub2 <- as.character(isub[[2L]])) %chin% names(x) &&
      is.null(attr(x, '.data.table.locked'))) {...}
} else {...}
...
if (missing(j)) {...} else {
  ...
  if (is.call(jsub) && jsub[[1L]] == as.name(":=")) {...}
  ...
}
...
```
The above code in the proposed structure:
```r
# control flow conditions evaluation
do_i = !missing(i)
do_j = !missing(j)
optimize_i = do_i && getOption("datatable.optimize") >= 1 && is.call(isub) && isub[[1L]] == as.name("order")
x_unlocked = is.null(attr(x, '.data.table.locked'))
create_index = do_i && getOption("datatable.auto.index") && x_unlocked && is.call(isub) &&
  as.character(isub[[1L]]) %chin% c("==","%in%") && is.name(isub[[2L]]) &&
  as.character(isub[[2L]]) %chin% names(x)
update_by_ref = do_j && is.call(jsub) && jsub[[1L]] == as.name(":=")

# computation on dataset
if (do_i) {
  if (optimize_i) {...}
  if (create_index) {...}
  ...
} else {...}
...
if (!do_j) {...} else {
  ...
  if (update_by_ref) {...}
  ...
}
...
```
In the above code, checks of the kind `is.call(x) && x[[1L]] == ...` could be replaced by some kind of _if_call_and_if_matching_to_ wrapper function, as sketched below.
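A minimal sketch of such a wrapper; the name `is_call_to` is illustrative, not an existing data.table helper:

```r
library(data.table)  # for %chin%, data.table's fast character match

# TRUE when `e` is a call whose head matches one of `names`,
# e.g. is_call_to(isub, "order") or is_call_to(isub, c("==", "%in%"))
is_call_to = function(e, names) {
  is.call(e) && as.character(e[[1L]]) %chin% names
}

# the control flow flags above would then shorten to, e.g.:
# optimize_i = do_i && getOption("datatable.optimize") >= 1 && is_call_to(isub, "order")
```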
Additionally, besides the _control flow_ and _computation on dataset_ sections, there could be a section to compute some variables like `leftcols`, `rightcols`, `byval`, `ansvars`, etc., or to prepare the auto.index (see the post above).
If we were able to use only a single return at the end of the function, that would also simplify handling the visibility/invisibility of the returned object.
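For instance, with a single exit point, visibility can be tracked in one flag and applied once at the end; a minimal sketch with illustrative names only:

```r
subset_like = function(x, update_by_ref = FALSE) {
  ans = x                    # ... all computation funnels into `ans` ...
  visible = !update_by_ref   # e.g. `:=` results should print invisibly
  if (visible) ans else invisible(ans)
}
```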
If you would consider switching to such a structure, I can prepare a PR after the 1.9.6 release. The PR would be only a first step, just wrapping the current structure into the new shape, yet enough to cover all unit tests, and it could later be updated to reduce the huge chunks to more isolated code blocks.
@jangorecki thanks. Yes it should be more functional, agreed. I'll be working on this for v1.9.8. Have spent some time on this already.
@arunsrinivasan you could push your dev version to a new branch, and maybe put a task list of TODO features here, so the _goals_ are defined. It could additionally bring some feedback from potential contributors.
Ideally we would isolate the processing into atomic functions, like `process.i(subi)` and `process.j(subj)`.
I am not sure how big the overhead of a solution like this would be, but it could probably be avoided by simplifying the code. This gist is a skeleton of how the isolation could be done, ready to run in the console: https://gist.github.com/jangorecki/1ef34c0ccdc947ae7a29c158330e7e92
The example `meta.data.table` there is like data.table but remembers the recent transformations (`i`, `j`, `by`) and the source data, together with call information and statistics.
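To illustrate the shape of such atomic functions, here is a stripped-down sketch (illustration only; real `[.data.table` handles joins, keys, `by`, and much more):

```r
library(data.table)

# `process.i`: resolve the `i` argument into integer row numbers
process.i = function(x, isub) {
  if (is.logical(isub)) which(isub) else as.integer(isub)
}

# `process.j`: evaluate the `j` expression within the selected rows
process.j = function(x, jsub, irows, enclos = parent.frame()) {
  eval(jsub, envir = x[irows], enclos = enclos)
}

dt = data.table(a = 1:5, b = letters[1:5])
irows = process.i(dt, dt$a > 2L)
process.j(dt, quote(sum(a)), irows)
#> [1] 12
```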
@jangorecki I have been working on code to make `.SDcols` more modular. Specifically, my aim is to let `j`, `by` and `.SDcols` reuse similar code through a new function. Users would be able to use the same selection code in any of the above and get the same returns - it is largely that way currently, except the same bits of code live in three separate places. The function now works for `.SDcols`; the other variables are more tangled and need to be checked more thoroughly.
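Roughly, the shared helper resolves any accepted column spec to positions once; a simplified sketch (the hypothetical `resolve_cols` below is not the branch's actual function):

```r
# resolve a column spec (character names, numeric positions, or a logical
# mask) to integer column positions, usable from j, by and .SDcols alike
resolve_cols = function(x, spec) {
  if (is.character(spec)) match(spec, names(x))
  else if (is.logical(spec)) which(spec)
  else as.integer(spec)
}

resolve_cols(data.frame(a = 1, b = 2, c = 3), c("c", "a"))  # 3 1
```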
I saw you assigned yourself some related items. Should I hold off?
I assigned it to myself because I am working on it too. I will try to push today, so we can work together.
I have created the modular_dt2 branch. This is a tangled web, so this attempt was more of a light refactoring than full-fledged modularity. Most of the work was copying and pasting code elsewhere. In one case, we have `if (notjoin) i = !i` with a note to update that to allow for `which_`. I have incorporated that in this branch.
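For context, my reading of that note (an assumption about the intent, not the branch's exact code): `i = !i` materializes a second full-length logical vector before the row numbers are extracted, which a single pass could avoid:

```r
ind = sample(c(TRUE, FALSE), 1e7, TRUE)
irows = which(!ind)  # `!ind` allocates ~40MB for 1e7 logicals, then which() scans it
# a single C pass emitting row numbers where ind is FALSE would skip that copy
```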
Help / feedback welcome. Most of my comments include `!!!` to distinguish comments I added from those that already existed.
| Branch | Expression | Minimum time (microseconds) |
|--------|------------|-----------------------------|
| 1.12.9 master | `CsubsetDT` | 102 |
| 1.12.9 modular_dt2 | `dt[1L]` | 130 |
| 1.12.9 with2 | `dt[1L, with = c(i = FALSE)]` | 173 |
| Branch | Expression | Minimum time (microseconds) |
|--------|------------|-----------------------------|
| 1.12.9 master | `dt[]` | 15 |
| 1.12.9 modular_dt2 | `dt[, .SD]` | 67* |
| 1.12.9 master | `dt[, .SD]` | 2190 |

\* Note: shallow is used when `is.null(irows) && jsub == quote(.SD)`, so this might not be the fairest comparison. It is mainly a proof of concept of how we can exit `j` early (a sketch of this follows below).
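A sketch of what that early exit could look like (my illustration of the note above, not the branch's exact code; `shallow` is data.table's internal, unexported copy-the-header helper):

```r
# inside a hypothetical j dispatcher: when no row subset was computed and j
# is literally `.SD`, return a shallow copy early instead of running the
# general j evaluation machinery
early_exit_j = function(x, irows, jsub) {
  if (is.null(irows) && identical(jsub, quote(.SD)))
    return(data.table:::shallow(x))  # internal; shown for illustration only
  NULL  # signal: fall through to the general path
}
```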
```r
## 1.12.9 master
##    user  system elapsed
##   18.60    4.53   24.18
## 1.12.9 with2
##    user  system elapsed
##   12.69    4.61   17.84
## 1.12.9 modular_dt2
##    user  system elapsed
##   10.71    3.20   14.39
```
```r
library(data.table)
n = 1e7 ; grps = 1e2
set.seed(123L)
setDTthreads(1L)
dt = data.table(x = sample(grps, n, TRUE), y = runif(n), z = runif(n))
ind = sample(c(TRUE, FALSE), n, TRUE)
bench::mark(dt[!ind])
```
```r
## 1.12.9 master
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      133ms    191ms      5.82     192MB     11.6

## 1.12.9 modular_dt2
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 dt[!ind]      115ms    138ms      7.16     154MB     12.5
```
@jangorecki, how does this actually get implemented? I was trying to do a smaller PR for the `dt[!logical_ind]` case, but the git diff looks like I changed everything.
FWIW, recompiling with OpenMP removed (for all tests) gives the following timings:
```r
library(data.table)
## setDTthreads(1L) ## doesn't matter because this isn't compiled.
DoSomething <- function(row) someCalculation <- row[["v1"]] + 1
allIterations <- data.table(v1 = runif(1e5), v2 = runif(1e5))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, ]))
system.time(for (r in 1:nrow(allIterations)) DoSomething(allIterations[r, , with = c(i = FALSE)]))
```
| Version | Expression | Time (seconds) |
|---------|------------|----------------|
| master | `dt[r]` | 17.8 |
| with2 | `dt[r, with = c(i = FALSE)]` | 6.6 |
| modular_dt2 | `dt[r]` | 6.4 |
| modular_dt2 | `dt[1L]` | 5.9 |
| data.frame | `df[r, ]` | 5.0 |
| CsubsetDT | `fx(r)` | 2.3 |
So in summary, modular_dt2 is a 3x speed-up for a simple row subset, without any new arguments. I am not saying that this is what you were envisioning (I was hoping for something more modular / impressive, too), but I do not know a path to a modular dt that doesn't hurt the open PRs or leave reviewers wondering what was changed versus merely rearranged.
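For reference, the `fx(r)` row in the table was presumably a thin wrapper straight over the internal C subset routine, bypassing all of `[.data.table`'s argument handling; something like this (an assumption about what `fx` was; `CsubsetDT` is internal and unexported, with no argument checking, shown only to explain the baseline):

```r
library(data.table)
dt = data.table(v1 = runif(1e5), v2 = runif(1e5))

# subset rows `r` across all columns via the registered C routine directly
fx = function(r) .Call(data.table:::CsubsetDT, dt, as.integer(r), seq_along(dt))
fx(1L)
```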