Would this be possible/advisable even in grouping settings?
I have in mind doing `lapply(.SD, ...)` for a `.SD` of 100s or 1000s of columns -- fundamentally, if I'm not mistaken, we have to store the whole output in memory before assigning, so the `for` loop approach would be much more memory-efficient.
Ideally it'd be done without any speed impact. Any considerations I'm missing?
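To make the tradeoff concrete, here is a minimal sketch (the column set and the `+ 1` operation are purely illustrative) of the two approaches being compared:

```r
library(data.table)

dt = data.table(a = rnorm(1e6), b = rnorm(1e6))
cols = names(dt)

# all-at-once: lapply(.SD, ...) materialises every result column before
# := assigns them, so all the temporaries are alive simultaneously
dt[, (cols) := lapply(.SD, `+`, 1), .SDcols = cols]

# one-at-a-time: only a single temporary vector exists at any moment, so
# peak memory is roughly one column's worth regardless of ncol(dt)
for (col in cols) set(dt, j = col, value = dt[[col]] + 1)
```

Both end states are identical; the question is whether the internal `lapply(.SD, ...)` path could be made to behave like the loop without losing speed.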
Another issue asks for this, also mentioning support for doing it by group; see #1441.
Sounds interesting.
In trying to test this idea, I assumed that ``dt[, `:=`(a = a + 1, b = b + 1)]`` would be more memory-efficient than ``dt[, names(dt) := lapply(.SD, '+', 1)]``. It was not. The only case where memory allocation improved was `dt[, single_var := single_var + 1]`.
Take this with a grain of salt -- I used `bench::mark`, which I have heard may have trouble measuring memory allocations.
Benchmark data
```r
library(data.table)
dt = data.table(a = rnorm(1e6),
                b = rnorm(1e6))
bench::mark(copy(dt))
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)     4.19ms    4.6ms      202.    15.3MB     242.
bench::mark(
  new_cols_functional = copy(dt)[, `:=`(a_new = a + 2, b_new = b + 2)],
  new_cols_lhs_assign = copy(dt)[, paste(names(dt), 'new', sep = '_') := .(a + 2, b + 2)],
  new_cols_lapply     = copy(dt)[, paste(names(dt), 'new', sep = '_') := lapply(.SD, '+', 2)]
)
#> # A tibble: 3 x 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new_cols_functional     14ms   15.4ms      63.7    47.3MB     170.
#> 2 new_cols_lhs_assign   13.2ms   14.9ms      61.4    45.8MB     102.
#> 3 new_cols_lapply       13.1ms   13.8ms      68.4    45.8MB     137.
bench::mark(
  functional    = copy(dt)[, `:=`(a = a + 2, b = b + 2)],
  lhs_assign    = copy(dt)[, c('a', 'b') := .(a + 2, b + 2)],
  lapply_assign = copy(dt)[, names(dt) := lapply(.SD, '+', 2)],
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, j = 1L, value = dt_copy[['a']] + 2)
    set(dt_copy, j = 2L, value = dt_copy[['b']] + 2)
    dt_copy
  }
)
#> # A tibble: 4 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 functional     12.9ms     13ms      73.0    45.8MB     92.9
#> 2 lhs_assign     12.8ms   13.1ms      71.2    45.8MB     71.2
#> 3 lapply_assign  12.9ms   13.1ms      72.3    45.8MB     67.5
#> 4 using_set      12.2ms   13.2ms      73.4    45.8MB     165.
bench::mark(
  copy(dt)[, a := a + 2],
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, j = 1L, value = dt_copy[['a']] + 2)
    dt_copy
  }
)
#> # A tibble: 2 x 6
#>   expression                      min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>    <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)[, `:=`(a, a + 2)]   6.66ms   6.8ms     138.    22.9MB     71.2
#> 2 using_set                    8.23ms  8.76ms     105.    30.5MB     42.2
```
I think `mem_alloc` is not exactly what I'm after:

> **mem_alloc** (`bench_bytes`): Total amount of memory allocated by R while running the expression. Memory allocated outside the R heap, e.g. by `malloc()` or `new` directly, is not tracked, take care to avoid misinterpreting the results if running code that may do this.
First, regarding `malloc`: `assign.c` uses it a lot:
```sh
grep -r "alloc[(]" src/assign.c
src/assign.c:    buf = (int *) R_alloc(length(cols), sizeof(int));
src/assign.c:    char *s4 = (char*) malloc(strlen(c1) + 3);
src/assign.c:    char *s5 = (char*) malloc(strlen(tc2) + 5); // 4 * '_' + \0
src/assign.c:    int *tt = (int *)R_alloc(ndelete, sizeof(int));
src/assign.c:    SEXP *temp = (SEXP *)malloc(nAdd * sizeof(SEXP *));
src/assign.c:    saveds = (SEXP *)malloc(nalloc * sizeof(SEXP));
src/assign.c:    savedtl = (R_len_t *)malloc(nalloc * sizeof(R_len_t));
src/assign.c:    char *tmp = (char *)realloc(saveds, nalloc*sizeof(SEXP));
src/assign.c:    tmp = (char *)realloc(savedtl, nalloc*sizeof(R_len_t));
```
Second, it's tracking _total_ memory allocation, whereas what I had in mind is _peak_ memory allocation -- rather than create a big 4GB list and then pass it to assign, I want to go 40MB at a time for 100 columns (say), so total memory allocation is roughly the same, but at peak we only need 40MB of RAM.
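One rough way to observe the peak (rather than total) allocation from inside R is `gc(reset = TRUE)`, whose "max used" column reports the high-water mark since the last reset. A sketch, assuming a 100-column table:

```r
library(data.table)
dt = data.table(matrix(rnorm(1e6), ncol = 100))
cols = names(dt)

gc(reset = TRUE)
dt[, (cols) := lapply(.SD, `+`, 1), .SDcols = cols]
gc()  # "max used" reflects all 100 temporary result columns alive at once

gc(reset = TRUE)
for (col in cols) set(dt, j = col, value = dt[[col]] + 1)
gc()  # "max used" is roughly a single temporary column
```

This only tracks the R heap, so it has the same blind spot around `malloc` as `bench::mark`, but it does distinguish peak from total.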
Update by reference lets users access values computed for previous groupings in `by`. If your idea of translating `lapply(...)` into `set(...)` were extended to `Map(...)` and/or `mapply(...)`, users would also be able to access previously assigned values.
Here's a shortened version of this SO post:
```r
library(data.table)
ops = fread("RES  FUN  VAR1  VAR2
             P1   fx1  var1  var2
             P2   fx2  P1    var1
             P3   fx1  P2    var2")
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
fx1 = function(x, y) x * 2
fx2 = function(x, y) x / y / x
vals[, (ops[["RES"]]) := Map(function(fx, v1, v2)
  do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
  ops[['FUN']], ops[['VAR1']], ops[['VAR2']])]
vals
## OOPS! P2 and P3 are unexpected!
      ID  var1  var2    P1    P2    P3
   <int> <int> <int> <num> <num> <num>
1:     1     2     1     4   NaN     0
2:     2     3    10     6   NaN     0
3:     4     6     1    12   NaN     0
```
The current way to address this in the context of data.table, as proposed by chinsoon22, is to use a `by` statement to gain access to prior groupings. The downside is that this allocates more memory than necessary because the `dt` repeats:
```r
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
vals[, c("P1", "P2", "P3") := lapply(.SD, as.numeric), .SDcols = c("P1", "P2", "P3")]
ops[,
    vals[,
         (RES) := Map(function(fx, v1, v2)
           do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
           FUN, VAR1, VAR2)
    ]
  , by = seq_len(nrow(ops))
]
## we tripled the memory needed
   seq_len    ID  var1  var2    P1        P2        P3
     <int> <int> <int> <int> <num>     <num>     <num>
1:       1     1     2     1     4 0.0000000 0.0000000
2:       1     2     3    10     6 0.0000000 0.0000000
3:       1     4     6     1    12 0.0000000 0.0000000
4:       2     1     2     1     4 0.5000000 0.0000000
5:       2     2     3    10     6 0.3333333 0.0000000
6:       2     4     6     1    12 0.1666667 0.0000000
7:       3     1     2     1     4 0.5000000 1.0000000
8:       3     2     3    10     6 0.3333333 0.6666667
9:       3     4     6     1    12 0.1666667 0.3333333
## when this is all we wanted
vals
      ID  var1  var2    P1        P2        P3
   <int> <int> <int> <num>     <num>     <num>
1:     1     2     1     4 0.5000000 1.0000000
2:     2     3    10     6 0.3333333 0.6666667
3:     4     6     1    12 0.1666667 0.3333333
```
Just something to consider, with the main downside being that this would make such operations less thread-safe.
> The downside is that this allocates more memory than necessary because the `dt` repeats

Maybe return NULL in j to reduce memory usage?
> The current way to address this in the context of data.table, as proposed by chinsoon22, is to use a `by` statement to gain access to prior groupings.
By the way, I am not the first to do this -- I learned it from an SO post by Matt. There is also a recent regression on his post: it no longer works without a copy first; see #4184.
Not to hijack this too much, but returning NULL helps a lot, and a loop with `set()` really shines in performance. I am still uncertain about how `bench::mark` handles `malloc`, but below are some benchmarks.
```
## 3 rows
# A tibble: 4 x 13
  expression       min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
1 by_no_null    2.39ms 2.69ms      360.    73.5KB     2.04
2 by_with_null  2.51ms 2.87ms      286.    72.9KB     2.06
3 for_loop_set  4.45ms 4.67ms      209.    11.1KB     2.03
4 for_loop_dt    6.5ms 6.88ms      143.    61.8KB     4.15

## 300 rows
# A tibble: 4 x 13
  expression       min median `itr/sec` mem_alloc
  <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>
1 by_no_null    6.14ms 6.71ms      143.   227.9KB
2 by_with_null  6.45ms 7.29ms      80.9   102.9KB
3 for_loop_set  4.52ms  4.7ms      163.    25.4KB
4 for_loop_dt    6.5ms  6.9ms      145.      69KB

## 30,000 rows
# A tibble: 4 x 13
  expression        min   median `itr/sec` mem_alloc
  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 by_no_null   455.38ms 455.52ms      2.20   14.68MB
2 by_with_null 435.25ms 451.06ms      2.22    2.54MB
3 for_loop_set   5.18ms   5.46ms    169.      1.38MB
4 for_loop_dt    7.14ms   7.53ms    132.    765.05KB
```
Code to reproduce above
```r
library(data.table)
ops = fread("RES  FUN  VAR1  VAR2
             P1   fx1  var1  var2
             P2   fx2  P1    var1
             P3   fx1  P2    var2")
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
vals = vals[rep(seq_len(.N), 10000L)]  ## use or ignore
vals[, paste0("P", 1:3) := lapply(.SD, as.numeric), .SDcols = paste0("P", 1L:3L)]
bench::mark(
  by_no_null = {
    ops[,
        vals[, (RES) := as.numeric(mapply(function(x, y) match.fun(FUN)(x, y),
                                          get(VAR1), get(VAR2)))],
        1L:nrow(ops)]
    vals
  },
  by_with_null = {
    ops[, {
      vals[, (RES) := as.numeric(mapply(function(x, y) match.fun(FUN)(x, y),
                                        get(VAR1), get(VAR2)))]
      NULL
    },
    1L:nrow(ops)]
    vals
  },
  for_loop_set = {
    res  <- ops[['RES']]
    fx   <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      set(vals, j = res[i], value = do.call(fx[i], list(vals[[var1[i]]], vals[[var2[i]]])))
    }
    vals
  },
  for_loop_dt = {
    res  <- ops[['RES']]
    fx   <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      vals[, (res[i]) := do.call(fx[i], unname(.SD)), .SDcols = c(var1[i], var2[i])]
    }
    vals
  }
)
```