Data.table: Convert LHS := lapply(...) to a set(...) for loop internally?

Created on 2 Feb 2020 · 5 comments · Source: Rdatatable/data.table

Would this be possible/advisable even in grouping settings?

I have in mind doing lapply(.SD, ...) where .SD has 100s or 1000s of columns. Fundamentally, IINM, we currently have to store the full output in memory before assigning, so a for loop approach would be much more memory-efficient.

Ideally it'd be done without any speed impact. Any considerations I'm missing?
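For concreteness, here is a minimal sketch of the rewrite I have in mind (toy table and a trivial transform; names are illustrative only):

```r
library(data.table)

DT = data.table(a = rnorm(1e6), b = rnorm(1e6))
cols = names(DT)

# Today: lapply(.SD, ...) materializes the full list of results
# before := assigns it into DT
# DT[, (cols) := lapply(.SD, `+`, 1), .SDcols = cols]

# Proposed internal rewrite: one column at a time via set(), so at any
# moment only a single column's result needs to be held in memory
for (col in cols) {
  set(DT, j = col, value = DT[[col]] + 1)
}
```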


Another issue asks for this, and also mentions support for doing it by group; see #1441.

internals


All 5 comments

Sounds interesting.

In trying to test this idea, I assumed that dt[, `:=`(a = a + 1, b = b + 1)] would be more memory-efficient than dt[, names(dt) := lapply(.SD, '+', 1)]. That was not the case. The only time memory allocation improved was with dt[, single_var := single_var + 1].

Take this with a grain of salt - I used bench::mark, which I've heard may have trouble measuring memory allocations.


Benchmark data

library(data.table)
dt = data.table(a = rnorm(1e6),
                b = rnorm(1e6))

bench::mark(copy(dt))
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)     4.19ms    4.6ms      202.    15.3MB     242.

bench::mark(
  new_cols_functional = copy(dt)[, `:=`(a_new = a + 2, b_new = b + 2)],
  new_cols_lhs_assign = copy(dt)[, paste(names(dt), 'new', sep = '_') := .(a + 2, b + 2)],
  new_cols_lapply = copy(dt)[, paste(names(dt), 'new', sep = '_') := lapply(.SD, '+', 2)]
)
#> # A tibble: 3 x 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new_cols_functional     14ms   15.4ms      63.7    47.3MB     170.
#> 2 new_cols_lhs_assign   13.2ms   14.9ms      61.4    45.8MB     102.
#> 3 new_cols_lapply       13.1ms   13.8ms      68.4    45.8MB     137.

bench::mark(
  functional = copy(dt)[, `:=`(a = a + 2, b = b+2)],
  lhs_assign = copy(dt)[, c('a', 'b') := .(a + 2, b + 2)],
  lapply_assign = copy(dt)[, names(dt) := lapply(.SD, '+', 2)]
  ,
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, , j = 1L, dt_copy[['a']] + 2)
    set(dt_copy, , j = 2L, dt_copy[['b']] + 2)
    dt_copy
  }
)
#> # A tibble: 4 x 6
#>   expression         min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 functional      12.9ms     13ms      73.0    45.8MB     92.9
#> 2 lhs_assign      12.8ms   13.1ms      71.2    45.8MB     71.2
#> 3 lapply_assign   12.9ms   13.1ms      72.3    45.8MB     67.5
#> 4 using_set       12.2ms   13.2ms      73.4    45.8MB    165.

bench::mark(
  copy(dt)[, a := a + 2],
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, , j = 1L, dt_copy[['a']] + 2)
    dt_copy
  }
)
#> # A tibble: 2 x 6
#>   expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)[, `:=`(a, a + 2)]   6.66ms    6.8ms      138.    22.9MB     71.2
#> 2 using_set                    8.23ms   8.76ms      105.    30.5MB     42.2

I think mem_alloc is not exactly what I'm after:

mem_alloc - bench_bytes Total amount of memory allocated by R while running the expression. Memory allocated outside the R heap, e.g. by malloc() or new directly is not tracked, take care to avoid misinterpreting the results if running code that may do this.

First, regarding the caveat about malloc(): assign.c uses it a lot:

grep -r "alloc[(]" src/assign.c
src/assign.c:    buf = (int *) R_alloc(length(cols), sizeof(int));
src/assign.c:      char *s4 = (char*) malloc(strlen(c1) + 3);
src/assign.c:        char *s5 = (char*) malloc(strlen(tc2) + 5); //4 * '_' + \0
src/assign.c:    int *tt = (int *)R_alloc(ndelete, sizeof(int));
src/assign.c:          SEXP *temp = (SEXP *)malloc(nAdd * sizeof(SEXP *));
src/assign.c:  saveds = (SEXP *)malloc(nalloc * sizeof(SEXP));
src/assign.c:  savedtl = (R_len_t *)malloc(nalloc * sizeof(R_len_t));
src/assign.c:    char *tmp = (char *)realloc(saveds, nalloc*sizeof(SEXP));
src/assign.c:    tmp = (char *)realloc(savedtl, nalloc*sizeof(R_len_t));

Second, it's tracking _total_ memory allocation, whereas what I had in mind is _peak_ memory allocation. Rather than create one big 4GB list and then pass it to assign, I want to go 40MB at a time for 100 columns (say): total allocation is roughly the same, but at peak we only need 40MB of RAM.
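One way to see the total-vs-peak distinction in R itself (a rough sketch; note that gc()'s counters also only cover the R heap, so C-level malloc() stays invisible either way):

```r
# gc(reset = TRUE) resets the "max used" counters; a later gc() then
# reports *peak* R heap usage since the reset -- unlike mem_alloc,
# which sums every allocation made along the way.
gc(reset = TRUE)
tmp = lapply(1:100, function(i) rnorm(1e5))  # build the whole list at once
rm(tmp)
peak_all_at_once = gc()[, "max used"]

gc(reset = TRUE)
for (i in 1:100) x = rnorm(1e5)  # one chunk at a time, discarding as we go
peak_chunked = gc()[, "max used"]

# peak_all_at_once should be far larger than peak_chunked,
# even though total allocation is roughly the same in both runs
```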

The update by reference allows users to access previous groupings in by. If your idea of converting lapply(...) to set(...) were extended to Map(...) and/or mapply(...), users would also be able to access values assigned in previous iterations.

Here's a shortened version of this SO post:

library(data.table)
ops = fread(" RES FUN VAR1 VAR2    
                 P1 fx1 var1 var2
                 P2 fx2 P1 var1
                 P3 fx1 P2 var2 ")

vals = fread("ID var1 var2 P1 P2 P3
              1 2 1 0 0 0 
              2 3 10 0 0 0
              4 6 1 0 0 0 ")

fx1 = function(x, y) x * 2
fx2 = function(x, y) x / y / x

vals[, (ops[["RES"]]) := Map(function(fx, v1, v2)
  do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
  ops[['FUN']], ops[['VAR1']], ops[['VAR2']])]
vals

## OOPS! P2 and P3 are unexpected!
      ID  var1  var2    P1    P2    P3
   <int> <int> <int> <num> <num> <num>
1:     1     2     1     4   NaN     0
2:     2     3    10     6   NaN     0
3:     4     6     1    12   NaN     0

The current way to address this in data.table, as proposed by chinsoon22, is to use a by statement to get access to prior groupings. The downside is that this allocates more memory than necessary because the data.table repeats:

vals = fread("ID var1 var2 P1 P2 P3
              1 2 1 0 0 0 
              2 3 10 0 0 0
              4 6 1 0 0 0 ")
vals[, c("P1", "P2", "P3") := lapply(.SD, as.numeric), .SDcols = c("P1", "P2", "P3")]
ops[,
    vals[,
         (RES) := Map(function(fx, v1, v2) do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])), FUN, VAR1, VAR2)
         ]
    , by = seq_len(nrow(ops))
    ]

## we tripled the memory needed
   seq_len    ID  var1  var2    P1        P2        P3
     <int> <int> <int> <int> <num>     <num>     <num>
1:       1     1     2     1     4 0.0000000 0.0000000
2:       1     2     3    10     6 0.0000000 0.0000000
3:       1     4     6     1    12 0.0000000 0.0000000
4:       2     1     2     1     4 0.5000000 0.0000000
5:       2     2     3    10     6 0.3333333 0.0000000
6:       2     4     6     1    12 0.1666667 0.0000000
7:       3     1     2     1     4 0.5000000 1.0000000
8:       3     2     3    10     6 0.3333333 0.6666667
9:       3     4     6     1    12 0.1666667 0.3333333

##when this is all we wanted
vals
      ID  var1  var2    P1        P2        P3
   <int> <int> <int> <num>     <num>     <num>
1:     1     2     1     4 0.5000000 1.0000000
2:     2     3    10     6 0.3333333 0.6666667
3:     4     6     1    12 0.1666667 0.3333333

Just something to consider, the main downside being that these approaches are less thread-safe.

The downside is that this allocates more memory than necessary because the dt repeats:

Maybe return NULL in j to reduce memory usage?
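i.e., something like this variant of the Map example earlier in the thread (same ops/vals tables as above):

```r
# Wrapping j in { ...; NULL } still updates vals by reference, but each
# group returns NULL, so [.data.table has no grouped result to assemble
ops[, {
  vals[, (RES) := Map(function(fx, v1, v2)
    do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
    FUN, VAR1, VAR2)]
  NULL
}, by = seq_len(nrow(ops))]
vals
```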

The current way to address in the context of data.table, as proposed by chinsoon22, is to use a by statement to have access of prior groupings.

Btw, I am not the first to do this; I learned it from an SO post by Matt. Also, there is a recent regression affecting his approach: it no longer works without a copy first. See #4184.

Not to hijack this too much, but: returning NULL helps a lot, and a for loop with set really shines in performance. I am still uncertain about bench::mark and malloc, but below are some benchmarks.

## 3 rows
# A tibble: 4 x 13
  expression      min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>   <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
1 by_no_null   2.39ms 2.69ms      360.    73.5KB     2.04
2 by_with_null 2.51ms 2.87ms      286.    72.9KB     2.06
3 for_loop_set 4.45ms 4.67ms      209.    11.1KB     2.03
4 for_loop_dt   6.5ms 6.88ms      143.    61.8KB     4.15

## 300 rows
# A tibble: 4 x 13
  expression      min median `itr/sec` mem_alloc
  <bch:expr>   <bch:> <bch:>     <dbl> <bch:byt>
1 by_no_null   6.14ms 6.71ms     143.    227.9KB
2 by_with_null 6.45ms 7.29ms      80.9   102.9KB
3 for_loop_set 4.52ms  4.7ms     163.     25.4KB
4 for_loop_dt   6.5ms  6.9ms     145.       69KB

##30,000 rows
# A tibble: 4 x 13
  expression        min   median `itr/sec` mem_alloc
  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 by_no_null   455.38ms 455.52ms      2.20   14.68MB
2 by_with_null 435.25ms 451.06ms      2.22    2.54MB
3 for_loop_set   5.18ms   5.46ms    169.      1.38MB
4 for_loop_dt    7.14ms   7.53ms    132.    765.05KB


Code to reproduce above

library(data.table)
ops = fread(" RES FUN VAR1 VAR2
                 P1 fx1 var1 var2
                 P2 fx2 P1 var1
                 P3 fx1 P2 var2 ")

vals = fread("ID var1 var2 P1 P2 P3
              1 2 1 0 0 0 
              2 3 10 0 0 0
              4 6 1 0 0 0 ")

vals = vals[rep(seq_len(.N), 10000L)] ##use or ignore

vals[, paste0("P", 1:3) := lapply(.SD, as.numeric), .SDcols=paste0("P", 1L:3L)]

bench::mark(
  by_no_null = {
    ops[,
        vals[, (RES) := as.numeric(mapply(function(x, y)
          match.fun(FUN)(x, y),
          get(VAR1), get(VAR2)))],
        1L:nrow(ops)]
    vals
  },
  by_with_null = {
    ops[, {
      vals[, (RES) := as.numeric(mapply(function(x, y)
        match.fun(FUN)(x, y),
        get(VAR1), get(VAR2)))]
      NULL
    },
    1L:nrow(ops)]
    vals
  },
  for_loop_set = {
    res <- ops[['RES']]
    fx <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      set(vals, j = res[i], value = do.call(fx[i], list(vals[[var1[i]]], vals[[var2[i]]])))
    }
    vals
  } ,
  for_loop_dt = {
    res <- ops[['RES']]
    fx <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      vals[, (res[i]) := do.call(fx[i], unname(.SD)), .SDcols = c(var1[i], var2[i])]
    }
    vals
  }
)
