Would this be possible/advisable even in grouping settings?
I have in mind doing `lapply(.SD, ...)` for a `.SD` of 100s or 1000s of columns -- fundamentally, if I'm not mistaken, we have to store the whole output in memory before assigning, so the `for` loop approach would be much more memory-efficient.
Ideally it'd be done without any speed impact. Any considerations I'm missing?
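To make the tradeoff concrete, here is a minimal sketch (the column set and the `+ 1` operation are purely illustrative) of the two approaches being compared:

```r
library(data.table)

dt = data.table(a = rnorm(1e6), b = rnorm(1e6))
cols = names(dt)

# all-at-once: lapply(.SD, ...) materialises every result column before
# := assigns them, so all the temporaries are alive simultaneously
dt[, (cols) := lapply(.SD, `+`, 1), .SDcols = cols]

# one-at-a-time: only a single temporary vector exists at any moment, so
# peak memory is roughly one column's worth regardless of ncol(dt)
for (col in cols) set(dt, j = col, value = dt[[col]] + 1)
```

Both end states are identical; the question is whether the internal `lapply(.SD, ...)` path could be made to behave like the loop without losing speed.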
Another issue asks for this, also mentioning support for doing it by group; see #1441.
Sounds interesting.
In trying to test this idea, I assumed that ``dt[, `:=`(a = a + 1, b = b + 1)]`` would be more memory-efficient than ``dt[, names(dt) := lapply(.SD, '+', 1)]``. It was not. The only case where memory allocation improved was `dt[, single_var := single_var + 1]`.
Take this with a grain of salt -- I used `bench::mark`, which I have heard may have trouble measuring memory allocations.
Benchmark data
```r
library(data.table)
dt = data.table(a = rnorm(1e6),
                b = rnorm(1e6))
bench::mark(copy(dt))
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)     4.19ms    4.6ms      202.    15.3MB     242.
bench::mark(
  new_cols_functional = copy(dt)[, `:=`(a_new = a + 2, b_new = b + 2)],
  new_cols_lhs_assign = copy(dt)[, paste(names(dt), 'new', sep = '_') := .(a + 2, b + 2)],
  new_cols_lapply     = copy(dt)[, paste(names(dt), 'new', sep = '_') := lapply(.SD, '+', 2)]
)
#> # A tibble: 3 x 6
#>   expression               min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 new_cols_functional     14ms   15.4ms      63.7    47.3MB     170.
#> 2 new_cols_lhs_assign   13.2ms   14.9ms      61.4    45.8MB     102.
#> 3 new_cols_lapply       13.1ms   13.8ms      68.4    45.8MB     137.
bench::mark(
  functional    = copy(dt)[, `:=`(a = a + 2, b = b + 2)],
  lhs_assign    = copy(dt)[, c('a', 'b') := .(a + 2, b + 2)],
  lapply_assign = copy(dt)[, names(dt) := lapply(.SD, '+', 2)],
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, j = 1L, value = dt_copy[['a']] + 2)
    set(dt_copy, j = 2L, value = dt_copy[['b']] + 2)
    dt_copy
  }
)
#> # A tibble: 4 x 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 functional     12.9ms     13ms      73.0    45.8MB     92.9
#> 2 lhs_assign     12.8ms   13.1ms      71.2    45.8MB     71.2
#> 3 lapply_assign  12.9ms   13.1ms      72.3    45.8MB     67.5
#> 4 using_set      12.2ms   13.2ms      73.4    45.8MB     165.
bench::mark(
  copy(dt)[, a := a + 2],
  using_set = {
    dt_copy = copy(dt)
    set(dt_copy, j = 1L, value = dt_copy[['a']] + 2)
    dt_copy
  }
)
#> # A tibble: 2 x 6
#>   expression                      min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <bch:tm> <bch:tm>    <dbl> <bch:byt>    <dbl>
#> 1 copy(dt)[, `:=`(a, a + 2)]   6.66ms   6.8ms     138.    22.9MB     71.2
#> 2 using_set                    8.23ms  8.76ms     105.    30.5MB     42.2
```
I think `mem_alloc` is not exactly what I'm after:

> **mem_alloc** (`bench_bytes`): Total amount of memory allocated by R while running the expression. Memory allocated outside the R heap, e.g. by `malloc()` or `new` directly, is not tracked, take care to avoid misinterpreting the results if running code that may do this.
First, regarding `malloc`: `assign.c` uses it a lot:
```sh
grep -r "alloc[(]" src/assign.c
src/assign.c:    buf = (int *) R_alloc(length(cols), sizeof(int));
src/assign.c:    char *s4 = (char*) malloc(strlen(c1) + 3);
src/assign.c:    char *s5 = (char*) malloc(strlen(tc2) + 5); // 4 * '_' + \0
src/assign.c:    int *tt = (int *)R_alloc(ndelete, sizeof(int));
src/assign.c:    SEXP *temp = (SEXP *)malloc(nAdd * sizeof(SEXP *));
src/assign.c:    saveds = (SEXP *)malloc(nalloc * sizeof(SEXP));
src/assign.c:    savedtl = (R_len_t *)malloc(nalloc * sizeof(R_len_t));
src/assign.c:    char *tmp = (char *)realloc(saveds, nalloc*sizeof(SEXP));
src/assign.c:    tmp = (char *)realloc(savedtl, nalloc*sizeof(R_len_t));
```
Second, it's tracking _total_ memory allocation, whereas what I had in mind is _peak_ memory allocation -- rather than create a big 4GB list and then pass it to assign, I want to go 40MB at a time for 100 columns (say), so total memory allocation is roughly the same, but at peak we only need 40MB of RAM.
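One rough way to observe the peak (rather than total) allocation from inside R is `gc(reset = TRUE)`, whose "max used" column reports the high-water mark since the last reset. A sketch, assuming a 100-column table:

```r
library(data.table)
dt = data.table(matrix(rnorm(1e6), ncol = 100))
cols = names(dt)

gc(reset = TRUE)
dt[, (cols) := lapply(.SD, `+`, 1), .SDcols = cols]
gc()  # "max used" reflects all 100 temporary result columns alive at once

gc(reset = TRUE)
for (col in cols) set(dt, j = col, value = dt[[col]] + 1)
gc()  # "max used" is roughly a single temporary column
```

This only tracks the R heap, so it has the same blind spot around `malloc` as `bench::mark`, but it does distinguish peak from total.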
Update by reference lets users access values computed for previous groupings in `by`. If your idea of translating `lapply(...)` into `set(...)` were extended to `Map(...)` and/or `mapply(...)`, users would also be able to access previously assigned values.
Here's a shortened version of this SO post:
```r
library(data.table)
ops = fread("RES  FUN  VAR1  VAR2
             P1   fx1  var1  var2
             P2   fx2  P1    var1
             P3   fx1  P2    var2")
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
fx1 = function(x, y) x * 2
fx2 = function(x, y) x / y / x
vals[, (ops[["RES"]]) := Map(function(fx, v1, v2)
  do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
  ops[['FUN']], ops[['VAR1']], ops[['VAR2']])]
vals
## OOPS! P2 and P3 are unexpected!
      ID  var1  var2    P1    P2    P3
   <int> <int> <int> <num> <num> <num>
1:     1     2     1     4   NaN     0
2:     2     3    10     6   NaN     0
3:     4     6     1    12   NaN     0
```
The current way to address this in the context of data.table, as proposed by chinsoon22, is to use a `by` statement to gain access to prior groupings. The downside is that this allocates more memory than necessary because the `dt` repeats:
```r
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
vals[, c("P1", "P2", "P3") := lapply(.SD, as.numeric), .SDcols = c("P1", "P2", "P3")]
ops[,
    vals[,
         (RES) := Map(function(fx, v1, v2)
           do.call(fx, unname(.SD[, c(v1, v2), with = FALSE])),
           FUN, VAR1, VAR2)
    ]
  , by = seq_len(nrow(ops))
]
## we tripled the memory needed
   seq_len    ID  var1  var2    P1        P2        P3
     <int> <int> <int> <int> <num>     <num>     <num>
1:       1     1     2     1     4 0.0000000 0.0000000
2:       1     2     3    10     6 0.0000000 0.0000000
3:       1     4     6     1    12 0.0000000 0.0000000
4:       2     1     2     1     4 0.5000000 0.0000000
5:       2     2     3    10     6 0.3333333 0.0000000
6:       2     4     6     1    12 0.1666667 0.0000000
7:       3     1     2     1     4 0.5000000 1.0000000
8:       3     2     3    10     6 0.3333333 0.6666667
9:       3     4     6     1    12 0.1666667 0.3333333
## when this is all we wanted
vals
      ID  var1  var2    P1        P2        P3
   <int> <int> <int> <num>     <num>     <num>
1:     1     2     1     4 0.5000000 1.0000000
2:     2     3    10     6 0.3333333 0.6666667
3:     4     6     1    12 0.1666667 0.3333333
```
Just something to consider, with the main downside being that this would make such operations less thread-safe.
> The downside is that this allocates more memory than necessary because the `dt` repeats

Maybe return NULL in j to reduce memory usage?
> The current way to address this in the context of data.table, as proposed by chinsoon22, is to use a `by` statement to gain access to prior groupings.
By the way, I am not the first to do this -- I learned it from an SO post by Matt. There is also a recent regression on his post: it no longer works without a copy first; see #4184.
Not to hijack this too much, but returning NULL helps a lot, and a loop with `set()` really shines in performance. I am still uncertain about how `bench::mark` handles `malloc`, but below are some benchmarks.
```
## 3 rows
# A tibble: 4 x 13
  expression       min median `itr/sec` mem_alloc `gc/sec`
  <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
1 by_no_null    2.39ms 2.69ms      360.    73.5KB     2.04
2 by_with_null  2.51ms 2.87ms      286.    72.9KB     2.06
3 for_loop_set  4.45ms 4.67ms      209.    11.1KB     2.03
4 for_loop_dt    6.5ms 6.88ms      143.    61.8KB     4.15

## 300 rows
# A tibble: 4 x 13
  expression       min median `itr/sec` mem_alloc
  <bch:expr>    <bch:> <bch:>     <dbl> <bch:byt>
1 by_no_null    6.14ms 6.71ms      143.   227.9KB
2 by_with_null  6.45ms 7.29ms      80.9   102.9KB
3 for_loop_set  4.52ms  4.7ms      163.    25.4KB
4 for_loop_dt    6.5ms  6.9ms      145.      69KB

## 30,000 rows
# A tibble: 4 x 13
  expression        min   median `itr/sec` mem_alloc
  <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 by_no_null   455.38ms 455.52ms      2.20   14.68MB
2 by_with_null 435.25ms 451.06ms      2.22    2.54MB
3 for_loop_set   5.18ms   5.46ms    169.      1.38MB
4 for_loop_dt    7.14ms   7.53ms    132.    765.05KB
```
Code to reproduce above
```r
library(data.table)
ops = fread("RES  FUN  VAR1  VAR2
             P1   fx1  var1  var2
             P2   fx2  P1    var1
             P3   fx1  P2    var2")
vals = fread("ID  var1  var2  P1  P2  P3
              1   2     1     0   0   0
              2   3     10    0   0   0
              4   6     1     0   0   0")
vals = vals[rep(seq_len(.N), 10000L)]  ## use or ignore
vals[, paste0("P", 1:3) := lapply(.SD, as.numeric), .SDcols = paste0("P", 1L:3L)]
bench::mark(
  by_no_null = {
    ops[,
        vals[, (RES) := as.numeric(mapply(function(x, y) match.fun(FUN)(x, y),
                                          get(VAR1), get(VAR2)))],
        1L:nrow(ops)]
    vals
  },
  by_with_null = {
    ops[, {
      vals[, (RES) := as.numeric(mapply(function(x, y) match.fun(FUN)(x, y),
                                        get(VAR1), get(VAR2)))]
      NULL
    },
    1L:nrow(ops)]
    vals
  },
  for_loop_set = {
    res  <- ops[['RES']]
    fx   <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      set(vals, j = res[i], value = do.call(fx[i], list(vals[[var1[i]]], vals[[var2[i]]])))
    }
    vals
  },
  for_loop_dt = {
    res  <- ops[['RES']]
    fx   <- ops[['FUN']]
    var1 <- ops[['VAR1']]
    var2 <- ops[['VAR2']]
    for (i in seq_len(nrow(ops))) {
      vals[, (res[i]) := do.call(fx[i], unname(.SD)), .SDcols = c(var1[i], var2[i])]
    }
    vals
  }
)
```