Users can already use dplyr::summarise_all on .SD. But two people said it would still be nice to see built in to data.table :
https://twitter.com/Pertplus1/status/1151090599648550912
https://twitter.com/ftvalen/status/1151099177402818561
Next step would be to benchmark: https://twitter.com/MattDowle/status/1151959251029389312
Related (if I understand what summarise_all does): https://github.com/Rdatatable/data.table/issues/782 and https://github.com/Rdatatable/data.table/issues/1063 as colwise()
Here goes an example using
A. dplyr with summarise_at
B. datatable with lapply
C. datatable with summarise_all
B seems to do the trick -- C is by far the slowest solution. An example:
library(magrittr)
library(purrr)
library(data.table)
library(dplyr)
library(tictoc)
library(devtools)
# nrow of data
n = 1e6
# number of columns by type
p_int = 100
p_double = 100
p_char = 30
# create columns
set.seed(100)
# grouping columns
id1 = sample(letters, n, rep=T)
id2 = sample(letters, n, rep=T)
# filter columns
f = sample(12, n, rep=T)
# integer columns
ints = matrix(sample(100L, p_int*n, rep=T), ncol=p_int) %>% as.data.table()
names(ints) = paste0("int",seq_along(ints))
# double columns
doubles = (matrix(sample(100, p_double*n, rep=T), ncol=p_double)/7) %>% as.data.table()
names(doubles) = paste0("double",seq_along(doubles))
# character columns
chars = matrix(sample(letters, p_char*n, rep=T), ncol=p_char) %>% as.data.table()
names(chars) = paste0("char",seq_along(chars))
# bind columns
dat = cbind(id1, id2, f, ints, doubles, chars)
# columns to summarise
cols_sum = map_lgl(dat, is.numeric) %>% which() %>% names()
# summarise with mean,max,sum
tic.clearlog()
# A: dplyr
tic("A")
out_a = as_tibble(dat) %>%
dplyr::filter(f < 6) %>%
group_by(id1, id2) %>%
summarise_at(cols_sum, list(max = max, sum = sum, mean = mean))
toc(log=T)
# B: datatable
tic("B")
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))
out_b = dat[f < 6,
by = .(id1,id2),
as.list(unlist(lapply(.SD, func_sum))),
.SDcols = cols_sum]
toc(log=T)
# C: datatable with summarise_all
tic("C")
out_c = dat[f < 6,
by = .(id1,id2),
summarise_all(.SD, list(max = max, sum = sum, mean = mean)),
.SDcols = cols_sum]
toc(log=T)
# time
tic.log(format = TRUE)
Session info:
> devtools::session_info()
Session info ---------------------------------------------------
setting value
version R version 3.6.0 (2019-04-26)
system x86_64, mingw32
ui RStudio (1.2.1335)
language (EN)
collate Spanish_Argentina.1252
tz America/Buenos_Aires
date 2019-07-18
Packages -------------------------------------------------------
package * version date source
base * 3.6.0 2019-04-26 local
compiler 3.6.0 2019-04-26 local
datasets * 3.6.0 2019-04-26 local
devtools * 1.13.6 2018-06-27 CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 CRAN (R 3.5.1)
graphics * 3.6.0 2019-04-26 local
grDevices * 3.6.0 2019-04-26 local
memoise 1.1.0 2017-04-21 CRAN (R 3.5.1)
methods * 3.6.0 2019-04-26 local
packrat 0.4.9-3 2018-06-01 CRAN (R 3.5.1)
stats * 3.6.0 2019-04-26 local
tools 3.6.0 2019-04-26 local
utils * 3.6.0 2019-04-26 local
withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
>
Thanks @ftvalentini - that's beyond my wildest dreams of reproducibility! Just one thing: can you add the timings that you see too please. That way, when we reproduce it in some weeks time, we know we get the same ballpark timings as you.
running times:
> tic.log(format = TRUE)
[[1]]
[1] "A: 21.08 sec elapsed"
[[2]]
[1] "B: 5.25 sec elapsed"
[[3]]
[1] "C: 868.97 sec elapsed"
Note: I had to stop execution of C before it ended :/
dplyr seems to be deprecating summarise_all so not sure if we still want to have this... we could eventually use their API for same functionality, whatever it is now, but it is a little bit risky as it may eventually get deprecated as well.