Data.table: summarise_all requested

Created on 18 Jul 2019 · 5Comments · Source: Rdatatable/data.table

Users can already use dplyr::summarise_all on .SD. But two people said it would still be nice to see built in to data.table :
https://twitter.com/Pertplus1/status/1151090599648550912
https://twitter.com/ftvalen/status/1151099177402818561
Next step would be to benchmark: https://twitter.com/MattDowle/status/1151959251029389312

feature request

Source

mattdowle

All 5 comments

Related (if I understand what summarise_all does): https://github.com/Rdatatable/data.table/issues/782 and https://github.com/Rdatatable/data.table/issues/1063 as colwise()

franknarf1 on 18 Jul 2019

👍1

Here goes an example using
A. dplyr with summarise_at
B. datatable with lapply
C. datatable with summarise_all

B seems to do the trick -- C is by far the slowest solution. An example:

library(magrittr)
library(purrr)
library(data.table)
library(dplyr)
library(tictoc)
library(devtools)

# nrow of data
n = 1e6
# number of columns by type
p_int = 100
p_double = 100
p_char = 30

# create columns
set.seed(100)
# grouping columns
id1 = sample(letters, n, rep=T)
id2 = sample(letters, n, rep=T)
# filter columns
f = sample(12, n, rep=T)
# integer columns
ints = matrix(sample(100L, p_int*n, rep=T), ncol=p_int) %>% as.data.table()
names(ints) = paste0("int",seq_along(ints))
# double columns
doubles = (matrix(sample(100, p_double*n, rep=T), ncol=p_double)/7) %>% as.data.table() 
names(doubles) = paste0("double",seq_along(doubles))
# character columns
chars = matrix(sample(letters, p_char*n, rep=T), ncol=p_char) %>% as.data.table() 
names(chars) = paste0("char",seq_along(chars))

# bind columns
dat = cbind(id1, id2, f, ints, doubles, chars) 

# columns to summarise
cols_sum = map_lgl(dat, is.numeric) %>% which() %>% names()


# summarise with mean,max,sum

tic.clearlog()
# A: dplyr
tic("A")
out_a = as_tibble(dat) %>% 
  dplyr::filter(f < 6) %>% 
  group_by(id1, id2) %>%
  summarise_at(cols_sum, list(max = max, sum = sum, mean = mean))
toc(log=T)

# B: datatable
tic("B")
func_sum = function(x) c(max = max(x), sum = sum(x), mean = mean(x))
out_b = dat[f < 6,
          by = .(id1,id2),
          as.list(unlist(lapply(.SD, func_sum))),
          .SDcols = cols_sum]
toc(log=T)

# C: datatable with summarise_all
tic("C")
out_c = dat[f < 6,
          by = .(id1,id2),
          summarise_all(.SD, list(max = max, sum = sum, mean = mean)),
          .SDcols = cols_sum]
toc(log=T)

# time
tic.log(format = TRUE)

Session info:

> devtools::session_info()
Session info ---------------------------------------------------
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 system   x86_64, mingw32             
 ui       RStudio (1.2.1335)          
 language (EN)                        
 collate  Spanish_Argentina.1252      
 tz       America/Buenos_Aires        
 date     2019-07-18                  

Packages -------------------------------------------------------
 package   * version date       source        
 base      * 3.6.0   2019-04-26 local         
 compiler    3.6.0   2019-04-26 local         
 datasets  * 3.6.0   2019-04-26 local         
 devtools  * 1.13.6  2018-06-27 CRAN (R 3.5.1)
 digest      0.6.18  2018-10-10 CRAN (R 3.5.1)
 graphics  * 3.6.0   2019-04-26 local         
 grDevices * 3.6.0   2019-04-26 local         
 memoise     1.1.0   2017-04-21 CRAN (R 3.5.1)
 methods   * 3.6.0   2019-04-26 local         
 packrat     0.4.9-3 2018-06-01 CRAN (R 3.5.1)
 stats     * 3.6.0   2019-04-26 local         
 tools       3.6.0   2019-04-26 local         
 utils     * 3.6.0   2019-04-26 local         
 withr       2.1.2   2018-03-15 CRAN (R 3.5.0)
>

ftvalentini on 19 Jul 2019

👍1

Thanks @ftvalentini - that's beyond my wildest dreams of reproducibility! Just one thing: can you add the timings that you see too please. That way, when we reproduce it in some weeks time, we know we get the same ballpark timings as you.

mattdowle on 19 Jul 2019

running times:

> tic.log(format = TRUE)
[[1]]
[1] "A: 21.08 sec elapsed"

[[2]]
[1] "B: 5.25 sec elapsed"

[[3]]
[1] "C: 868.97 sec elapsed"

Note: I had to stop execution of C before it ended :/

ftvalentini on 19 Jul 2019

👍1

dplyr seems to be deprecating summarise_all so not sure if we still want to have this... we could eventually use their API for same functionality, whatever it is now, but it is a little bit risky as it may eventually get deprecated as well.