The example below shows how the default number of threads on my machine (18) makes data.table much slower than running single-threaded.
This is related to, but not exactly the same as, the following issues:
# Minimal reproducible example
> NN = 1e5
> set.seed(1)
> DT = data.table(grp1 = as.character(rep(1:(NN/4),each = 4)),
+ grp2 = sample(5000L,NN,TRUE),
+ V = rpois(NN, 10))
>
> setDTthreads(18)
> system.time(DT[ , log(sum(V)), by = grp1,verbose=TRUE])
Detected that j uses these columns: V
Finding groups using forderv ... forder.c received 100000 rows and 1 columns
0.710s elapsed (0.210s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 25000
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'log(sum(V))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 17.674s for 25000 groups
eval(j) took 0.072s for 25000 calls
17.8s elapsed (8.590s cpu)
user system elapsed
8.80 27.58 18.49
> system.time(DT[ , log(sum(V)), by = .(grp1,grp2),verbose=TRUE])
Detected that j uses these columns: V
Finding groups using forderv ... forder.c received 100000 rows and 2 columns
0.700s elapsed (0.110s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 99974
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'log(sum(V))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 69.031s for 99974 groups
eval(j) took 0.176s for 99974 calls
00:01:09 elapsed (31.5s cpu)
user system elapsed
31.64 105.61 70.11
>
> setDTthreads(1)
> system.time(DT[ , log(sum(V)), by = grp1,verbose=TRUE])
Detected that j uses these columns: V
Finding groups using forderv ... forder.c received 100000 rows and 1 columns
0.040s elapsed (0.010s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'log(sum(V))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
memcpy contiguous groups took 0.004s for 25000 groups
eval(j) took 0.013s for 25000 calls
0.020s elapsed (0.020s cpu)
user system elapsed
0.03 0.03 0.06
> system.time(DT[ , log(sum(V)), by = .(grp1,grp2),verbose=TRUE])
Detected that j uses these columns: V
Finding groups using forderv ... forder.c received 100000 rows and 2 columns
0.050s elapsed (0.020s cpu)
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu)
Getting back original order ... forder.c received a vector type 'integer' length 99974
0.000s elapsed (0.000s cpu)
lapply optimization is on, j unchanged as 'log(sum(V))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ...
collecting discontiguous groups took 3.048s for 99974 groups
eval(j) took 0.031s for 99974 calls
3.140s elapsed (1.360s cpu)
user system elapsed
1.38 1.81 3.19
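Note from the verbose output above that GForce stays off ("GForce FALSE") because j is log(sum(V)) rather than a bare aggregate, and that the slow multi-threaded path is "collecting discontiguous groups", while the single-threaded run takes the fast "memcpy contiguous groups" path. A minimal sketch of two possible workarounds, assuming the same DT as above (illustrative only, not part of the original benchmark):
# 1) Keep j GForce-eligible: let GForce compute the per-group sum, then take
#    the log of the result in a second step.
system.time(DT[, .(s = sum(V)), by = grp1][, s := log(s)])
# 2) Sort by the grouping column first so that groups are contiguous and the
#    memcpy path is taken instead of "collecting discontiguous groups".
setkey(DT, grp1)
system.time(DT[, log(sum(V)), by = grp1])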
# Output of sessionInfo()
> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] extrafont_0.17 stargazer_5.2.2 lfe_2.8-5 Matrix_1.2-18 lmtest_0.9-37 sandwich_2.5-1 texreg_1.36.23 knitr_1.28
[9] kableExtra_1.1.0 gridExtra_2.3 ggridges_0.5.2 lubridate_1.7.4 scales_1.1.0 psych_1.9.12.31 zoo_1.8-7 forcats_0.4.0
[17] stringr_1.4.0 dplyr_0.8.4 purrr_0.3.3 readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[25] data.table_1.12.9
loaded via a namespace (and not attached):
[1] httr_1.4.1 jsonlite_1.6.1 viridisLite_0.3.0 modelr_0.1.5 Formula_1.2-3 assertthat_0.2.1 highr_0.8 cellranger_1.1.0
[9] Rttf2pt1_1.3.8 pillar_1.4.3 backports_1.1.5 lattice_0.20-38 glue_1.3.1 extrafontdb_1.0 digest_0.6.23 rvest_0.3.5
[17] colorspace_1.4-1 htmltools_0.4.0 plyr_1.8.5 pkgconfig_2.0.3 broom_0.5.4 haven_2.2.0 xtable_1.8-4 webshot_0.5.2
[25] generics_0.0.2 withr_2.1.2 lazyeval_0.2.2 cli_2.0.1 mnormt_1.5-6 magrittr_1.5 crayon_1.3.4 readxl_1.3.1
[33] evaluate_0.14 fs_1.3.1 fansi_0.4.1 nlme_3.1-143 xml2_1.2.2 tools_3.6.2 hms_0.5.3 lifecycle_0.1.0
[41] munsell_0.5.0 reprex_0.3.0 compiler_3.6.2 rlang_0.4.4 grid_3.6.2 rstudioapi_0.10 rmarkdown_2.1 gtable_0.3.0
[49] DBI_1.1.0 R6_2.4.1 stringi_1.4.4 parallel_3.6.2 Rcpp_1.0.3 vctrs_0.2.2 dbplyr_1.4.2 tidyselect_1.0.0
[57] xfun_0.12
I suspect it is due to copying all of the data to each of the 18 clusters.
This is largely a duplicate of #4200. I am making some additional notes on that issue.
@chinsoon12 AFAIK there is no data copying in OpenMP. The overhead comes from creating a team of threads for every single group.
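A minimal sketch of that effect, assuming a fresh session (DT2, g_few, and g_many are hypothetical names for this illustration): the amount of data is fixed and only the number of groups changes, so per-group thread-team creation dominates as the groups multiply.
library(data.table)
setDTthreads(8L)
NN = 1e5
DT2 = data.table(g_few  = rep(1:10, each = NN/10),  # 10 large groups
                 g_many = rep(1:(NN/4), each = 4),  # 25,000 small groups
                 V      = rpois(NN, 10))
# log(sum(V)) is not GForce-eligible, so the per-group code path is exercised.
system.time(DT2[, log(sum(V)), by = g_few])   # overhead paid ~10 times
system.time(DT2[, log(sum(V)), by = g_many])  # overhead paid ~25,000 times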
This seems to be resolved by #4558, as the timing is now equal for 1 thread vs. 8 threads. Testing was done on Windows 10 / R 4.0.2 / data.table 1.13.0:
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
#> user system elapsed
#> 0.06 0.03 0.08
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
#> user system elapsed
#> 0.08 0.00 0.08
Feel free to re-open.
@ColeMiller1 I am not sure it qualifies to be closed in my case. I ran it twice, reversing the order, to ensure warm-up was not affecting the timings.
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.295 0.009 0.063
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.04 0.00 0.04
q("no")
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.049 0.000 0.048
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.351 0.000 0.061
q("no")
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.0.2
Another machine:
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 1.131 0.000 0.069
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.035 0.000 0.035
q("no")
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.033 0.001 0.033
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 1.132 0.007 0.061
q("no")
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_IN.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_IN.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_IN.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3
Sorry about that, Jan.
It's interesting that this is now good on Windows but still bad on Linux. I ran it twice as well, but there was no real difference. It would be useful if your major benchmark project could surface cross-platform differences, not to error out but just to highlight these types of things.
No worries. If you have access to a 32-40 core Windows machine, it would be useful to check there as well. IMO benchmarking on Windows is low priority because this OS should not be used for any serious work other than gaming ;)
I can check on a high-core-count Windows machine sometime in the next few days.
I tested on a 64-core Windows PC. The performance is now approximately the same with 1 and 40 threads:
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.04 0.00 0.04
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.03 0.00 0.03
## Restart R
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.03 0.00 0.03
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.05 0.06 0.04
##
## Restart R
##
library(data.table)
N = 1e6L
set.seed(108)
d = data.table(id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))), # 9e5 unq values
               v1 = sample(5L, N, TRUE),
               v2 = sample(5L, N, TRUE))
setDTthreads(40L)
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.12 0.05 1.06
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1 0 1
setDTthreads(1L)
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.03 0.03 1.06
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.03 0.00 1.03
Sys.getenv("NUMBER_OF_PROCESSORS")
# [1] "64"
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.13.2
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2