This issue follows a discussion during useR!2019 after the presentation of data.table by Arun @arunsrinivasan
Hello,
Thanks for the amazing job, I love data.table!
I am using uniqueN to verify l-diversity for anonymization purposes.
The data I am working with is around 30M rows, which data.table ingests easily.
Unfortunately, uniqueN is not as fast as other functions.
I tried to parallelize the grouping using setDTthreads, as I can go up to 16 threads on my RStudio Server instance.
First I get a baseline benchmark using a simple sum over a numeric column.

Then I do basically the same thing, but apply uniqueN over a character column (a factor would give the same results).
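The comparison described above can be sketched roughly as follows (a minimal illustration, not the actual reprex; the column names, group cardinality, and the scaled-down row count are assumptions):

```r
library(data.table)

setDTthreads(16L)          # the server instance has 16 threads available

n <- 1e6L                  # scaled down from the ~30M rows in the issue
set.seed(1L)
DT <- data.table(
  grp = sample(1e4L, n, TRUE),                       # grouping key
  num = runif(n),                                    # numeric column for the baseline
  chr = sample(sprintf("id%04d", 1:3000), n, TRUE)   # character column for uniqueN
)

# Baseline: sum() in j is GForce-optimised, so grouping runs entirely in C
system.time(DT[, sum(num), by = grp])

# Much slower: uniqueN() in j is not GForce-optimised,
# so j is re-evaluated in R once per group
system.time(DT[, uniqueN(chr), by = grp])
```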

Here is the code for a reprex: https://github.com/phileas-condemine/repex_slow_uniqueN/blob/master/repex_slow_uniqueN.R
Additional info:
> library(data.table)
data.table 1.12.2 using 8 threads (see ?getDTthreads).
Here is my sessionInfo():
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8
[4] LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.0.0 data.table_1.12.2 RevoUtils_11.0.1 RevoUtilsMath_11.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.7 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.4 munsell_0.5.0
[7] colorspace_1.3-2 R6_2.2.2 rlang_0.4.0 plyr_1.8.4 dplyr_0.7.6 tools_3.5.1
[13] grid_3.5.1 gtable_0.2.0 withr_2.1.2 lazyeval_0.2.1 assertthat_0.2.0 tibble_1.4.2
[19] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5 glue_1.3.0 compiler_3.5.1 pillar_1.3.0
[25] scales_0.5.0 pkgconfig_2.0.1
And the lscpu output:
> system("lscpu")
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Stepping: 4
CPU MHz: 3291.927
BogoMIPS: 6584.13
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm kaiser arat
There are a couple of things here.
First point: uniqueN is parallelised, but its usage in j is not optimised to use GForce.
When you use .N in j:
require(data.table)
foo <- function(n = 3e8) {
  card <- 3000
  chars <- substr(openssl::sha2(as.character(1:card)), 1L, 5L)
  dist <- runif(card)
  DT <- data.table(
    A = sample(chars, n, TRUE, dist),
    B = sample(chars, n, TRUE, dist)
  )
  DT
}
set.seed(1L)
DT <- foo(5e7L)
DT[, .N, by=B, verbose=TRUE]
# Detected that j uses these columns: <none>
# Finding groups using forderv ... 0.687s elapsed (1.131s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
# Getting back original order ... 0.001s elapsed (0.001s cpu)
# lapply optimization is on, j unchanged as '.N'
# GForce optimized j to '.N' ### <~~~~~~~~
# Making each group and running j (GForce TRUE) ... 1.301s elapsed (1.651s cpu)
# B N
# 1: 71ee4 18647
# 2: b1718 31722
# 3: 2c1f3 33496
# 4: 13b3f 31041
# 5: 12132 19033
# ---
# 2994: 46635 20
# 2995: 5787a 23
# 2996: 7611f 57
# 2997: c30c6 39
# 2998: a8a2c 23
You can see that the expression is optimised to use GForce. This avoids having to switch between R and C when evaluating the expression in j for each group, which is quite inefficient.
Similarly, we need to optimise uniqueN so it can be used along with by efficiently.
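For contrast, running uniqueN in j with verbose=TRUE shows the unoptimised path (a small self-contained sketch; the toy data is an assumption, and the verbose output is omitted since it varies):

```r
library(data.table)
set.seed(1L)
DT <- data.table(A = sample(letters, 1e5L, TRUE),
                 B = sample(letters, 1e5L, TRUE))

# verbose=TRUE reports whether GForce kicked in; unlike .N or sum(),
# uniqueN() in j currently falls back to per-group evaluation in R
res <- DT[, uniqueN(A), by = B, verbose = TRUE]
```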
Second point: even then, GForce isn't implemented for use with := yet. It's only used for aggregation operations that return a length-1 vector. We'll need to extend GForce to also optimise :=.
When both of these are done, things should speed up.
Until then, the best way to go about this (not benchmarked) would be:
unique(DT, by=c("A", "B"))[, .N, by=B]
I think this does what you want, but of course it returns an aggregated result, which means you'll have to join and update back into your original data.table. A bit roundabout, but if it's more performant than your current solution, perhaps you could write a wrapper function as a stopgap until this is implemented.
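That join-and-update step might look like this (a sketch; the column name ldiv and the toy data are assumptions):

```r
library(data.table)
set.seed(1L)
DT <- data.table(A = sample(letters[1:5], 100L, TRUE),
                 B = sample(letters[1:5], 100L, TRUE))

# Count the distinct A values per B via the suggested workaround:
# deduplicate on (A, B) first, so .N per B is the number of distinct A
agg <- unique(DT, by = c("A", "B"))[, .N, by = B]

# Update-join the per-group count back onto the original table
DT[agg, ldiv := i.N, on = "B"]
```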
@arunsrinivasan The second part of the title looks like a dupe of https://github.com/Rdatatable/data.table/issues/1414
The first part is a dupe of https://github.com/Rdatatable/data.table/issues/1120
I agree that both parts are dupes; closing this as a duplicate. But it would be nice to raise the priority on this one, since there have been several uniqueN performance related issues of late.