Data.table: uniqueN could be GForce optimised + GForce could be optimised for := too.

Created on 24 Jul 2019 · 4 Comments · Source: Rdatatable/data.table

This issue follows a discussion during useR!2019 after the presentation of data.table by Arun @arunsrinivasan

Hello,
Thanks for the amazing work, I love data.table!
I am using uniqueN to verify l-diversity for anonymization purposes.
The data I am working with is around 30M rows, easily ingested by data.table.
Unfortunately uniqueN is not as fast as other functions.
I tried to parallelize the grouping using setDTthreads, as I can go up to 16 threads on my RStudio Server instance.
First I get a baseline benchmark using a simple sum over a numeric column.

[Figure: simple_sum_DT]

Then I do basically the same thing, but apply uniqueN over a character column (a factor would give the same results).

[Figure: uniqueN_DT]

Here is the code for a reprex: https://github.com/phileas-condemine/repex_slow_uniqueN/blob/master/repex_slow_uniqueN.R
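For readers who cannot open the linked script, a comparison of this kind could be sketched roughly as follows. This is not the linked reprex: the table size, the column names (g, x, s) and the thread counts are assumptions chosen to keep the example small, and timings at this size will not be representative of 30M rows.

```r
library(data.table)

set.seed(1L)
n <- 1e5L  # assumed size, far smaller than the 30M rows in the report
DT <- data.table(
  g = sample(letters, n, TRUE),               # grouping column
  x = runif(n),                               # numeric column for sum()
  s = sample(as.character(1:50), n, TRUE)     # character column for uniqueN()
)

old_threads <- getDTthreads()
timings <- rbindlist(lapply(c(1L, 2L), function(th) {
  setDTthreads(th)
  data.table(
    threads = th,
    sum_elapsed = system.time(DT[, sum(x), by = g])[["elapsed"]],
    uniqueN_elapsed = system.time(DT[, uniqueN(s), by = g])[["elapsed"]]
  )
}))
setDTthreads(old_threads)  # restore the previous thread setting
print(timings)
```

Extending the vector of thread counts up to the number of available cores would show how each grouped operation scales.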

Additional info:

> library(data.table)
data.table 1.12.2 using 8 threads (see ?getDTthreads).

Here is my sessionInfo():

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)

Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
 [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.0.0        data.table_1.12.2    RevoUtils_11.0.1     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       rstudioapi_0.7   bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4 munsell_0.5.0   
 [7] colorspace_1.3-2 R6_2.2.2         rlang_0.4.0      plyr_1.8.4       dplyr_0.7.6      tools_3.5.1     
[13] grid_3.5.1       gtable_0.2.0     withr_2.1.2      lazyeval_0.2.1   assertthat_0.2.0 tibble_1.4.2    
[19] crayon_1.3.4     bindrcpp_0.2.2   purrr_0.2.5      glue_1.3.0       compiler_3.5.1   pillar_1.3.0    
[25] scales_0.5.0     pkgconfig_2.0.1 

Also, the output of lscpu:

> system("lscpu")
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Stepping:              4
CPU MHz:               3291.927
BogoMIPS:              6584.13
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm kaiser arat

Labels: duplicate, enhancement

All 4 comments

There are a couple of things here.

First point: uniqueN is parallelised. But its usage in j is not optimised to use GForce.

When you use .N in j:

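library(data.table)
# Build an n-row table with two character columns, each drawn from up to
# `card` distinct 5-character strings. Requires the openssl package for sha2().
foo <- function(n=3e8) {
  card <- 3000
  chars <- substr(openssl::sha2(as.character(1:card)), 1L, 5L)
  dist <- runif(card)  # unequal sampling weights, so group sizes vary
  DT <- data.table(
    A=sample(chars, n, TRUE, dist),
    B=sample(chars, n, TRUE, dist)
  )
  DT
}
set.seed(1L)
DT <- foo(5e7L)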

DT[, .N, by=B, verbose=TRUE]
# Detected that j uses these columns: <none>
# Finding groups using forderv ... 0.687s elapsed (1.131s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
# Getting back original order ... 0.001s elapsed (0.001s cpu)
# lapply optimization is on, j unchanged as '.N'
# GForce optimized j to '.N' ### <~~~~~~~~
# Making each group and running j (GForce TRUE) ... 1.301s elapsed (1.651s cpu)
#           B     N
#    1: 71ee4 18647
#    2: b1718 31722
#    3: 2c1f3 33496
#    4: 13b3f 31041
#    5: 12132 19033
#   ---
# 2994: 46635    20
# 2995: 5787a    23
# 2996: 7611f    57
# 2997: c30c6    39
# 2998: a8a2c    23

You can see that the expression is optimised to use GForce. This avoids having to switch between R and C to evaluate the expression in j for each group, which is quite inefficient.
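A quick way to check this for any j expression is to run it with verbose=TRUE and look for the "GForce optimized j" line, as in the output above. The sketch below uses a tiny made-up table (the values in A and B are assumptions for illustration). On the version discussed in this issue, the uniqueN call is not expected to show that line; other versions may behave differently.

```r
library(data.table)

# Tiny illustrative table; the values are made up.
DT <- data.table(
  A = c("x", "y", "x", "z"),
  B = c("a", "a", "b", "b")
)

# Row count per group: compare the verbose messages of the two calls.
n_rows <- DT[, .N, by = B, verbose = TRUE]

# Distinct values of A per group, evaluated with uniqueN in j.
n_distinct <- DT[, uniqueN(A), by = B, verbose = TRUE]
```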

Similarly, uniqueN needs to be optimised so that it can be used efficiently together with by.

Second point: Even then, GForce isn't implemented for use with := yet. It's only used for aggregation operations that return a length-1 vector per group. We'll need to extend GForce to also optimise :=.

When both of these are done, things should speed up.
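To make the distinction concrete, here is a small sketch of the two forms, using a made-up table with assumed column names g and x. The first call aggregates to one row per group; the second adds a column by reference, recycling the group result to every row of the group. Whether either call is GForce-optimised depends on the installed data.table version, which verbose=TRUE will report.

```r
library(data.table)

DT <- data.table(g = c("a", "a", "b"), x = c(1, 2, 5))

# Aggregation: j returns a length-1 value per group, one output row per group.
agg <- DT[, .(total = sum(x)), by = g]

# Update by reference: the per-group value is assigned to every row of the group.
DT[, total := sum(x), by = g]
```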

Until then, the best way to go about this (not benchmarked) would be:

unique(DT, by=c("A", "B"))[, .N, by=B]

I think this does what you want, but of course it returns an aggregated result, which means you'll have to join and update back to your original data.table. So it's roundabout. But if it's more performant than your current solution, perhaps you could write a wrapper function temporarily until this is implemented.
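The join and update step could look roughly like the sketch below. The table contents and the column name n_distinct_A are assumptions for illustration, and like the workaround itself this has not been benchmarked.

```r
library(data.table)

# Small made-up table standing in for the real data.
DT <- data.table(
  A = c("x", "y", "x", "x", "z", "w"),
  B = c("a", "a", "a", "b", "b", "c")
)

# Step 1: drop duplicate (A, B) pairs, then count rows per B.
# The count equals the number of distinct A values within each B.
counts <- unique(DT, by = c("A", "B"))[, .(n_distinct_A = .N), by = B]

# Step 2: join back on B and add the count to DT by reference.
DT[counts, on = "B", n_distinct_A := i.n_distinct_A]

# Cross-check against the direct (slower on large data) formulation.
direct <- DT[, .(n_distinct_A = uniqueN(A)), by = B]
```

Wrapping the two steps in a small function that takes the column names as arguments would give the temporary wrapper mentioned above.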

@arunsrinivasan Second part of the title looks like a dupe of https://github.com/Rdatatable/data.table/issues/1414

I agree that both parts are dups. Closing this as it's clearly a dup. But it would be nice to raise the priority on this one, since there seem to be some uniqueN performance-related issues of late.
