This issue follows a discussion during useR!2019 after the presentation of data.table by Arun @arunsrinivasan
Hello,
Thanks for the amazing job, I love data.table!
I am using uniqueN to verify l-diversity for anonymization purposes.
The data I am working with is around 30M rows, which data.table ingests easily.
Unfortunately, uniqueN is not as fast as other functions.
I tried to parallelize the grouping using setDTthreads, as I can go up to 16 threads on my RStudio Server instance.
First I get a baseline benchmark using a simple sum over a numeric column.

Then I do basically the same thing, but apply uniqueN over a character column (a factor would give the same results).
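The comparison described above can be sketched roughly as follows (a minimal illustration, not the actual reprex; the column names, group cardinality, and the scaled-down row count are assumptions):

```r
library(data.table)

setDTthreads(16L)          # the server instance has 16 threads available

n <- 1e6L                  # scaled down from the ~30M rows in the issue
set.seed(1L)
DT <- data.table(
  grp = sample(1e4L, n, TRUE),                       # grouping key
  num = runif(n),                                    # numeric column for the baseline
  chr = sample(sprintf("id%04d", 1:3000), n, TRUE)   # character column for uniqueN
)

# Baseline: sum() in j is GForce-optimised, so grouping runs entirely in C
system.time(DT[, sum(num), by = grp])

# Much slower: uniqueN() in j is not GForce-optimised,
# so j is re-evaluated in R once per group
system.time(DT[, uniqueN(chr), by = grp])
```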

Here is the code for a reprex: https://github.com/phileas-condemine/repex_slow_uniqueN/blob/master/repex_slow_uniqueN.R
Additional info:
> library(data.table)
data.table 1.12.2 using 8 threads (see ?getDTthreads).
Here is my sessionInfo():
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8
[4] LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.0.0 data.table_1.12.2 RevoUtils_11.0.1 RevoUtilsMath_11.0.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.1 rstudioapi_0.7 bindr_0.1.1 magrittr_1.5 tidyselect_0.2.4 munsell_0.5.0
[7] colorspace_1.3-2 R6_2.2.2 rlang_0.4.0 plyr_1.8.4 dplyr_0.7.6 tools_3.5.1
[13] grid_3.5.1 gtable_0.2.0 withr_2.1.2 lazyeval_0.2.1 assertthat_0.2.0 tibble_1.4.2
[19] crayon_1.3.4 bindrcpp_0.2.2 purrr_0.2.5 glue_1.3.0 compiler_3.5.1 pillar_1.3.0
[25] scales_0.5.0 pkgconfig_2.0.1
And the lscpu output:
> system("lscpu")
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 62
Model name: Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Stepping: 4
CPU MHz: 3291.927
BogoMIPS: 6584.13
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm kaiser arat
There are a couple of things here.
First point: uniqueN is parallelised, but its usage in j is not optimised to use GForce.
When you use .N in j:
require(data.table)
foo <- function(n = 3e8) {
  card <- 3000
  chars <- substr(openssl::sha2(as.character(1:card)), 1L, 5L)
  dist <- runif(card)
  DT <- data.table(
    A = sample(chars, n, TRUE, dist),
    B = sample(chars, n, TRUE, dist)
  )
  DT
}
set.seed(1L)
DT <- foo(5e7L)
DT[, .N, by=B, verbose=TRUE]
# Detected that j uses these columns: <none>
# Finding groups using forderv ... 0.687s elapsed (1.131s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
# Getting back original order ... 0.001s elapsed (0.001s cpu)
# lapply optimization is on, j unchanged as '.N'
# GForce optimized j to '.N' ### <~~~~~~~~
# Making each group and running j (GForce TRUE) ... 1.301s elapsed (1.651s cpu)
# B N
# 1: 71ee4 18647
# 2: b1718 31722
# 3: 2c1f3 33496
# 4: 13b3f 31041
# 5: 12132 19033
# ---
# 2994: 46635 20
# 2995: 5787a 23
# 2996: 7611f 57
# 2997: c30c6 39
# 2998: a8a2c 23
You can see that the expression is optimised to use GForce. This avoids having to switch between R and C when evaluating the expression in j for each group, which is quite inefficient.
Similarly, we need to optimise uniqueN so it can be used along with by efficiently.
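For contrast, running uniqueN in j with verbose=TRUE shows the unoptimised path (a small self-contained sketch; the toy data is an assumption, and the verbose output is omitted since it varies):

```r
library(data.table)
set.seed(1L)
DT <- data.table(A = sample(letters, 1e5L, TRUE),
                 B = sample(letters, 1e5L, TRUE))

# verbose=TRUE reports whether GForce kicked in; unlike .N or sum(),
# uniqueN() in j currently falls back to per-group evaluation in R
res <- DT[, uniqueN(A), by = B, verbose = TRUE]
```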
Second point: even then, GForce isn't implemented for use with := yet. It's only used for aggregation operations that return a length-1 vector. We'll need to extend GForce to also optimise :=.
When both of these are done, things should speed up.
Until then, the best way to go about this (not benchmarked) would be:
unique(DT, by=c("A", "B"))[, .N, by=B]
I think this does what you want, but of course it returns an aggregated result, which means you'll have to join and update back into your original data.table. A bit roundabout, but if it's more performant than your current solution, perhaps you could write a wrapper function as a stopgap until this is implemented.
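That join-and-update step might look like this (a sketch; the column name ldiv and the toy data are assumptions):

```r
library(data.table)
set.seed(1L)
DT <- data.table(A = sample(letters[1:5], 100L, TRUE),
                 B = sample(letters[1:5], 100L, TRUE))

# Count the distinct A values per B via the suggested workaround:
# deduplicate on (A, B) first, so .N per B is the number of distinct A
agg <- unique(DT, by = c("A", "B"))[, .N, by = B]

# Update-join the per-group count back onto the original table
DT[agg, ldiv := i.N, on = "B"]
```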
@arunsrinivasan The second part of the title looks like a dupe of https://github.com/Rdatatable/data.table/issues/1414
The first part is a dupe of https://github.com/Rdatatable/data.table/issues/1120
I agree that both parts are dupes; closing this as a duplicate. But it would be nice to raise the priority on this one, since there have been several uniqueN performance related issues of late.