I'm using the latest dev version of data.table, and uniqueN() is still an order of magnitude slower than length(unique()). It is so slow that I think it should be tagged as a bug ... See the reprex example below.
Note: the code below uses 4 threads (on a Win7 computer). If we set the thread count to 1 (see the snippet below), the time cost is roughly halved but is still significantly slower than length(unique()).
(In fact, the reason I noticed this is that I have a daily routine script that takes maybe 20 minutes... and trying to improve its speed led me to the cause: uniqueN().)
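For reference, the single-thread comparison mentioned above can be reproduced by pinning the thread count with data.table's standard setDTthreads()/getDTthreads() controls; this is just a sketch of the setup, not part of the original reprex:

library(data.table)

getDTthreads()   # threads data.table will currently use
setDTthreads(1)  # pin to a single thread before re-running the benchmark below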
library(data.table)
set.seed(1000)

# generate one random "word" of between `min` and `max` characters
mk_rd_words <- function(min = 4, max = 20) {
  n <- floor(runif(1, min, max))
  paste0(sample(c(letters, LETTERS), size = n, replace = TRUE), collapse = '')
}

words <- vapply(1:1000, function(x) mk_rd_words(4, 50), FUN.VALUE = 'a')

n <- 1e4
tbl <- data.table(
  a = sample(words, size = n, replace = TRUE),
  b = sample(words, size = n, replace = TRUE)
)

microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr       min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 169.13260 171.651133 176.649940 173.972808 181.373316 201.70846   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   8.12066   8.607032   9.233874   8.738746   9.208014  16.28779   100
Created on 2019-08-02 by the reprex package (v0.2.1)
library(data.table)
set.seed(1000)

n <- 1e4
tbl <- data.table(
  a = sample(1:1e3, size = n, replace = TRUE),
  b = sample(1:1e3, size = n, replace = TRUE)
)

microbenchmark::microbenchmark(
  times = 100,
  tbl[, .(N = uniqueN(b)), keyby = a],
  tbl[, .(N = length(unique(b))), keyby = a]
)
#> Unit: milliseconds
#>                                        expr        min         lq       mean     median         uq       max neval
#>         tbl[, .(N = uniqueN(b)), keyby = a] 107.329319 111.531912 119.497038 115.412347 124.303334 158.04780   100
#>  tbl[, .(N = length(unique(b))), keyby = a]   5.777745   5.980306   6.992722   6.314294   7.678619  13.08759   100
Created on 2019-08-02 by the reprex package (v0.2.1)
Well, I see. For very large vectors, uniqueN() is indeed faster. The problem is that when uniqueN() is called millions of times (e.g. when there are many subgroups), the slowdown is significant.
Maybe for small atomic vectors, we could just use length(unique()) instead? (A sketch of such a wrapper follows the benchmark below.)
(Uses 4 threads on macOS.)
library(data.table)
set.seed(1000)
# small vector, uniqueN() is slower
x <- sample(1:1e5, size = 1e2, replace = TRUE)
microbenchmark::microbenchmark(
  uniqueN(x),
  length(unique(x))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> uniqueN(x) 20.744 21.3460 32.05373 21.7050 22.226 1010.481 100
#> length(unique(x)) 4.757 5.3565 6.18967 5.9215 6.698 18.199 100
# big vector, uniqueN() is faster
x <- sample(1:1e5, size = 1e6, replace = TRUE)
microbenchmark::microbenchmark(
  uniqueN(x),
  length(unique(x))
)
#> Unit: milliseconds
#>               expr       min       lq     mean   median       uq      max neval
#>         uniqueN(x)  9.176062 10.31124 11.55733 10.80676 12.44610 16.72396   100
#>  length(unique(x)) 20.834950 22.74216 26.23369 25.77416 27.84571 43.23249   100
Created on 2019-08-02 by the reprex package (v0.2.1)
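Pending a fix, one possible user-side workaround is a small wrapper that falls back to length(unique()) below some length threshold. This is only a sketch: the name n_unique and the 1e4 cutoff are my own guesses based on the timings above, not anything provided by data.table.

library(data.table)

# hypothetical wrapper: length(unique()) for short vectors, uniqueN() otherwise
# the 1e4 cutoff is a rough guess from the benchmarks above, not a tuned value
n_unique <- function(x, cutoff = 1e4) {
  if (length(x) < cutoff) length(unique(x)) else uniqueN(x)
}

n_unique(sample(letters, 50, replace = TRUE))    # small input -> length(unique())
n_unique(sample(1:1e5, 1e6, replace = TRUE))     # large input -> uniqueN()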
It makes sense to include the number of cores in those timings.
Thanks for the reminder. It uses 4 threads, and I've included the thread count in my original post above.
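For completeness, the thread setup can be printed alongside the timings with data.table's own helper:

data.table::getDTthreads(verbose = TRUE)  # reports the thread/OpenMP configuration data.table uses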
Another example, perhaps also needing "n-cores" expansion: https://stackoverflow.com/questions/60623235/r-why-dplyr-counts-unique-values-n-distinct-by-groups-faster-than-data-table
The same happens to me.
These two lines produce the same result on my data (with 100000 different IDs and around 5 rows within each ID).
DT[, if (.N == uniqueN(test)) .SD, by = ID]
DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"]
But the former is 10 times slower.
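For reproducibility, here is a minimal sketch of data matching that description (100,000 IDs, about 5 rows each); the exact generation is my assumption, not the original data:

library(data.table)
set.seed(42)

n_ids <- 1e5
DT <- data.table(
  ID   = rep(seq_len(n_ids), each = 5),           # ~5 rows per ID
  test = sample(1:10, n_ids * 5, replace = TRUE)  # values checked for duplicates within each ID
)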
@skanskan can you confirm that timing is on current master? There was a related commit just yesterday. My understanding is that the timing should be better (though probably not yet equivalent).
Right now I'm using the stable version because I'm doing calculations for an article and didn't want to mess with versions. But I'm going to try the master version.
OK, I have just compared data.table 1.12.8 vs 1.12.9
Windows 10 x64. i7 7700HQ 8 cores. 16GB RAM.
DT[, if (.N == uniqueN(test)) .SD, by = ID] improved from 44.3s to 17.5s,
but
DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"] is now much slower: it went from 1.9s to 6.2s. The slow part seems to be [, if (!any(N > 1)) .SD, by = ID].
I have also tried
DT[, if (!anyDuplicated(test)) .SD, by = ID], which was 1.3s and is now 1.4s.
DT[, if (.N == length(unique(test))) .SD, by = ID] is 1.6s.
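For reference, the four variants can be timed side by side on a table like the sketch above (the expression names here are just my labels for the benchmark); absolute timings will of course depend on the data.table version and hardware:

microbenchmark::microbenchmark(
  times = 10,
  uniqueN    = DT[, if (.N == uniqueN(test)) .SD, by = ID],
  count_N    = DT[, c(.SD, .N), by = .(ID, test)][, if (!any(N > 1)) .SD, by = ID][, -"N"],
  anyDup     = DT[, if (!anyDuplicated(test)) .SD, by = ID],
  len_unique = DT[, if (.N == length(unique(test))) .SD, by = ID]
)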
@shrektan any idea which version of DT you were using when you reported this issue?
@skanskan j = if (...) .SD looks to be a different issue. Could you please file a new issue presenting the regression?
1.12.8
Unit: milliseconds
                                       expr       min        lq      mean    median        uq     max neval
        tbl[, .(N = uniqueN(b)), keyby = a] 22.487731 23.279140 26.204391 24.351137 25.483656 54.6224   100
 tbl[, .(N = length(unique(b))), keyby = a]  6.958176  7.077496  8.084379  7.321736  8.482808 21.8727   100
master
Unit: milliseconds
                                       expr       min        lq      mean   median        uq      max neval
        tbl[, .(N = uniqueN(b)), keyby = a] 18.444103 19.182902 20.200975 19.83011 20.730743 40.06141   100
 tbl[, .(N = length(unique(b))), keyby = a]  6.995607  7.304626  7.820315  7.48119  8.519765 10.77755   100
4 cores / 50%
@jangorecki I think I was using the dev version of data.table at the time of writing, so it should be version 1.12.3 (https://github.com/Rdatatable/data.table/commit/2986736264ac4d776133b19b9a9872519b58879a).
BTW, I get basically the same result as you do using version 1.13.0 - still slower (2-3 times), but not by as much as it was in 1.12.3 (10-30 times).
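For anyone reproducing these numbers, a quick way to confirm which build is in use:

packageVersion("data.table")  # e.g. 1.12.3 (dev), 1.12.8, or 1.13.0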