Benchmarks on the most recent data.table. uniqueN() is not always faster than length(unique()), but quite often it is...
library(data.table)
library(microbenchmark)
N <- 1e6
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 85.58602 85.58602 85.58602 85.58602 85.58602 85.58602 1
# DT[, uniqueN(x), y] 92.71877 92.71877 92.71877 92.71877 92.71877 92.71877 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 97.51024 97.51024 97.51024 97.51024 97.51024 97.51024 1
N <- 1e7
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1
# DT[, uniqueN(x), y] 843.0670 843.0670 843.0670 843.0670 843.0670 843.0670 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 804.7881 804.7881 804.7881 804.7881 804.7881 804.7881 1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: seconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 3.025365 3.025365 3.025365 3.025365 3.025365 3.025365 1
# DT[, uniqueN(x), y] 4.734323 4.734323 4.734323 4.734323 4.734323 4.734323 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 5.905721 5.905721 5.905721 5.905721 5.905721 5.905721 1
N <- 1e7
DT <- data.table(x = sample(1e3,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: seconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 2.906589 2.906589 2.906589 2.906589 2.906589 2.906589 1
# DT[, uniqueN(x), y] 4.731925 4.731925 4.731925 4.731925 4.731925 4.731925 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 7.084020 7.084020 7.084020 7.084020 7.084020 7.084020 1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 1331.244 1331.244 1331.244 1331.244 1331.244 1331.244 1
# DT[, uniqueN(x), y] 998.040 998.040 998.040 998.040 998.040 998.040 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 1096.867 1096.867 1096.867 1096.867 1096.867 1096.867 1
N <- 1e7
DT <- data.table(x = sample(letters,N,TRUE), y = sample(letters[1:10],N,TRUE))
microbenchmark(times=1L,
DT[, length(unique(x)), y],
DT[, uniqueN(x), y],
DT[, uniqueN(.SD), by = "y", .SDcols = "x"])
# Unit: milliseconds
# expr min lq mean median uq max neval
# DT[, length(unique(x)), y] 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1
# DT[, uniqueN(x), y] 573.8628 573.8628 573.8628 573.8628 573.8628 573.8628 1
# DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 528.3269 528.3269 528.3269 528.3269 528.3269 528.3269 1
Related SO: http://stackoverflow.com/a/29684533/2490497
R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_DK.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.5 microbenchmark_1.4-2
loaded via a namespace (and not attached):
[1] bitops_1.0-6 chron_2.3-45 colorspace_1.2-4 devtools_1.7.0 digest_0.6.8 evaluate_0.5.5 formatR_1.0 ggplot2_1.0.0 grid_3.1.3
[10] gtable_0.1.2 httr_0.6.1 knitr_1.8 MASS_7.3-37 munsell_0.4.2 plyr_1.8.1 proto_0.3-10 Rcpp_0.11.4 RCurl_1.95-4.5
[19] reshape2_1.4.1 scales_0.2.4 stringr_0.6.2 tools_3.1.3
Came looking for this. I run into this issue a lot, my most recent case being unbearably slow. My case looks more like this:
dt <- data.table(
A=sample(100000, 1000000, replace=TRUE),
B=sample(100000, 1000000, replace=TRUE),
C=sample(1000000, 1000000, replace=TRUE)
)
# slow
system.time(result1 <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
# user system elapsed
# 12.132 0.038 12.178
# fast
system.time(result2 <- dt[, list(1), keyby=list(A, B, C)][, list(UniqueCs=.N), keyby=list(A, B)])
# user system elapsed
# 0.374 0.013 0.387
I'd think uniqueN(C) grouped by (A, B) should take about as long as aggregating grouped by (A, B, C).
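For what it's worth, the same result can be had with unique() directly; a sketch using the same dt as above, assuming nothing beyond unique() and .N from data.table:
# Deduplicate on (A, B, C) first, then count rows per (A, B); this is the
# number of distinct C per (A, B), i.e. the same result as result1/result2.
result3 <- unique(dt, by = c("A", "B", "C"))[, list(UniqueCs = .N), keyby = list(A, B)]
identical(result1, result3)  # should be TRUE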
Confirming timings of @ben519...
Ran on 1.9.6:
system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
# user system elapsed
# 8.032 0.004 8.029
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
# user system elapsed
# 0.496 0.004 0.498
Ran on 1.9.7:
system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
# user system elapsed
# 11.764 0.488 9.706
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
# user system elapsed
# 0.100 0.008 0.109
(I missed his edit, but the difference is marginal)
Update if improved:
Ideal case where uniqueN is much faster than alternatives: list (i.e. non-scalar) input.
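For context, a tiny sketch of that non-scalar case, assuming nothing beyond uniqueN() itself: it accepts a data.table and counts unique rows, which length(unique()) on a single vector cannot express.
d <- data.table(a = c(1, 1, 2), b = c(1, 1, 3))
uniqueN(d)       # 2: counts unique rows of the whole table
nrow(unique(d))  # 2: the base-style alternative it outperforms there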
When used with a by argument and many groups, uniqueN()'s slowness is severe:
irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]
microbenchmark::microbenchmark(
irisdt[, uniqueN(Sepal.Width), Sepal.Length],
irisdt[, length(unique(Sepal.Width)), Sepal.Length],
times = 2
)
# Unit: milliseconds
#                                                 expr        min         lq       mean     median         uq        max neval cld
#         irisdt[, uniqueN(Sepal.Width), Sepal.Length] 3592.42280 3592.42280 3592.65173 3592.65173 3592.88065 3592.88065     2   b
#  irisdt[, length(unique(Sepal.Width)), Sepal.Length]   73.84953   73.84953   79.74312   79.74312   85.63672   85.63672     2   a
Using unique() is the best way to go:
irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]
microbenchmark::microbenchmark(
irisdt[, uniqueN(Sepal.Width), Sepal.Length],
irisdt[, length(unique(Sepal.Width)), Sepal.Length],
unique(irisdt, by = c('Sepal.Length', 'Sepal.Width'))[ , .N, by = Sepal.Length],
times = 100
)
# Unit: milliseconds
#                                                                             expr        min         lq       mean     median         uq       max neval
#                                     irisdt[, uniqueN(Sepal.Width), Sepal.Length] 235.857762 284.023470 370.566000 328.691682 392.016670 968.17539   100
#                              irisdt[, length(unique(Sepal.Width)), Sepal.Length]  56.797016  70.049278  73.354643  72.797490  74.845989 130.11590   100
#  unique(irisdt, by = c("Sepal.Length", "Sepal.Width"))[, .N, by = Sepal.Length]    4.076486   4.652738   5.489569   4.801915   5.080524  55.50387   100
Or:
setkey(irisdt, Sepal.Width, Sepal.Length)
irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length]
which is about 30% faster than unique() and about 8x faster than length(unique()).
But this seems beside the point, which is that uniqueN() is about 70x slower than length(unique()).
Not sure about GitHub etiquette... should I reply? Anyway, I just wanted to point out that uniqueN() performs particularly badly in this setting. That is OK in itself, but one has come to expect anything data.table to outperform anything else in almost any setting, so maybe there is an issue here? My actual application is somewhat different, but I'm doing fine using uniqueN2 <- function(x) length(unique(x)), which also does much better than dplyr::n_distinct().
You're absolutely right that there's a problem with uniqueN, and thanks for the reproducible benchmark!
I just wanted to suggest valid alternatives in the meantime.
Related #3395, #3438
The root of this problem is that uniqueN is called for every group. uniqueN calls forder, which is multithreaded, so a new team of OpenMP threads has to be formed for every group. This will be resolved when a guniqueN function is implemented.
Additionally, we could force calls in j that are not a gfun to run single-threaded (at least our own functions, by locally setting DTthreads to 1) @mattdowle. That would "resolve" this and similar problems, though it might still result in slower performance when there are very few big groups.
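To see which path a given j takes, data.table's verbose mode helps; a small sketch (the exact message wording varies by version):
DT <- data.table(x = sample(1e5, 1e6, TRUE), y = sample(100, 1e6, TRUE))
options(datatable.verbose = TRUE)
DT[, sum(x), by = y]      # sum is a gfun: verbose reports j as GForce-optimized
DT[, uniqueN(x), by = y]  # uniqueN is not: it is evaluated once per group
options(datatable.verbose = FALSE)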
Another case where setting threads to 1 would probably help is the new fifelse function: https://github.com/Rdatatable/data.table/commit/93cc9ab86c6dd8e116662135330fc3de0799426e
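A minimal sketch of that single-thread workaround as a user-level stopgap, assuming only setDTthreads()/getDTthreads() from data.table and the dt from the earlier benchmarks (timings will vary):
old <- getDTthreads()
setDTthreads(1L)  # avoid spinning up a fresh OpenMP thread team per group
res <- dt[, list(UniqueCs = uniqueN(C)), keyby = list(A, B)]
setDTthreads(old)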