Data.table: implement guniqueN

Created on 16 Apr 2015 · 11 comments · Source: Rdatatable/data.table

With the most recent data.table. Not always, but quite often, uniqueN by group is no faster than plain length(unique(x)), and sometimes it is noticeably slower:

library(data.table)
library(microbenchmark)
N <- 1e6
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 85.58602 85.58602 85.58602 85.58602 85.58602 85.58602     1
#                          DT[, uniqueN(x), y] 92.71877 92.71877 92.71877 92.71877 92.71877 92.71877     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 97.51024 97.51024 97.51024 97.51024 97.51024 97.51024     1
N <- 1e7
DT <- data.table(x = sample(1e5,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212 1642.5212     1
#                          DT[, uniqueN(x), y]  843.0670  843.0670  843.0670  843.0670  843.0670  843.0670     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  804.7881  804.7881  804.7881  804.7881  804.7881  804.7881     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 3.025365 3.025365 3.025365 3.025365 3.025365 3.025365     1
#                          DT[, uniqueN(x), y] 4.734323 4.734323 4.734323 4.734323 4.734323 4.734323     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 5.905721 5.905721 5.905721 5.905721 5.905721 5.905721     1
N <- 1e7
DT <- data.table(x = sample(1e3,N,TRUE), y = sample(1e5,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: seconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 2.906589 2.906589 2.906589 2.906589 2.906589 2.906589     1
#                          DT[, uniqueN(x), y] 4.731925 4.731925 4.731925 4.731925 4.731925 4.731925     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 7.084020 7.084020 7.084020 7.084020 7.084020 7.084020     1
N <- 1e7
DT <- data.table(x = sample(1e6,N,TRUE), y = sample(1e2,N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr      min       lq     mean   median       uq      max neval
#                   DT[, length(unique(x)), y] 1331.244 1331.244 1331.244 1331.244 1331.244 1331.244     1
#                          DT[, uniqueN(x), y]  998.040  998.040  998.040  998.040  998.040  998.040     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"] 1096.867 1096.867 1096.867 1096.867 1096.867 1096.867     1
N <- 1e7
DT <- data.table(x = sample(letters,N,TRUE), y = sample(letters[1:10],N,TRUE))
microbenchmark(times=1L,
               DT[, length(unique(x)),y],
               DT[, uniqueN(x),y],
               DT[, uniqueN(.SD), by="y", .SDcols="x"])
# Unit: milliseconds
#                                         expr       min        lq      mean    median        uq       max neval
#                   DT[, length(unique(x)), y] 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865 1304.4865     1
#                          DT[, uniqueN(x), y]  573.8628  573.8628  573.8628  573.8628  573.8628  573.8628     1
#  DT[, uniqueN(.SD), by = "y", .SDcols = "x"]  528.3269  528.3269  528.3269  528.3269  528.3269  528.3269     1

Related SO: http://stackoverflow.com/a/29684533/2490497
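
For context, the point of a GForce-style guniqueN is to compute all the per-group distinct counts in one pass, instead of calling uniqueN (and therefore forder) once per group. Below is a minimal sketch of the idea in plain R; it is an illustration, not data.table internals, and guniqueN_sketch is a hypothetical name:

guniqueN_sketch <- function(x, g) {
  # one global sort on (group, value), analogous to a single forder call
  o <- order(g, x)
  gs <- g[o]
  xs <- x[o]
  # a new distinct value starts wherever the group or the value changes
  new <- c(TRUE, gs[-1] != gs[-length(gs)] | xs[-1] != xs[-length(xs)])
  # sum the run starts within each group to get its distinct count
  tapply(new, gs, sum)
}

For the tables above, guniqueN_sketch(DT$x, DT$y) should agree with DT[, uniqueN(x), y] up to the ordering of groups.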

R version 3.1.3 (2015-03-09)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.2 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_DK.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5     microbenchmark_1.4-2

loaded via a namespace (and not attached):
 [1] bitops_1.0-6     chron_2.3-45     colorspace_1.2-4 devtools_1.7.0   digest_0.6.8     evaluate_0.5.5   formatR_1.0      ggplot2_1.0.0    grid_3.1.3      
[10] gtable_0.1.2     httr_0.6.1       knitr_1.8        MASS_7.3-37      munsell_0.4.2    plyr_1.8.1       proto_0.3-10     Rcpp_0.11.4      RCurl_1.95-4.5  
[19] reshape2_1.4.1   scales_0.2.4     stringr_0.6.2    tools_3.1.3    
Labels: GForce, enhancement, performance

All 11 comments

Came looking for this. I run into this issue a lot; my most recent case was unbearably slow. It looks more like this:

dt <- data.table(
  A=sample(100000, 1000000, replace=TRUE), 
  B=sample(100000, 1000000, replace=TRUE), 
  C=sample(1000000, 1000000, replace=TRUE)
)

# slow
system.time(result1 <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  12.132   0.038  12.178 

# fast
system.time(result2 <- dt[, list(1), keyby=list(A, B, C)][, list(UniqueCs=.N), keyby=list(A, B)])
#    user  system elapsed 
#   0.374   0.013   0.387 

I'd expect uniqueN to take about as long as aggregating by its argument.
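
A quick sanity check that the fast two-step route really matches uniqueN (worth verifying, since both results are keyed by (A, B)):

# should be TRUE if the two aggregations agree
all.equal(result1, result2)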

Confirming @ben519's timings...

Ran on 1.9.6:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#   8.032   0.004   8.029 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.496   0.004   0.498 

Ran on 1.9.7:

system.time(result <- dt[, list(UniqueCs=uniqueN(C)), keyby=list(A, B)])
#    user  system elapsed 
#  11.764   0.488   9.706 
system.time(result <- dt[, list(UniqueCs=.N), keyby=list(A, B, C)])
#    user  system elapsed 
#   0.100   0.008   0.109 

(I missed his edit, but the difference is marginal)
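
As an aside, verbose mode is a handy way to see that uniqueN in j is not GForce-optimized, which is exactly what this issue asks for; datatable.verbose is a real option, though the exact wording of its messages varies by version:

options(datatable.verbose = TRUE)
dt[, list(UniqueCs = uniqueN(C)), keyby = list(A, B)]  # verbose output reports whether GForce was applied
options(datatable.verbose = FALSE)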

An ideal case where uniqueN is much faster than the alternatives (list, i.e. non-scalar, input here):

https://stackoverflow.com/a/53890905/3576984

When used with a by argument and many groups, uniqueN()'s slowness is very bad:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  times = 2
)
Unit: milliseconds
                                                expr        min         lq       mean     median         uq        max neval cld
        irisdt[, uniqueN(Sepal.Width), Sepal.Length] 3592.42280 3592.42280 3592.65173 3592.65173 3592.88065 3592.88065     2   b
 irisdt[, length(unique(Sepal.Width)), Sepal.Length]   73.84953   73.84953   79.74312   79.74312   85.63672   85.63672     2  a 

Using unique directly is the best way to go:

irisdt <- setDT(iris[sample(1:150, size = 10000, replace = TRUE), ])
irisdt[, Sepal.Width := Sepal.Width + sample(0:50, size = 10000, replace = TRUE)]
irisdt[, Sepal.Length := Sepal.Width + sample(0:5000, size = 10000, replace = TRUE)]


microbenchmark::microbenchmark(
  irisdt[, uniqueN(Sepal.Width), Sepal.Length],
  irisdt[, length(unique(Sepal.Width)), Sepal.Length],
  unique(irisdt, by = c('Sepal.Length', 'Sepal.Width'))[ , .N, by = Sepal.Length],
  times = 100
)
# Unit: milliseconds
#                                                                            expr        min         lq
#                                    irisdt[, uniqueN(Sepal.Width), Sepal.Length] 235.857762 284.023470
#                             irisdt[, length(unique(Sepal.Width)), Sepal.Length]  56.797016  70.049278
#  unique(irisdt, by = c("Sepal.Length", "Sepal.Width"))[, .N, by = Sepal.Length]   4.076486   4.652738
#        mean     median         uq       max neval
#  370.566000 328.691682 392.016670 968.17539   100
#   73.354643  72.797490  74.845989 130.11590   100
#    5.489569   4.801915   5.080524  55.50387   100

Or:

setkey(irisdt, Sepal.Width, Sepal.Length)
irisdt[, .N, by = .(Sepal.Width, Sepal.Length)][, .N, by = Sepal.Length]

which is about 30% faster than the unique approach and about 8x faster than length(unique()).

But this seems beside the point: the fact remains that uniqueN is about 70x slower than length(unique()).

Not sure about GitHub etiquette... should I reply? Anyway, I just wanted to point out that uniqueN() performs particularly badly in this setting. That would be fine, except one has come to expect anything data.table to outperform anything else in almost any setting, so maybe there is an issue here? My actual application is somewhat different, but I'm doing fine using uniqueN2 <- function(x) length(unique(x)), which also does much better than dplyr::n_distinct().

You're absolutely right that there's a problem with uniqueN, and thanks for the reproducible benchmark!

I just wanted to suggest valid alternatives in the meantime.


Related #3395, #3438
The root of this problem is that uniqueN is called for every group. uniqueN calls forder, which is multithreaded, so a new team of OpenMP threads has to be formed for every group. This will be resolved by implementing a guniqueN function.
Additionally, we could force calls in j which are not gfun to be single-threaded (at least our own, by locally setting DTthreads to 1) @mattdowle. That would "resolve" this and similar problems, though it might still result in slower performance when there are very few, big groups.
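
A minimal sketch of that workaround as it can be applied from user code today, assuming setDTthreads() returns the previous thread count (its documented behaviour):

old <- setDTthreads(1)             # avoid forming an OpenMP thread team per group
res <- DT[, uniqueN(x), by = y]    # the non-GForce j now runs single-threaded
setDTthreads(old)                  # restore the previous setting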

Another case where setting threads to 1 would probably help is the new fifelse function: https://github.com/Rdatatable/data.table/commit/93cc9ab86c6dd8e116662135330fc3de0799426e
