I noticed that merging data.tables on a character column is surprisingly slow (compared to dplyr). See the reprex below.
Am I missing something obvious or is there room for improvement? Don't get me wrong: I love data.table and use it daily, but I was very surprised that this kind of merge is so comparatively slow.
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 1774.640 1774.640 1774.640 1774.640 1774.640 1774.640 1
#> dplyr 35.852 35.852 35.852 35.852 35.852 35.852 1
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 6.3679 6.3679 6.3679 6.3679 6.3679 6.3679 1
#> dplyr 56.9153 56.9153 56.9153 56.9153 56.9153 56.9153 1
dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
dplyr = inner_join(dfa_int, dfb_int, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 5.4953 5.4953 5.4953 5.4953 5.4953 5.4953 1
#> dplyr 14.6955 14.6955 14.6955 14.6955 14.6955 14.6955 1
Created on 2020-10-02 by the reprex package (v0.3.0)
sessionInfo()devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.2)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.2)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 3.6.2)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.2)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.2)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.3)
#> rlang 0.4.6 2020-05-02 [1] CRAN (R 3.6.3)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
#> stringi 1.4.4 2020-01-09 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.3)
#> vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Program Files/R/library
PS: Not sure if this is already adressed by #4538.
Interestingly, this issue does not occur on my Linux machine (see below). I'll investigate further.
library(data.table)
library(dplyr)
#>
#> Attache Paket: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 34.59378 34.59378 34.59378 34.59378 34.59378 34.59378 1
#> dplyr 34.98623 34.98623 34.98623 34.98623 34.98623 34.98623 1
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = merge.data.table(dfa_char, dfb_char, by = "id", sort = FALSE),
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 3.570949 3.570949 3.570949 3.570949 3.570949 3.570949 1
#> dplyr 14.952365 14.952365 14.952365 14.952365 14.952365 14.952365 1
dfa_int <- data.table(id = seq_len(10000))
dfb_int <- data.table(id = seq_len(100000), val = rnorm(100000))
microbenchmark(
data.table = merge.data.table(dfa_int, dfb_int, by = "id", sort = FALSE),
dplyr = inner_join(dfa_int, dfb_int, by = "id"),
times = 1L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 2.410635 2.410635 2.410635 2.410635 2.410635 2.410635 1
#> dplyr 8.584089 8.584089 8.584089 8.584089 8.584089 8.584089 1
Created on 2020-10-02 by the reprex package (v0.3.0)
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os elementary OS 5.1.7 Hera
#> system x86_64, linux-gnu
#> ui X11
#> language de_DE
#> collate de_DE.UTF-8
#> ctype de_DE.UTF-8
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.18 2020-09-29 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/matthias/R/x86_64-pc-linux-gnu-library/4.0
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
Could you use times=2? It may be caused by https://github.com/Rdatatable/data.table/issues/2912 which on Windows can be even more severe.
Also note that merge.data.table just wraps around [ so for benchmarking the latter should be preferred.
dfa_char[dfb_char, on = "id", nomatch = NULL]
Reinstalling all packages did not have an effect on Windows. Is this related to the old R version (3.6.1)?
Reinstalling won't make difference.
Did you try times=2L, does the timings differ?
Also note that there is an open issue that your issue would probably be a duplicate of https://github.com/Rdatatable/data.table/issues/3928
Sorry, somehow missed your first comment before. Here are the benchmarks with times = 2:
library(data.table)
#> Warning: package 'data.table' was built under R version 3.6.3
library(dplyr)
#> Warning: package 'dplyr' was built under R version 3.6.3
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:data.table':
#>
#> between, first, last
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(microbenchmark)
#> Warning: package 'microbenchmark' was built under R version 3.6.2
n = 2L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 1698.4893 1698.4893 1758.1653 1758.1653 1817.8414 1817.8414 2
#> dplyr 41.1121 41.1121 44.3698 44.3698 47.6275 47.6275 2
Created on 2020-10-02 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate German_Germany.1252
#> ctype German_Germany.1252
#> tz Europe/Berlin
#> date 2020-10-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.2)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 3.6.3)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 3.6.3)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.2)
#> data.table * 1.13.0 2020-07-24 [1] CRAN (R 3.6.3)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.2)
#> devtools 2.3.0 2020-04-10 [1] CRAN (R 3.6.3)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 3.6.3)
#> dplyr * 1.0.0 2020-05-29 [1] CRAN (R 3.6.3)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 3.6.2)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 3.6.2)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 3.6.3)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.2)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.2)
#> knitr 1.28 2020-02-06 [1] CRAN (R 3.6.3)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.3)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.2)
#> microbenchmark * 1.4-7 2019-09-24 [1] CRAN (R 3.6.2)
#> pillar 1.4.4 2020-05-05 [1] CRAN (R 3.6.3)
#> pkgbuild 1.0.8 2020-05-07 [1] CRAN (R 3.6.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.2)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.2)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.2)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.2)
#> purrr 0.3.3 2019-10-18 [1] CRAN (R 3.6.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.2)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 3.6.3)
#> remotes 2.1.1 2020-02-15 [1] CRAN (R 3.6.3)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 3.6.3)
#> rmarkdown 2.1 2020-01-20 [1] CRAN (R 3.6.3)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.2)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 3.6.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 3.6.3)
#> tibble 3.0.1 2020-04-20 [1] CRAN (R 3.6.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.3)
#> usethis 1.6.0 2020-04-09 [1] CRAN (R 3.6.3)
#> vctrs 0.3.1 2020-06-05 [1] CRAN (R 3.6.3)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.2)
#> xfun 0.12 2020-01-13 [1] CRAN (R 3.6.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.2)
#>
#> [1] C:/Program Files/R/library
I reinstalled the packages because I once had an issue with packages compiled for another R version...
I have Windows 10 / data.table 1.13.0 / R 4.0.2. dplyr is somewhat faster for character fields on my machine but not the large difference that @Plebejer is experiencing.
library(data.table)
library(dplyr)
library(microbenchmark)
n = 10L # run benchmarks twice
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> data.table 55.1987 59.1335 64.47342 60.95335 63.0049 86.6976 10
#> dplyr 36.0711 36.8935 43.36965 42.81990 48.2768 54.2212 10
I also tried it with varying threads between 1-8. There were no large effects on timings.
According to my test:
Using setkey on both dfa and dfb changes the dynamic and makes data.table about 3x faster than dplyr on macOS.
library(data.table)
library(dplyr)
library(microbenchmark)
n = 2L
dfa_char <- data.table(id = as.character(seq_len(10000)))
dfb_char <- data.table(id = as.character(seq_len(100000)), val = rnorm(100000))
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 36.99652 36.99652 42.68058 42.68058 48.36465 48.36465 2
# dplyr 16.11978 16.11978 29.98812 29.98812 43.85645 43.85645 2
setkey(dfa_char, id)
setkey(dfb_char, id)
microbenchmark(
data.table = dfa_char[dfb_char, on = "id", nomatch = NULL],
dplyr = inner_join(dfa_char, dfb_char, by = "id"),
times = n
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# data.table 5.618789 5.618789 6.102372 6.102372 6.585954 6.585954 2
# dplyr 15.181681 15.181681 17.706647 17.706647 20.231614 20.231614 2
sessionInfoR version 4.0.2 Patched (2020-09-21 r79235)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-7 dplyr_1.0.0 data.table_1.13.1
[4] pacman_0.5.1
loaded via a namespace (and not attached):
[1] crayon_1.3.4 R6_2.4.1 lifecycle_0.2.0 magrittr_1.5
[5] pillar_1.4.6 rlang_0.4.7 vctrs_0.3.2 generics_0.0.2
[9] ellipsis_0.3.1 glue_1.4.1 purrr_0.3.4 compiler_4.0.2
[13] pkgconfig_2.0.3 tidyselect_1.1.0 tibble_3.0.3
It is because we don't have to find order of a character column anymore. Issue I linked describes the problem. I might have found a root cause of it already and shared that to Matt over slack. If we will not have it resolved for the release that is coming now, then I hope the one after will have this fix.
I described root cause in linked issue. While I think this issue is a duplicate of the other one, it is more safe to keep it open to ensure it will be resolved as well.
Also interesting finding by @shrektan is that R must have improved character handling on windows in R 4.0.
Most helpful comment
According to my test: