The setkey function is much slower in all versions of data.table from 1.12.0 onward.
Context: I manipulate large datasets (~50 million rows, 50 columns) with data.table on a daily basis. I work with three different computers: an old legacy Windows 2008 server, a more recent Windows 10 server, and my local computer. The available versions of R and data.table differ significantly in each setting.
Problem: I noticed several times that the speed of the setkey function varies considerably depending on the setting I work in: for one of the datasets I work with (54 million rows with a key uniquely identifying each row), the setkey call may take 2 seconds or 13 minutes.
To find out where this came from, I ran the same code with several versions of data.table, from 1.10.4 to 1.13.2, in the three settings. The code and all session info are below. I found the same result every time: versions up to and including 1.11.8 are very fast, and later versions are much slower (approximately 200 to 400 times).
The table below gives the execution time of setkey on a fake dataset (5 million rows), measured with system.time().
| | R version | data.table version | user time | system time | elapsed time |
|----------------|-----------|------------|--------|--------|---------|
| New server | | | | | |
| | 3.6.3 | 1.10.4.3 | 5.72 | 0.15 | 5.73 |
| | 3.6.3 | 1.11.0 | 5.52 | 0.15 | 5.53 |
| | 3.6.3 | 1.11.2 | 5.54 | 0.14 | 5.55 |
| | 3.6.3 | 1.11.4 | 5.64 | 0.11 | 5.61 |
| | 3.6.3 | 1.11.6 | 5.48 | 0.19 | 5.55 |
| | 3.6.3 | 1.11.8 | 5.54 | 0.23 | 5.64 |
| | 3.6.3 | 1.12.0 | 11.67 | 19.15 | 76.69 |
| | 3.6.3 | 1.12.8 | 8.58 | 10.78 | 65.73 |
| | 3.6.3 | 1.13.2 | 10.55 | 24.20 | 70.95 |
| Legacy server | | | | | |
| | 3.3.3 | 1.10.4.3 | 3.57 | 0.08 | 3.52 |
| | 3.3.3 | 1.11.0 | 3.61 | 0.07 | 3.56 |
| | 3.3.3 | 1.11.2 | 3.37 | 0.17 | 3.42 |
| | 3.3.3 | 1.11.4 | 3.55 | 0.03 | 3.47 |
| | 3.3.3 | 1.11.6 | 3.45 | 0.06 | 3.38 |
| | 3.3.3 | 1.11.8 | 3.46 | 0.06 | 3.43 |
| | 3.3.3 | 1.12.0 | 8.47 | 16.19 | 39.59 |
| | 3.3.3 | 1.12.8 | 8.81 | 16.91 | 39.54 |
| | 3.3.3 | 1.13.2 | 8.14 | 13.61 | 39.49 |
| Local computer | | | | | |
| | 3.5.3 | 1.10.4.3 | 4.15 | 0.22 | 4.29 |
| | 3.5.3 | 1.11.0 | 4.28 | 0.04 | 4.17 |
| | 3.5.3 | 1.11.2 | 4.49 | 0.06 | 4.62 |
| | 3.5.3 | 1.11.4 | 4.35 | 0.12 | 4.40 |
| | 3.5.3 | 1.11.6 | 4.29 | 0.08 | 4.26 |
| | 3.5.3 | 1.11.8 | 4.21 | 0.06 | 4.21 |
| | 3.5.3 | 1.12.0 | 13.34 | 17.22 | 31.16 |
| | 3.5.3 | 1.12.8 | 12.92 | 15.73 | 28.62 |
| | 3.5.3 | 1.13.2 | 12.77 | 16.28 | 29.02 |
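As a rough check on the size of the regression in this 5-million-row benchmark, the elapsed times for 1.11.8 and 1.12.0 can be turned into ratios. This is only a sketch of the arithmetic: the values are copied from the table above, and the column names `v1_11_8`, `v1_12_0` and `slowdown` are made up for the illustration.

```r
# Elapsed times (seconds) copied from the table above:
# last fast version (1.11.8) versus first slow version (1.12.0).
elapsed <- data.frame(
  machine = c("New server", "Legacy server", "Local computer"),
  v1_11_8 = c(5.64, 3.43, 4.21),
  v1_12_0 = c(76.69, 39.59, 31.16)
)

# Ratio of elapsed times, rounded to one decimal place
elapsed$slowdown <- round(elapsed$v1_12_0 / elapsed$v1_11_8, 1)
print(elapsed)
```

On this artificial dataset the elapsed time grows by a factor of roughly 7 to 14, depending on the machine.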
This code installs several versions of data.table in separate libraries, and measures the execution time of setkey on an artificial dataset.
# The directory you use for the tests
test_dir <- # Do not forget to define the directory
# Keep the R version for the final table
Rversion <- paste0(R.version$major, ".", R.version$minor)
# All the versions we will test
versions <- list(
c(paste0("R", Rversion, "-", "dt1-10-4"), "1.10.4-3"),
c(paste0("R", Rversion, "-", "dt1-11-0"), "1.11.0"),
c(paste0("R", Rversion, "-", "dt1-11-2"), "1.11.2"),
c(paste0("R", Rversion, "-", "dt1-11-4"), "1.11.4"),
c(paste0("R", Rversion, "-", "dt1-11-6"), "1.11.6"),
c(paste0("R", Rversion, "-", "dt1-11-8"), "1.11.8"),
c(paste0("R", Rversion, "-", "dt1-12-0"), "1.12.0"),
c(paste0("R", Rversion, "-", "dt1-12-8"), "1.12.8"),
c(paste0("R", Rversion, "-", "dt1-13-2"), "1.13.2")
)
#############################
# Part 1: installing all data.table versions
# Function installing all versions of data.table in separate temporary libraries
install_version_dt <- function(infos) {
package_lib <- paste0(test_dir, infos[1])
package_version <- infos[2]
# Create temporary library
try(unlink(package_lib, recursive = TRUE))
dir.create(package_lib)
# Install package version
devtools::install_version("data.table", version = package_version, lib = package_lib)
# unloadNamespace() expects the namespace name as a string
try(unloadNamespace("data.table"))
}
# Install all data.table versions
lapply(versions, install_version_dt)
#############################
# Part 2: measuring execution time of setkey
# Function measuring the execution time of setkey on artificial data
# with different versions of data.table
test_version_dt <- function(package_version, Rversion) {
# Keep the old library paths
old_libpath <- .libPaths()
adresse_lib <- paste0(test_dir, package_version)
.libPaths(adresse_lib)
print(packageVersion("data.table"))
dt_version <- packageVersion("data.table")
library("data.table")
print(.libPaths())
set.seed(1L)
dt <- data.table::data.table(
x = as.character(sample(5e6L, 5e6L, FALSE)),
y = runif(100L))
results <- system.time(
{
data.table::setkey(dt, x, verbose = TRUE)
}
)
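# Make sure we unload the package; unloadNamespace() expects the name as a string
try(unloadNamespace("data.table"))
# Fallback in case the package is still attached
try(detach("package:data.table", unload = TRUE), silent = TRUE)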
# Restore the old library paths
.libPaths(old_libpath)
print(.libPaths())
return(
list(
"Rversion" = Rversion,
"dt_version" = as.character(dt_version),
"user.self" = as.numeric(results["user.self"]),
"sys.self" = as.numeric(results["sys.self"]),
"elapsed" = as.numeric(results["elapsed"])))
}
# make the list of all temporary libraries
folder_list <-
c(
unlist(lapply(versions, function(x) return(x[1])))
)
results_list <- lapply(folder_list, test_version_dt, Rversion)
#############################
# Part 3: summarizing results
results_df <- data.table::rbindlist(results_list)
print(results_df)
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.3.3
R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252
[4] LC_NUMERIC=C LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.6.3 parallel_3.6.3 tools_3.6.3 Rcpp_1.0.5 fst_0.9.4
R version 3.5.3 (2019-03-11)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
[3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.3 rstudioapi_0.11 magrittr_1.5 usethis_1.5.1
[5] devtools_2.2.2 pkgload_1.0.2 R6_2.4.1 rlang_0.4.6
[9] fansi_0.4.1 tools_3.5.3 pkgbuild_1.0.6 data.table_1.13.2
[13] sessioninfo_1.1.1 cli_2.0.2 withr_2.1.2 ellipsis_0.3.1
[17] remotes_2.1.1 yaml_2.2.1 assertthat_0.2.1 digest_0.6.25
[21] rprojroot_1.3-2 crayon_1.3.4 processx_3.4.2 callr_3.4.2
[25] fs_1.3.1 ps_1.3.2 testthat_2.3.1 memoise_1.1.0
[29] glue_1.4.1 compiler_3.5.3 desc_1.2.0 backports_1.1.5
[33] prettyunits_1.1.1
OK, on Windows... Let me guess: does your data contain lots of non-ASCII strings? If so, have you tried converting them to UTF-8 first? If not, would you mind converting them to UTF-8 (you may use my function below) and trying again?
set_utf8_dt <- function(x) {
stopifnot(data.table::is.data.table(x))
key <- data.table::key(x)
cols <- colnames(x)
cols_str <- cols[vapply(x, is.character, logical(1L))]
for (col in cols_str) {
data.table::set(x, i = NULL, j = col, value = enc2utf8(x[[col]]))
}
data.table::setnames(x, cols, enc2utf8(cols))
if (!is.null(key)) data.table::setkeyv(x, enc2utf8(key))
invisible(x[])
}
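Before converting anything, it may help to check whether a character column contains non-ASCII strings at all. The helper below is a minimal base-R sketch; `has_non_ascii` is a hypothetical name, not a data.table function, and it tests raw bytes one string at a time, so it is slow on millions of rows and is better run on a sample or on unique values.

```r
# Hypothetical helper: TRUE for each string containing at least one byte
# above 127, i.e. at least one non-ASCII character. NA entries give FALSE.
has_non_ascii <- function(x) {
  stopifnot(is.character(x))
  vapply(
    x,
    function(s) !is.na(s) && any(as.integer(charToRaw(s)) > 127L),
    logical(1L),
    USE.NAMES = FALSE
  )
}

# Digit-only strings, like the keys in the benchmark above, are plain ASCII
x <- c("abc", "12345", "caf\u00e9", NA)
print(has_non_ascii(x))
```

If `any(has_non_ascii(col))` is FALSE for every character column, encoding is unlikely to explain a slowdown.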
Thank you for your detailed and well prepared report.
The performance regression of setkey (internally forder) on character vectors is a known issue. It was initially identified in #3928 and later in #4733.
I am not closing this as a duplicate because it has very useful code.
Moreover, I would also like to see whether @shrektan's suggestions make any difference.
@shrektan : no, my data does not contain non-ASCII strings. As you can see in my report, I generated artificial data made only of integers and doubles, so I don't think that the problem comes from encoding problems.
@jangorecki : thank you, and sorry if this issue is a duplicate, I'm not familiar with the data.table repository. You can close it if you think it's appropriate.
After submitting the issue, I dived into the source files and found that the forder function was modified several times between versions 1.11.8 and 1.12.0. As a matter of fact, I just discovered the verbose option of setkey (I edited the code above to add it). Rerunning the code with this option, it becomes clear that the problem comes from forder being less performant than before.
My bad, I didn't notice that you included the data code, which is very nice :D
Unfortunately, I can't reproduce your result by using the following code on OSX, R4.0.3
library(data.table)
setDTthreads(4L) # use 1L or 4L to test if it's affected by the cores
set.seed(1L)
dt <- data.table::data.table(
x = as.character(sample(5e6L, 5e6L, FALSE)),
y = runif(100L))
system.time(
data.table::setkey(dt, x, verbose = TRUE)
)
Below are my results against v1.11.8 and the current dev version of data.table:
forder took 3.352 sec
reorder took 0.197 sec
user system elapsed
4.557 0.053 4.524
forder took 3.317 sec
reorder took 0.153 sec
user system elapsed
4.541 0.028 4.568
forder.c received 5000000 rows and 2 columns
forder took 7.14 sec
reorder took 0.069s elapsed (0.248s cpu)
user system elapsed
7.826 0.108 4.223
forder.c received 5000000 rows and 2 columns
forder took 3.514 sec
reorder took 0.135s elapsed (0.134s cpu)
user system elapsed
4.138 0.049 4.191
Maybe a Windows only issue?
Well, I still can't reproduce your results on Windows 10 x64, R4.0.1, with data.table v1.11.8 and the current dev version. The elapsed time is very close...
Note that I built both versions of data.table from source, and I don't know whether this affects the result.
@shrektan building from source vs. using pre-compiled binaries can impact performance. I don't know how it works on Windows, but on Linux some compiler flags control that, such as -mtune=native.
@oliviermeslin could you paste the following output?
readLines(system.file("cc", package="data.table"))
It gives the following output: "CC=gcc -std=gnu99" "CFLAGS=-O3". No idea what it means :smile:
@oliviermeslin These are the compilation flags that the compiler (gcc in this case) used when translating the C code into machine code. It would be helpful if you could install 1.13.2 from source and check whether there is a difference in performance.
You may also add the -mtune=native flag for the compiler. This tells the compiler to optimize the code for the current machine, which cannot be done when binaries are compiled on a different machine, as on CRAN.
To add this flag, just create a ~/.R/Makevars file with the following content:
CC=gcc
CFLAGS=-O3 -mtune=native
Note that you need Rtools for compiling from source on Windows: https://cran.r-project.org/bin/windows/Rtools/
Thanks for your suggestion, but I think I installed all packages from source, including the 1.13.2. I also have Rtools on all my computers. Does the output of readLines(system.file("cc", package="data.table")) suggest otherwise?
No, it doesn't.
I think we need to wait for the revisit of forder to figure out a fix for the performance regression.
Interestingly, I just reproduced it on R 3.6.1, and I double-checked that it's not reproducible on R 4.0.1.
Rversion dt_version user.self sys.self elapsed
1: 3.6.1 1.11.8 6.39 0.28 6.98
2: 3.6.1 1.13.2 9.33 1.94 11.84
Rversion dt_version user.self sys.self elapsed
1: 4.0.1 1.11.8 6.53 0.63 7.78
2: 4.0.1 1.13.2 6.23 0.33 7.34
@jangorecki: I agree.
@shrektan: This is good news. I'm currently trying to run my code on my fourth server (a Linux one, this time), to see whether the problem is specific to Windows. I'll let you know if it finally works.
@jangorecki : you wrote in your first reply:
Performance regression of setkey (internally forder) on character vector is a known issue.
I just thought this morning that in my case the performance problem exists for both character and integer vectors. I don't know whether it matters for solving this issue.
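A quick way to compare the two cases is to time setkey on the same shuffled values stored once as integers and once as strings. This is only a sketch, not the code used for the timings in this thread: `time_setkey` is a hypothetical helper, and `n` is kept small here so that it runs quickly.

```r
library(data.table)

# Hypothetical helper: elapsed time of setkey() on a shuffled key column,
# stored either as integer or as character.
time_setkey <- function(n, as_char) {
  set.seed(1L)
  key_values <- sample(n, n, FALSE)
  if (as_char) key_values <- as.character(key_values)
  dt <- data.table(x = key_values, y = runif(n))
  timing <- system.time(setkey(dt, x))
  stopifnot(identical(key(dt), "x"))
  timing[["elapsed"]]
}

n <- 1e5L  # raise toward 5e6L to match the benchmark in this thread
timings <- c(
  integer   = time_setkey(n, as_char = FALSE),
  character = time_setkey(n, as_char = TRUE)
)
print(timings)
```

Running this under each data.table version would show whether the integer case degrades in the same way as the character case.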
@oliviermeslin thanks for pointing that out; then it is not strictly a duplicate. On Windows it is generally trickier because such problems are not that easy to reproduce.
@shrektan : I re-ran all my tests on the new Windows 10 server, comparing several R versions. I confirm your finding: the performance problem of setkey is not reproducible with R 4.0.2, but is present for R 3.3.3 and R 3.6.3. Maybe this can help to figure out where the problem comes from.
R version | data.table version | user time | system time | elapsed time
:--: | -- | --: | --: | --:
3.3.3 | 1.10.4.3 | 6.83 | 0.09 | 6.80
3.6.3 | 1.10.4.3 | 9.70 | 0.13 | 9.67
4.0.2 | 1.10.4.3 | 8.10 | 0.11 | 8.08
3.3.3 | 1.11.8 | 6.97 | 0.11 | 6.94
3.6.3 | 1.11.8 | 10.08 | 0.08 | 9.99
4.0.2 | 1.11.8 | 8.03 | 0.11 | 8.00
3.3.3 | 1.12.0 | 10.31 | 14.41 | 66.55
3.6.3 | 1.12.0 | 12.92 | 13.25 | 82.96
4.0.2 | 1.12.0 | 8.97 | 4.33 | 8.22
3.3.3 | 1.13.0 | 9.19 | 9.79 | 68.68
3.6.3 | 1.13.0 | 8.43 | 7.61 | 66.22
4.0.2 | 1.13.0 | 7.09 | 0.75 | 6.95
3.3.3 | 1.13.2 | 11.78 | 20.98 | 69.75
3.6.3 | 1.13.2 | 12.50 | 20.18 | 66.33
4.0.2 | 1.13.2 | 7.41 | 0.64 | 7.17
This is amazing documentation. Regarding character vs. integer, is there profiling of an integer column only that shows performance degradation? The timings seemed based on as.character(sample(5e6L, 5e6L, FALSE)). Note, I'd propose maybe closing the other similar issues; this is pretty definitive.
Also... since 4.0.2 addresses this, are issues ever closed by new versions of R?
About the version: since we depend on R 3.1, if we can identify a root-cause fix that we can make on our side, we should do it. My guess is that such fixes should usually translate to performance improvements at HEAD as well. That said, prioritization is harder.
I think generally users looking for the best performance should be using recent R and recent data.table (and when that's not true it's a priority to fix/mitigate if there was some explicit tradeoff made). If indeed we can attribute it to R specifically, we can probably move on; it comes back to striving to understand the root cause.
Just my 2 cents
I just want to flag/remind that this kind of performance regression may be hard to reproduce
Interestingly, I just reproduced it on R 3.6.1, and I double-checked that it's not reproducible on R 4.0.1.
R3.6.1
Rversion dt_version user.self sys.self elapsed
1: 3.6.1 1.11.8 6.39 0.28 6.98
2: 3.6.1 1.13.2 9.33 1.94 11.84
R4.0.1
Rversion dt_version user.self sys.self elapsed
1: 4.0.1 1.11.8 6.53 0.63 7.78
2: 4.0.1 1.13.2 6.23 0.33 7.34
In my experience, performance of R code can vary considerably from one machine to another. Differences can be observed not just in absolute run time (as expected) but even in the relative performance. For example, I have one Windows computer in which a particular inner join is 5 times as fast using data.table over dplyr::inner_join and another Windows computer in which it is twice as slow! (So much so that I actually switch the method based on the value of Sys.getenv("COMPUTERNAME")!)
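The kind of switch described here might look like the sketch below. It is an assumption about the approach, not the code actually used: the machine name `OFFICE-PC-01` and the helper `pick_join_backend` are hypothetical, and only the choice of backend is shown, not the join itself.

```r
# Hypothetical machine names; replace with the hosts where data.table is slower.
machines_preferring_dplyr <- c("OFFICE-PC-01")

# Choose the join backend from the machine name.
# COMPUTERNAME is normally set on Windows; when it is unset or unknown,
# data.table is used as the default.
pick_join_backend <- function(computer = Sys.getenv("COMPUTERNAME")) {
  if (computer %in% machines_preferring_dplyr) "dplyr" else "data.table"
}

backend <- pick_join_backend()
print(backend)
```

The returned label could then drive a `switch()` between `merge()` on data.tables and `dplyr::inner_join()`.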
I would keep the other issues open and close them once a fix is ready and we have tested the exact code examples there.