allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))

DoSomething <- function(row) {
  someCalculation <- row[["v1"]] + 1
}

system.time({
  for (r in 1:nrow(allIterations)) {
    DoSomething(allIterations[r, ])
  }
})
## user system elapsed
## 4.50 0.02 4.55
library(data.table)
allIterations <- as.data.table(allIterations)

system.time({
  for (r in 1:nrow(allIterations)) {
    DoSomething(allIterations[r, ])
  }
})
## user system elapsed
## 53.78 25.05 78.46
I'm working on an R project that involves applying fairly complicated functions across a data.table or data.frame by rows.
In cases where vectorizing is not a good option, one may need to loop through the rows, and that's when I realized that selecting by row number from a data.table is actually much slower than from a data.frame, as the timings above show.
I guess selecting by row number is not a recommended practice for data.table? Or would the team be interested in looking into this and optimizing the performance?
I have more details about my test here.
The main reason is not whether you use a row number to select the rows. It's that the loop invokes data.table's [ method too many times. data.table is fast due to internal optimization, which comes with a per-call cost: a [ call on a data.table does much more work (query optimization, checks, etc.) than on a data.frame. In this looping case, all of that optimization effort is wasted.
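To see that fixed per-call cost in isolation, here is a minimal sketch; it assumes the microbenchmark package is installed, df0 and dt0 are throwaway objects for this sketch, and the exact numbers will vary by machine:

library(data.table)
library(microbenchmark)
df0 <- data.frame(v1 = runif(10), v2 = runif(10))
dt0 <- as.data.table(df0)
microbenchmark(
  data.frame = df0[1L, ],  # thin S3 method: little work per call
  data.table = dt0[1L, ],  # full query path: NSE handling, checks, optimization
  times = 1000L
)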
If looping over all the rows is unavoidable, I suggest using purrr::pmap().
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) {
  row$v1 + 1
}
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.110s elapsed (0.090s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.510s elapsed (0.470s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.030s elapsed (0.010s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
Created on 2019-07-31 by the reprex package (v0.2.1)
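For readers who prefer to avoid the purrr dependency, the same column-wise iteration can be sketched in base R (assuming the df, cal and res1 objects from the reprex above); like pmap(), it never subsets the table inside the loop:

# iterate over the columns as plain vectors; no [ call per row
res_base <- unlist(Map(function(v1, v2) cal(list(v1 = v1, v2 = v2)), df$v1, df$v2))
stopifnot(all.equal(res_base, res1))  # matches the loop result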
Confirming what @shrektan wrote. Anyway, I think we should be able to speed up such things pretty easily.
When selecting a single row by its integer index, it makes sense to switch to single-threaded mode, so setting setDTthreads(1L) might help. Related issue: #3175.
It is possible to match the performance of data.frame subsetting by integer index, but the routine that does it is not exported.
DF 3.697s
DT 3.579s
library(data.table)
set.seed(108)
n = 1e5
df = data.frame(v1 = runif(n), v2 = runif(n))
dt1 = data.table(v1 = runif(n), v2 = runif(n))
dt2 = data.table(v1 = runif(n), v2 = runif(n))
# frow subsets rows of a data.table by integer index while bypassing all of
# [.data.table's NSE processing, by calling the unexported C routine directly.
frow = function(x, irows, safe=FALSE) {
  stopifnot(is.data.table(x), is.integer(irows), length(irows)>0L, is.logical(safe), length(safe)==1L, !is.na(safe))
  if (safe) stopifnot(all(between(irows, 1L, nrow(x))))  # optional bounds check, costs extra time
  .Call(data.table:::CsubsetDT, x, irows, seq_along(x))
}
do = function(row) row[["v1"]]+1
system.time(for (r in 1:n) do(df[r, ]))
# user system elapsed
# 3.693 0.003 3.697
setDTthreads(4L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 73.497 0.299 19.205
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 21.125 0.128 5.488
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 28.016 0.179 7.294
setDTthreads(1L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 12.619 0.128 12.749
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 3.538 0.040 3.579
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 4.923 0.088 5.012
It could be handled transparently internally, but that requires a bit of a rewrite of [.data.table, because the i argument can take various forms, and the NSE processing makes it harder to detect the input type early and optimise.
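For illustration, here are a few of the forms i can take; all of them enter the same [.data.table dispatch and must be disambiguated at run time (a non-exhaustive sketch, where x is a throwaway keyed table):

x <- data.table(id = 1:5, v1 = runif(5), key = "id")
x[3L]                                 # integer row index
x[c(TRUE, TRUE, FALSE, FALSE, TRUE)]  # logical vector over the rows
x[v1 > 0.5]                           # expression evaluated within the table's columns
x[.(4L)]                              # join on the key column
x[order(-v1)]                         # order() in i is optimised internally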
Some progress towards this issue has been made in #4484, but the overhead of [.data.table is still significant. I measured the time spent in [.data.table's internals and tried to skip as much extra code as possible, but the speed-up I was able to get was only around 13%. I am not sure we want another extra escape branch just for a 13% gain.
To address this issue fully, we either have to:
- provide a way to turn off NSE for the i argument, so it behaves like data.frame's i arg. We already provide that for the j argument via with, so this could be made using with=c(i=FALSE, j=TRUE), which is #3736.
- rewrite [.data.table.

Just promoting the idea - using by = 1:nrow(dt) solves this issue as well and is actually the fastest of the presented options.
Also, @chnynf, are you on Windows? Your high system.time values match my experience on Windows.
library(data.table) ##1.12.8
setDTthreads(1L)
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) row$v1 + 1
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.050s elapsed (0.030s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.240s elapsed (0.210s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.060s elapsed (0.040s cpu)"
t <- proc.time()
res4 <- dt[, cal(.SD), by = 1:nrow(dt)]$V1
data.table::timetaken(t)
#> [1] "0.010s elapsed (0.000s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
all.equal(res4, res1)
#> [1] TRUE
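Why this is fast: by = 1:nrow(dt) makes a single [.data.table call and invokes cal() once per one-row group internally, so the per-call overhead of [ is paid only once. For this toy cal(), a fully vectorized dt[, v1 + 1] would of course be faster still, but the point is row-wise application of a function that cannot be vectorized. A minimal variant (assuming the same dt, cal and res1 from above) that assigns by reference into a named column instead of extracting V1:

dt[, res := cal(.SD), by = seq_len(nrow(dt))]  # seq_len() is also safe for zero-row tables
stopifnot(all.equal(dt$res, res1))             # same values as the loop result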
Yes, the test was on Windows. I tried your approach on my Windows machine and it is much faster.
Thank you guys for working on this!