allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))

DoSomething <- function(row) {
  someCalculation <- row[["v1"]] + 1
}

system.time({
  for (r in 1:nrow(allIterations)) {
    DoSomething(allIterations[r, ])
  }
})
## user system elapsed
## 4.50 0.02 4.55
library(data.table)
allIterations <- as.data.table(allIterations)

system.time({
  for (r in 1:nrow(allIterations)) {
    DoSomething(allIterations[r, ])
  }
})
## user system elapsed
## 53.78 25.05 78.46
I'm working on an R project that involves applying fairly complicated functions across a data.table or data.frame by rows.
In cases where vectorizing is not a good option, one may need to loop through the rows, and that's when I realized that selecting by row number from a data.table is actually much slower than from a data.frame, as the timings above show.
I guess selecting by row number is not a recommended practice for data.table? Or would the team be interested in looking into this and optimizing the performance?
I have more details about my test here.
The main reason is not whether you use a row number to select the rows. It's that the loop invokes data.table's [ method too many times. data.table is fast due to internal optimization, which comes with a per-call cost: a [ call on a data.table does much more work (query optimization, checks, etc.) than on a data.frame. In this looping case, all of that optimization effort is wasted.
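To see that fixed per-call cost in isolation, here is a minimal sketch; it assumes the microbenchmark package is installed, df0 and dt0 are throwaway objects for this sketch, and the exact numbers will vary by machine:

library(data.table)
library(microbenchmark)
df0 <- data.frame(v1 = runif(10), v2 = runif(10))
dt0 <- as.data.table(df0)
microbenchmark(
  data.frame = df0[1L, ],  # thin S3 method: little work per call
  data.table = dt0[1L, ],  # full query path: NSE handling, checks, optimization
  times = 1000L
)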
If looping over all the rows is unavoidable, I suggest using purrr::pmap().
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) {
  row$v1 + 1
}
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.110s elapsed (0.090s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.510s elapsed (0.470s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.030s elapsed (0.010s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
Created on 2019-07-31 by the reprex package (v0.2.1)
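For readers who prefer to avoid the purrr dependency, the same column-wise iteration can be sketched in base R (assuming the df, cal and res1 objects from the reprex above); like pmap(), it never subsets the table inside the loop:

# iterate over the columns as plain vectors; no [ call per row
res_base <- unlist(Map(function(v1, v2) cal(list(v1 = v1, v2 = v2)), df$v1, df$v2))
stopifnot(all.equal(res_base, res1))  # matches the loop result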
Confirming what @shrektan wrote. Anyway, I think we should be able to speed up such things pretty easily.
When selecting a single row by its integer index, it makes sense to switch to single-threaded mode, so setting setDTthreads(1L) might help. Related issue: #3175.
It is possible to match the performance of data.frame subsetting by integer index, but the routine that does it is not exported.
DF 3.697s
DT 3.579s
library(data.table)
set.seed(108)
n = 1e5
df = data.frame(v1 = runif(n), v2 = runif(n))
dt1 = data.table(v1 = runif(n), v2 = runif(n))
dt2 = data.table(v1 = runif(n), v2 = runif(n))
# frow subsets rows of a data.table by integer index while bypassing all of
# [.data.table's NSE processing, by calling the unexported C routine directly.
frow = function(x, irows, safe=FALSE) {
  stopifnot(is.data.table(x), is.integer(irows), length(irows)>0L, is.logical(safe), length(safe)==1L, !is.na(safe))
  if (safe) stopifnot(all(between(irows, 1L, nrow(x))))  # optional bounds check, costs extra time
  .Call(data.table:::CsubsetDT, x, irows, seq_along(x))
}
do = function(row) row[["v1"]]+1
system.time(for (r in 1:n) do(df[r, ]))
# user system elapsed
# 3.693 0.003 3.697
setDTthreads(4L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 73.497 0.299 19.205
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 21.125 0.128 5.488
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 28.016 0.179 7.294
setDTthreads(1L)
system.time(for (r in 1:n) do(dt1[r, ]))
# user system elapsed
# 12.619 0.128 12.749
system.time(for (r in 1:n) do(frow(dt2, r)))
# user system elapsed
# 3.538 0.040 3.579
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
# user system elapsed
# 4.923 0.088 5.012
It could be handled transparently internally, but that requires a bit of a rewrite of [.data.table, because the i argument can take various forms, and the NSE processing makes it harder to detect the input type early and optimise.
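For illustration, here are a few of the forms i can take; all of them enter the same [.data.table dispatch and must be disambiguated at run time (a non-exhaustive sketch, where x is a throwaway keyed table):

x <- data.table(id = 1:5, v1 = runif(5), key = "id")
x[3L]                                 # integer row index
x[c(TRUE, TRUE, FALSE, FALSE, TRUE)]  # logical vector over the rows
x[v1 > 0.5]                           # expression evaluated within the table's columns
x[.(4L)]                              # join on the key column
x[order(-v1)]                         # order() in i is optimised internally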
Some progress towards this issue has been made in #4484, but the overhead of [.data.table is still significant. I measured the time spent in [.data.table's internals and tried to skip as much extra code as possible, but the speed-up I was able to get was only around 13%. I am not sure we want another extra escape branch just for a 13% gain.
To address this issue fully, we either have to:
- provide a way to turn off NSE for the i argument, so it behaves like data.frame's i arg. We already provide that for the j argument via with, so this could be made using with=c(i=FALSE, j=TRUE), which is #3736.
- rewrite [.data.table.

Just promoting the idea - using by = 1:nrow(dt) solves this issue as well and is actually the fastest of the presented options.
Also, @chnynf, are you on Windows? Your high system.time values match my experience on Windows.
library(data.table) ##1.12.8
setDTthreads(1L)
df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) row$v1 + 1
res1 <- res2 <- res3 <- res4 <- double(nrow(df))
t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.050s elapsed (0.030s cpu)"
dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.240s elapsed (0.210s cpu)"
t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.060s elapsed (0.040s cpu)"
t <- proc.time()
res4 <- dt[, cal(.SD), by = 1:nrow(dt)]$V1
data.table::timetaken(t)
#> [1] "0.010s elapsed (0.000s cpu)"
all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
all.equal(res4, res1)
#> [1] TRUE
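Why this is fast: by = 1:nrow(dt) makes a single [.data.table call and invokes cal() once per one-row group internally, so the per-call overhead of [ is paid only once. For this toy cal(), a fully vectorized dt[, v1 + 1] would of course be faster still, but the point is row-wise application of a function that cannot be vectorized. A minimal variant (assuming the same dt, cal and res1 from above) that assigns by reference into a named column instead of extracting V1:

dt[, res := cal(.SD), by = seq_len(nrow(dt))]  # seq_len() is also safe for zero-row tables
stopifnot(all.equal(dt$res, res1))             # same values as the loop result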
Yes, the test was on Windows. I tried your approach on my Windows machine and it is much faster.
Thank you guys for working on this!