Data.table: Bug report - RSession Hangs

Created on 17 Dec 2015 · 5 Comments · Source: Rdatatable/data.table

Rsession hangs

When a data.table with a large number of columns is queried using .SD, the query first takes much longer than creating the DT itself (from about a minute to nearly 10 minutes). After a while, R then keeps running in the background for a long period (5-10 minutes) even though no command has been issued: Activity Monitor shows the rsession process at 100% CPU and RStudio is unresponsive. Note that the R library is in a custom folder, and the hang happens more often when many queries are run on DT. I tried turning off auto-indexing with options(datatable.auto.index=FALSE), to no avail.
Using the latest versions of RStudio (0.99.489), R (3.2.3), and data.table (1.9.6) under OS X 10.9.5 (Mavericks) on x86_64-apple-darwin13.4.0 (64-bit).
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] data.table_1.9.6 microbenchmark_1.4-2.1
loaded via a namespace (and not attached): Rcpp_0.12.2 digest_0.6.8 MASS_7.3-45 chron_2.3-47 grid_3.2.3 plyr_1.8.3 gtable_0.1.2 magrittr_1.5 scales_0.3.0 ggplot2_1.0.1 stringi_1.0-1 reshape2_1.4.1 proto_0.3-10 tools_3.2.3 stringr_1.0.0 munsell_0.4.2 colorspace_1.2-6

.libPaths('~/Dropbox/R_packages/library/')
require(data.table)
require(microbenchmark)
mt = rep(rownames(mtcars)[1:25], 20)
st = rep(state.name, 10)
system.time(
  DT <- data.table(mt=mt, st=st,
                   matrix(sample(1:(30000L*500), 30000*500, replace=TRUE), nrow=500, ncol=30000),
                   key='mt')
) # 4-5 secs
# 67 to 497 secs - slow, because every column is copied to .SD
system.time(DT[, .SD, by=st][, .(mt, st, V2, V3, V4)])
# 12.9 ns median - fast, because .SD contains only .SDcols
microbenchmark(DT[, .SD, by=st, .SDcols=c('V2','V3','V4')])
res <- microbenchmark(
  DT[, lapply(.SD, median), by=.(st,mt), .SDcols=c('V2','V3','V4')], # 19.7 median
  DT[, lapply(.(V2,V3,V4), median), by=.(st,mt)]                     # 21.4 median, significantly higher
)
boxplot(res, notch=TRUE)
# this command can cause the hang; worse when many other queries are done on DT
setkey(DT, mt, st)
performance

All 5 comments

I cannot reproduce the _hang_ on Ubuntu with R 3.2.3 and data.table 1.9.7.
Regarding your code: the second expression, the slower one, can also be written as DT[,list(V2=median(V2),V3=median(V3),V4=median(V4)),by=.(st,mt)]; it will be faster.
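
For anyone who wants to check that these spellings agree before comparing their speed, here is a minimal sketch on a scaled-down table. It assumes data.table is installed; the table `small` and its three value columns are made-up stand-ins for the 30,000-column DT in the report, so timings from it say nothing about the original problem.

```r
library(data.table)

set.seed(1)
# Scaled-down stand-in for DT: same grouping columns, only three value columns.
mt <- rep(rownames(mtcars)[1:25], 20)
st <- rep(state.name, 10)
small <- data.table(mt = mt, st = st,
                    matrix(sample(1500L), nrow = 500, ncol = 3))

cols <- c("V1", "V2", "V3")

# 1. .SD restricted to the columns of interest via .SDcols
a <- small[, lapply(.SD, median), by = .(st, mt), .SDcols = cols]

# 2. lapply over an explicit (unnamed) list of columns
b <- small[, lapply(list(V1, V2, V3), median), by = .(st, mt)]

# 3. explicit named list, as suggested in the comment above
c3 <- small[, list(V1 = median(V1), V2 = median(V2), V3 = median(V3)),
            by = .(st, mt)]

# Forms 1 and 3 should be the same table, column names included.
print(all.equal(a, c3))
```

Form 2 returns the same values, but because the list is unnamed its result columns get default names, so it is safer to compare it by value than by name.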

Hm, the performance hit seems to be due to optimising .SD -- see https://github.com/Rdatatable/data.table/issues/735.

Seems like having a lot of columns in j as list(col1, col2, ...) takes a big hit. @mattdowle, thoughts?

options(datatable.optimize=0L) # without optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#   0.481   0.012   0.502
options(datatable.optimize=Inf) # with optimisation
system.time(DT[,.SD,by=st])
#    user  system elapsed 
#  53.125   8.002  61.784 
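
The comparison above changes a global option and leaves it changed. A small wrapper that restores the previous value may be handy when repeating the test; this is only a sketch, `time_with_optimize` is a hypothetical helper (not part of data.table), and the 2,000-column table is an arbitrary smaller stand-in, so the timings will differ from the ones quoted and may show no gap at all on newer versions.

```r
library(data.table)

# Hypothetical helper: evaluate `expr` with datatable.optimize set to
# `level`, then restore whatever value the option had before.
time_with_optimize <- function(expr, level) {
  old <- options(datatable.optimize = level)
  on.exit(options(old), add = TRUE)
  # `expr` is a promise, so the query only runs here, after the option is set.
  system.time(force(expr))
}

set.seed(1)
# Smaller stand-in for DT: 500 rows, 2000 value columns.
wide <- data.table(st = rep(state.name, 10),
                   matrix(sample.int(500L * 2000L), nrow = 500, ncol = 2000))

t_off <- time_with_optimize(wide[, .SD, by = st], level = 0L)   # without optimisation
t_on  <- time_with_optimize(wide[, .SD, by = st], level = Inf)  # with optimisation

print(t_off)
print(t_on)
```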

Can't reproduce the session hang.

@Jorges1000 is this still a problem in the latest releases?

Not sure this line is working as expected with verbose=TRUE:

deparse(jsub,width.cutoff=200L)

width.cutoff is _per-line_ but there are hundreds of lines for this jsub. Should add nline=1L as well?
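
To illustrate the point with base R only: the `jsub` below is a made-up stand-in (a `list()` call over 300 column names), not the expression data.table actually builds. The full name of the `deparse` argument is `nlines`; the `nline` spelling in the comment resolves to it through R's partial argument matching.

```r
# Made-up stand-in for jsub: the call list(V1, V2, ..., V300).
jsub <- as.call(c(as.name("list"), lapply(paste0("V", 1:300), as.name)))

# width.cutoff only limits the width of each line, so a long call
# still comes back as several strings.
pieces <- deparse(jsub, width.cutoff = 200L)
print(length(pieces))

# nlines caps how many lines are returned.
first <- deparse(jsub, width.cutoff = 200L, nlines = 1L)
print(length(first))

# Alternative: keep the whole expression, collapsed into one string.
whole <- paste(deparse(jsub, width.cutoff = 500L), collapse = " ")
print(nchar(whole))
```

Truncating with `nlines = 1L` keeps verbose output short at the cost of showing only the start of the expression; collapsing keeps everything but can produce a very long line.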

Looks like it might be dotN():

> mt = rep(rownames(mtcars)[1:25],20)
> st = rep(state.name,10)
> DT = data.table(mt=mt, st=st, matrix(sample(1:(30000L*500),30000*500,replace=T),
       nrow=500,ncol=30000), key='mt')
> options(datatable.optimize=0L)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
  0.512   0.012   0.367 
> options(datatable.optimize=Inf)
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 25.083   3.157  28.107 
> Rprof()
> system.time(DT[,.SD,by=st])
   user  system elapsed 
 24.321   2.708  26.897 
> Rprof(NULL)
> summaryRprof()
$by.self
               self.time self.pct total.time total.pct
"[.data.table"     13.88    51.26      27.02     99.78
"dotN"             13.12    48.45      13.12     48.45
"gc"                0.06     0.22       0.06      0.22
"c"                 0.02     0.07       0.02      0.07

As @MichaelChirico commented, that call to dotN() is redundant.
