Data.table: Efficiently remove rows

Created on 18 May 2018  ·  1 comment  ·  Source: Rdatatable/data.table

Hey guys, I am struggling with removing rows from a data.table in an efficient way. In SQL I have `DELETE FROM table_name WHERE condition`, but with data.table I only find examples of people doing `dt <- dt[condition]`, i.e. subsetting to the rows that should stay. This seems to be about two times slower than a base R data.frame...


Example:

```R
library(microbenchmark)
library(data.table)

dt <- data.table(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
df <- data.frame(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))

microbenchmark(dt <- dt[id != "a"], df <- df[df$id != "a", ])
```

```R
Unit: microseconds
                     expr     min       lq     mean  median      uq      max neval
      dt <- dt[id != "a"] 222.561 243.5625 374.6974 280.055 351.798 7184.560   100
 df <- df[df$id != "a", ]  85.059 108.7100 151.5715 127.215 173.659 1339.389   100
```

(borrowed and adjusted from here: https://stackoverflow.com/a/22655130/3494126)
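
As a side note, data.table can also express this removal as an anti-join on the filter column, which locates the matching value by a join on `id` rather than evaluating the condition across the whole column. A minimal, self-contained sketch:

```R
library(data.table)

dt <- data.table(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))

# Anti-join subset: keep every row whose id does NOT match "a".
# With `on = "id"`, the value "a" is located by a join on that column
# instead of computing id != "a" for all rows.
dt <- dt[!"a", on = "id"]
```

On keyed or indexed data this can be faster than the vector scan, but it is worth benchmarking at the sizes you actually care about.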


---------------

```R
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4-4 bindrcpp_0.2.2       subprocess_0.8.2     htmlwidgets_1.2     
 [5] googlesheets_0.2.2   raster_2.6-7         exifr_0.2.1          magick_1.8          
 [9] bazar_1.0.6          glue_1.2.0           lubridate_1.7.4      matrixStats_0.53.1  
[13] rgdal_1.2-18         sp_1.2-7             leaflet_2.0.0        lwgeom_0.1-4        
[17] sf_0.6-2             pbsat_0.1.17         xml2_1.2.0           data.table_1.10.4-3 
[21] forcats_0.3.0        stringr_1.3.0        dplyr_0.7.4          purrr_0.2.4         
[25] readr_1.1.1          tidyr_0.8.0          tibble_1.4.2         ggplot2_2.2.1.9000  
[29] tidyverse_1.2.1     

loaded via a namespace (and not attached):
 [1] nlme_3.1-137       RColorBrewer_1.1-2 httr_1.3.1         tools_3.4.4        R6_2.2.2          
 [6] spData_0.2.8.3     rgeos_0.3-26       DBI_0.8            lazyeval_0.2.1     colorspace_1.3-2  
[11] withr_2.1.2        gridExtra_2.3      mnormt_1.5-5       curl_3.2           compiler_3.4.4    
[16] cli_1.0.0          rvest_0.3.2        scales_0.5.0.9000  classInt_0.2-3     psych_1.8.3.3     
[21] digest_0.6.15      foreign_0.8-70     pkgconfig_2.0.1    htmltools_0.3.6    rlang_0.2.0.9001  
[26] readxl_1.1.0       rstudioapi_0.7     shiny_1.0.5        bindr_0.1.1        jsonlite_1.5      
[31] crosstalk_1.0.0    magrittr_1.5       kimisc_0.4         Rcpp_0.12.16       munsell_0.4.3     
[36] RPostgreSQL_0.6-2  viridis_0.5.1      stringi_1.1.7      yaml_2.1.18        jqr_1.0.0         
[41] plyr_1.8.4         grid_3.4.4         maptools_0.9-2     promises_1.0.1     crayon_1.3.4      
[46] geojsonio_0.6.0    udunits2_0.13      lattice_0.20-35    haven_1.1.1        geojson_0.2.0     
[51] hms_0.4.2          knitr_1.20         pillar_1.2.1       reshape2_1.4.3     V8_1.5            
[56] modelr_0.1.1       httpuv_1.4.1       cellranger_1.1.0   gtable_0.2.0       assertthat_0.2.0  
[61] mime_0.5           xtable_1.8-2       broom_0.4.4        e1071_1.6-8        later_0.7.1       
 [66] viridisLite_0.3.0  class_7.3-14       memoise_1.1.0      units_0.5-1   
```

Label: duplicate

Most helpful comment

If you run it on bigger data, the timing is reversed, and then some.

```R
library(microbenchmark)
library(data.table)

NN = 1e7
dt <- data.table(id = sample(letters, NN, TRUE), var = rnorm(NN))
df <- data.frame(id = sample(letters, NN, TRUE), var = rnorm(NN))

microbenchmark(dt <- dt[id != "a"], df <- df[df$id != "a", ])
# Unit: milliseconds
#                      expr      min       lq     mean   median        uq       max neval
#       dt <- dt[id != "a"] 148.0570 165.3594 210.1427 191.2038  229.1141  389.5252   100
#  df <- df[df$id != "a", ] 810.7681 877.9762 944.7228 916.4006 1025.6018 1099.8863   100
```

(In fact my timings show data.table faster at `NN = 1e6` as well.)

All of which goes to show that benchmarking with small-millisecond-or-smaller timings is not particularly useful.
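
One way to see this, as a rough sketch rather than a rigorous benchmark: repeat the same comparison across several sizes and watch the ranking flip as `NN` grows.

```R
library(microbenchmark)
library(data.table)

# At small NN the fixed per-call overhead of [.data.table dominates;
# at large NN the filtering work itself does.
for (NN in c(1e4, 1e5, 1e6, 1e7)) {
  dt <- data.table(id = sample(letters, NN, TRUE), var = rnorm(NN))
  df <- data.frame(id = sample(letters, NN, TRUE), var = rnorm(NN))
  cat("NN =", NN, "\n")
  print(microbenchmark(
    dt2 <- dt[id != "a"],
    df2 <- df[df$id != "a", ],
    times = 10L
  ))
}
```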

  1. Please update your data.table; 1.11.2 is on CRAN.
  2. See the benchmarking vignette.
  3. See #635 (your question is a duplicate of it); a sketch of the commonly suggested workaround follows this list.
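
For context, #635 tracks true delete-by-reference for rows, which data.table does not have yet. A workaround commonly suggested in that thread is to rebuild the result one column at a time, releasing each source column as it is copied, so the full original and the full result never coexist in memory. A sketch under those assumptions (`delete_rows` is a hypothetical helper, not a data.table API):

```R
library(data.table)

# Hypothetical helper: copy the kept rows column by column, deleting each
# source column right after it is copied, so peak memory stays roughly one
# column above the size of the result rather than two full tables.
delete_rows <- function(DT, keep_idx) {
  cols <- copy(names(DT))
  out  <- setnames(data.table(DT[[cols[1]]][keep_idx]), cols[1])
  DT[, (cols[1]) := NULL]                  # release the original column
  for (col in cols[-1]) {
    out[, (col) := DT[[col]][keep_idx]]
    DT[, (col) := NULL]
  }
  out
}

dt <- data.table(id = sample(letters, 1e6, TRUE), var = rnorm(1e6))
keep <- dt[id != "a", which = TRUE]        # row numbers to keep
dt <- delete_rows(dt, keep)
```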
