Hey guys, I am struggling to remove rows from a data.table efficiently. In SQL I have `DELETE FROM table_name WHERE condition`, but with data.table the only examples I can find do `dt <- dt[condition]`. This seems to be about two times slower than base R's data.frame...
Example:
```R
library(microbenchmark)
library(data.table)
dt <- data.table(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
df <- data.frame(id = sample(letters, 1000000, TRUE), var = rnorm(1000000))
microbenchmark(dt <- dt[id != "a"], df <- df[df$id != "a", ])
```
```R
Unit: microseconds
                     expr     min       lq     mean  median      uq      max neval
      dt <- dt[id != "a"] 222.561 243.5625 374.6974 280.055 351.798 7184.560   100
 df <- df[df$id != "a", ]  85.059 108.7100 151.5715 127.215 173.659 1339.389   100
```

(borrowed and adjusted from here: https://stackoverflow.com/a/22655130/3494126)

---------------

```R
> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.4 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4-4 bindrcpp_0.2.2 subprocess_0.8.2 htmlwidgets_1.2
[5] googlesheets_0.2.2 raster_2.6-7 exifr_0.2.1 magick_1.8
[9] bazar_1.0.6 glue_1.2.0 lubridate_1.7.4 matrixStats_0.53.1
[13] rgdal_1.2-18 sp_1.2-7 leaflet_2.0.0 lwgeom_0.1-4
[17] sf_0.6-2 pbsat_0.1.17 xml2_1.2.0 data.table_1.10.4-3
[21] forcats_0.3.0 stringr_1.3.0 dplyr_0.7.4 purrr_0.2.4
[25] readr_1.1.1 tidyr_0.8.0 tibble_1.4.2 ggplot2_2.2.1.9000
[29] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] nlme_3.1-137 RColorBrewer_1.1-2 httr_1.3.1 tools_3.4.4 R6_2.2.2
[6] spData_0.2.8.3 rgeos_0.3-26 DBI_0.8 lazyeval_0.2.1 colorspace_1.3-2
[11] withr_2.1.2 gridExtra_2.3 mnormt_1.5-5 curl_3.2 compiler_3.4.4
[16] cli_1.0.0 rvest_0.3.2 scales_0.5.0.9000 classInt_0.2-3 psych_1.8.3.3
[21] digest_0.6.15 foreign_0.8-70 pkgconfig_2.0.1 htmltools_0.3.6 rlang_0.2.0.9001
[26] readxl_1.1.0 rstudioapi_0.7 shiny_1.0.5 bindr_0.1.1 jsonlite_1.5
[31] crosstalk_1.0.0 magrittr_1.5 kimisc_0.4 Rcpp_0.12.16 munsell_0.4.3
[36] RPostgreSQL_0.6-2 viridis_0.5.1 stringi_1.1.7 yaml_2.1.18 jqr_1.0.0
[41] plyr_1.8.4 grid_3.4.4 maptools_0.9-2 promises_1.0.1 crayon_1.3.4
[46] geojsonio_0.6.0 udunits2_0.13 lattice_0.20-35 haven_1.1.1 geojson_0.2.0
[51] hms_0.4.2 knitr_1.20 pillar_1.2.1 reshape2_1.4.3 V8_1.5
[56] modelr_0.1.1 httpuv_1.4.1 cellranger_1.1.0 gtable_0.2.0 assertthat_0.2.0
[61] mime_0.5 xtable_1.8-2 broom_0.4.4 e1071_1.6-8 later_0.7.1
[66] viridisLite_0.3.0 class_7.3-14 memoise_1.1.0 units_0.5-1
```
If you do it on bigger data, the timing is reversed, and by a larger margin.
```R
library(microbenchmark)
library(data.table)
NN = 1e7
dt <- data.table(id = sample(letters, NN, TRUE), var = rnorm(NN))
df <- data.frame(id = sample(letters, NN, TRUE), var = rnorm(NN))
microbenchmark(dt <- dt[id != "a"], df <- df[df$id != "a", ])
# Unit: milliseconds
#                      expr      min       lq     mean   median        uq       max neval
#       dt <- dt[id != "a"] 148.0570 165.3594 210.1427 191.2038  229.1141  389.5252   100
#  df <- df[df$id != "a", ] 810.7681 877.9762 944.7228 916.4006 1025.6018 1099.8863   100
```
(in fact my timings show data.table faster on NN = 1e6 as well).
All goes to show that benchmarking with small-millisecond-or-smaller timings is not particularly useful.
data.table 1.11.2 is on CRAN.
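As far as I know, data.table does not currently offer true `DELETE`-by-reference for rows, so the idiom really is to assign the subset back. One variant worth knowing is an anti-join on a keyed column, which uses binary search rather than a full vector scan; a minimal sketch (the data is illustrative, mirroring the benchmark above):

```R
library(data.table)

# Illustrative data, same shape as the benchmark above
dt <- data.table(id = sample(letters, 1e6, TRUE), var = rnorm(1e6))

# The usual idiom: vector-scan subset, assigning the result back
dt1 <- dt[id != "a"]

# Alternative: key the table, then anti-join on the key column;
# the lookup uses binary search instead of scanning every row
setkey(dt, id)
dt2 <- dt[!"a"]   # drops every row where id == "a"

stopifnot(nrow(dt1) == nrow(dt2))
```

Note that both approaches allocate a new table and drop the old one; the memory for the removed rows is freed, but nothing is modified in place.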