I found that mutate is much slower (almost two orders of magnitude) for a data.frame with a geometry column:
library(sf)
library(dplyr)
library(microbenchmark)
set.seed(123)
m <- tibble(x = runif(1e5), y = runif(1e5))
m1 <- st_as_sf(m, coords = c("x", "y"))
microbenchmark(
  mutate(m, v = runif(1e5)),
  mutate(m1, v = runif(1e5)),
  times = 10
)
Which gives me
Unit: milliseconds
                         expr        min       lq       mean     median        uq       max neval
  mutate(m, v = runif(1e+05))   3.444282   3.5650   3.695005   3.705459   3.81263   3.89584    10
 mutate(m1, v = runif(1e+05)) 324.657012 326.5985 331.125082 329.085515 333.60428 343.87903    10
Is this expected?
I am using
> sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.5.2 (2018-12-20)
os Ubuntu 18.04.2 LTS
system x86_64, linux-gnu
ui RStudio
language en_US:en
collate en_US.UTF-8
ctype en_US.UTF-8
tz Europe/Berlin
date 2019-04-05
─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.5.2)
bench 1.0.1 2018-06-06 [1] CRAN (R 3.5.2)
class 7.3-15 2019-01-01 [4] CRAN (R 3.5.2)
classInt 0.3-1 2018-12-18 [1] CRAN (R 3.5.2)
cli 1.1.0 2019-03-19 [1] CRAN (R 3.5.2)
crayon 1.3.4 2017-09-16 [1] CRAN (R 3.5.1)
DBI 1.0.0 2018-05-02 [1] CRAN (R 3.5.1)
digest 0.6.18 2018-10-10 [1] CRAN (R 3.5.1)
dplyr * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
e1071 1.7-1 2019-03-19 [1] CRAN (R 3.5.2)
evaluate 0.13 2019-02-12 [1] CRAN (R 3.5.2)
glue 1.3.1 2019-03-12 [1] CRAN (R 3.5.2)
htmltools 0.3.6 2017-04-28 [1] CRAN (R 3.5.1)
knitr 1.22 2019-03-08 [1] CRAN (R 3.5.2)
magrittr 1.5 2014-11-22 [1] CRAN (R 3.5.1)
microbenchmark * 1.4-6 2018-10-18 [1] CRAN (R 3.5.1)
pillar 1.3.1 2018-12-15 [1] CRAN (R 3.5.2)
pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.5.1)
profmem 0.5.0 2018-01-30 [1] CRAN (R 3.5.2)
purrr 0.3.2 2019-03-15 [1] CRAN (R 3.5.2)
R6 2.4.0 2019-02-14 [1] CRAN (R 3.5.2)
Rcpp 1.0.1 2019-03-17 [1] CRAN (R 3.5.2)
rlang 0.3.2 2019-03-21 [1] CRAN (R 3.5.2)
rmarkdown 1.12 2019-03-14 [1] CRAN (R 3.5.2)
rstudioapi 0.10 2019-03-19 [1] CRAN (R 3.5.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.5.2)
sf * 0.7-3 2019-02-21 [1] CRAN (R 3.5.2)
tibble 2.1.1 2019-03-16 [1] CRAN (R 3.5.2)
tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.5.1)
units 0.6-2 2018-12-05 [1] CRAN (R 3.5.2)
withr 2.1.2 2018-03-15 [1] CRAN (R 3.5.1)
xfun 0.5 2019-02-20 [1] CRAN (R 3.5.2)
yaml 2.2.0 2018-07-25 [1] CRAN (R 3.5.1)
Hi,
I see the same behaviour with summarise. It is much faster if I drop the geometry column with %>% st_set_geometry(NULL) before running operations that don't involve the geometry.
This might not solve your problem, but it may help others with a similar concern.
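For anyone wanting to try this workaround, here is a minimal sketch of how it could look. The object names (pts, geom, res) and the 1000-point example data are made up for illustration, and the approach assumes the operation in between does not reorder or drop rows, since the geometry is reattached by position.

```r
library(sf)
library(dplyr)

set.seed(123)
# Hypothetical example data: 1000 random points
pts <- st_as_sf(tibble(x = runif(1000), y = runif(1000)), coords = c("x", "y"))

# Keep the geometry aside so it can be reattached afterwards
geom <- st_geometry(pts)

res <- pts %>%
  st_set_geometry(NULL) %>%   # drop the geometry; result is a plain data frame
  mutate(v = runif(n())) %>%  # attribute-only work, no sf overhead
  st_set_geometry(geom)       # reattach the geometry; result is an sf object again
```

This only makes sense for row-preserving operations such as mutate; after a filter or arrange the saved geometry would no longer line up with the rows.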
@jmsigner this should fix the mutate performance - there was indeed a lot of unnecessary overhead going on. @Bakaniko : summarise is another story: it actually does something with the geometries (computing unions) -- your solution for that makes sense.
@Bakaniko See also https://github.com/r-spatial/sf/commit/7ad5d75a2df3e06788e8e9b3fb029fa1d2757441 for a change to summarize that should avoid the need for the st_set_geometry(NULL) workaround.
If I understand correctly, summarize gets a do_union = FALSE option. That's great!
And the potential performance gain is great too. Thanks @dbaston!
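A minimal sketch of how the do_union option might be used, assuming an sf version that includes the linked commit. The grouping column g and the example data are made up for illustration. With do_union = FALSE the geometries in each group are combined rather than unioned, which is where the time saving would be expected to come from.

```r
library(sf)
library(dplyr)

set.seed(123)
# Hypothetical example data: 1000 random points in three groups
pts <- st_as_sf(
  tibble(
    g = sample(letters[1:3], 1000, replace = TRUE),
    x = runif(1000),
    y = runif(1000)
  ),
  coords = c("x", "y")
)

# Summarise per group, combining the geometries instead of computing their union
res <- pts %>%
  group_by(g) %>%
  summarise(n = n(), do_union = FALSE)
```

Note that combining and unioning are not equivalent in general (a union dissolves overlaps and removes duplicates), so this is only a drop-in replacement when the summarised geometry itself does not matter or when a plain collection is acceptable.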