Sf: `mutate` is much slower for `sf` than for `tbl`

Created on 5 Apr 2019  ·  4 Comments  ·  Source: r-spatial/sf

I found that mutate is much slower (almost two orders of magnitude) for a data.frame with a geometry column:

library(sf)
library(dplyr)
library(microbenchmark)

set.seed(123)
m <- tibble(x = runif(1e5), y = runif(1e5))
m1 <- st_as_sf(m, coords = c("x", "y"))

microbenchmark(mutate(m, v = runif(1e5)), 
               mutate(m1, v = runif(1e5)), times = 10)

Which gives me

Unit: milliseconds
                         expr        min       lq       mean     median        uq       max neval
  mutate(m, v = runif(1e+05))   3.444282   3.5650   3.695005   3.705459   3.81263   3.89584    10
 mutate(m1, v = runif(1e+05)) 324.657012 326.5985 331.125082 329.085515 333.60428 343.87903    10

Is this expected?

I am using

> sessioninfo::session_info()
─ Session info ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Ubuntu 18.04.2 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language en_US:en                    
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       Europe/Berlin               
 date     2019-04-05                  

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 package        * version date       lib source        
 assertthat       0.2.1   2019-03-21 [1] CRAN (R 3.5.2)
 bench            1.0.1   2018-06-06 [1] CRAN (R 3.5.2)
 class            7.3-15  2019-01-01 [4] CRAN (R 3.5.2)
 classInt         0.3-1   2018-12-18 [1] CRAN (R 3.5.2)
 cli              1.1.0   2019-03-19 [1] CRAN (R 3.5.2)
 crayon           1.3.4   2017-09-16 [1] CRAN (R 3.5.1)
 DBI              1.0.0   2018-05-02 [1] CRAN (R 3.5.1)
 digest           0.6.18  2018-10-10 [1] CRAN (R 3.5.1)
 dplyr          * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
 e1071            1.7-1   2019-03-19 [1] CRAN (R 3.5.2)
 evaluate         0.13    2019-02-12 [1] CRAN (R 3.5.2)
 glue             1.3.1   2019-03-12 [1] CRAN (R 3.5.2)
 htmltools        0.3.6   2017-04-28 [1] CRAN (R 3.5.1)
 knitr            1.22    2019-03-08 [1] CRAN (R 3.5.2)
 magrittr         1.5     2014-11-22 [1] CRAN (R 3.5.1)
 microbenchmark * 1.4-6   2018-10-18 [1] CRAN (R 3.5.1)
 pillar           1.3.1   2018-12-15 [1] CRAN (R 3.5.2)
 pkgconfig        2.0.2   2018-08-16 [1] CRAN (R 3.5.1)
 profmem          0.5.0   2018-01-30 [1] CRAN (R 3.5.2)
 purrr            0.3.2   2019-03-15 [1] CRAN (R 3.5.2)
 R6               2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
 Rcpp             1.0.1   2019-03-17 [1] CRAN (R 3.5.2)
 rlang            0.3.2   2019-03-21 [1] CRAN (R 3.5.2)
 rmarkdown        1.12    2019-03-14 [1] CRAN (R 3.5.2)
 rstudioapi       0.10    2019-03-19 [1] CRAN (R 3.5.2)
 sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 3.5.2)
 sf             * 0.7-3   2019-02-21 [1] CRAN (R 3.5.2)
 tibble           2.1.1   2019-03-16 [1] CRAN (R 3.5.2)
 tidyselect       0.2.5   2018-10-11 [1] CRAN (R 3.5.1)
 units            0.6-2   2018-12-05 [1] CRAN (R 3.5.2)
 withr            2.1.2   2018-03-15 [1] CRAN (R 3.5.1)
 xfun             0.5     2019-02-20 [1] CRAN (R 3.5.2)
 yaml             2.2.0   2018-07-25 [1] CRAN (R 3.5.1)

All 4 comments

Hi,

I see the same behaviour for summaries. Operations on the data that don't involve the geometry are much faster if I first drop the geometry column with `%>% st_set_geometry(NULL)`.

Might not be the solution to your problem but can help others with a similar concern.
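
For readers who want to try the workaround, here is a minimal sketch. The point data and the grouping column `g` are made up for illustration and are not from the original report; the only sf call that matters is `st_set_geometry(NULL)`, which turns the object back into a plain tibble before dplyr touches it.

```r
library(sf)
library(dplyr)

set.seed(123)

# Hypothetical example data: 1000 random points with a grouping column `g`
pts <- st_as_sf(
  tibble(
    x = runif(1000),
    y = runif(1000),
    g = sample(letters[1:3], 1000, replace = TRUE)
  ),
  coords = c("x", "y")
)

# Drop the geometry column first, so summarise() works on a plain tibble
# and never has to touch the geometries
res <- pts %>%
  st_set_geometry(NULL) %>%
  group_by(g) %>%
  summarise(n = n())
```

The result is no longer an `sf` object, so this only fits when the geometry is not needed afterwards.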

@jmsigner this should fix the mutate performance; there was indeed a lot of unnecessary overhead going on. @Bakaniko: summarise is another story. It actually does something with the geometries (it computes unions), so your workaround makes sense there.

@Bakaniko See also https://github.com/r-spatial/sf/commit/7ad5d75a2df3e06788e8e9b3fb029fa1d2757441 for a change to summarize that should avoid the need for the st_set_geometry(NULL) workaround.

If I understand correctly, summarize gets a `do_union = FALSE` option. That's great!
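
A minimal sketch of how that option would be used, assuming an sf version that includes the linked change. The point data and the grouping column `g` are hypothetical. With `do_union = FALSE` the geometries in each group are combined rather than unioned, so the result keeps a geometry column.

```r
library(sf)
library(dplyr)

set.seed(123)

# Hypothetical example data: 1000 random points with a grouping column `g`
pts <- st_as_sf(
  tibble(
    x = runif(1000),
    y = runif(1000),
    g = sample(letters[1:3], 1000, replace = TRUE)
  ),
  coords = c("x", "y")
)

# do_union = FALSE skips the union step: each group's geometries are
# combined into one geometry instead of being unioned
res <- pts %>%
  group_by(g) %>%
  summarise(n = n(), do_union = FALSE)
```

Unlike the `st_set_geometry(NULL)` route, `res` is still an `sf` object here, with one combined geometry per group.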

And the potential performance gain is great too! Thanks @dbaston!
