sf: st_centroid(..., of_largest_polygon = TRUE) doesn't scale well

Created on 22 Jan 2018 · 5 Comments · Source: r-spatial/sf

st_centroid(..., of_largest_polygon = TRUE) becomes prohibitively slow on medium-sized to large sf objects.

Here are the benchmark results for an object with 10K rows:

  • When of_largest_polygon = FALSE:
    ##   elapsed user.self sys.self
    ## 1    1.09      1.08     0.02
  • When of_largest_polygon = TRUE:
    ##   elapsed user.self sys.self
    ## 1  275.61    275.06     0.11

Can this function be refactored to reduce (or eliminate) this slowness?

Reprex + Session info

library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(tidyverse) 
library(rbenchmark) 

demo(nc, ask = FALSE, echo = FALSE)  

nc <- st_transform(nc, 32119)

nc_1e4 <- list(nc) %>% rep(times = 1e2) %>% reduce(rbind)

cols <- c("elapsed", "user.self", "sys.self")

#  of_largest_polygon = FALSE
benchmark(
  {
    st_centroid(nc_1e4, of_largest_polygon = FALSE)
  },
  replications = 5, columns = cols
)
##   elapsed user.self sys.self
## 1    1.09      1.08     0.02

#  of_largest_polygon = TRUE
benchmark(
  {
    st_centroid(nc_1e4, of_largest_polygon = TRUE)
  },
  replications = 5, columns = cols
)
##   elapsed user.self sys.self
## 1  275.61    275.06     0.11

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/Los_Angeles         
##  date     2018-01-22
## Packages -----------------------------------------------------------------
##  package    * version     date       source                            
##  assertthat   0.2.0       2017-04-11 CRAN (R 3.4.2)                    
##  backports    1.1.0       2017-05-22 CRAN (R 3.4.0)                    
##  base       * 3.4.0       2017-04-21 local                             
##  bindr        0.1         2016-11-13 CRAN (R 3.4.2)                    
##  bindrcpp     0.2         2017-06-17 CRAN (R 3.4.2)                    
##  broom        0.4.3       2017-11-20 CRAN (R 3.4.3)                    
##  cellranger   1.1.0       2016-07-27 CRAN (R 3.4.2)                    
##  class        7.3-14      2015-08-30 CRAN (R 3.4.0)                    
##  classInt     0.1-24      2017-04-16 CRAN (R 3.4.2)                    
##  cli          1.0.0       2017-11-05 CRAN (R 3.4.2)                    
##  colorspace   1.3-2       2016-12-14 CRAN (R 3.4.2)                    
##  compiler     3.4.0       2017-04-21 local                             
##  crayon       1.3.4       2017-10-30 Github (r-lib/crayon@b5221ab)     
##  datasets   * 3.4.0       2017-04-21 local                             
##  DBI          0.7         2017-06-18 CRAN (R 3.4.2)                    
##  devtools     1.13.2      2017-06-02 CRAN (R 3.4.0)                    
##  digest       0.6.13      2017-12-14 CRAN (R 3.4.3)                    
##  dplyr      * 0.7.4       2017-09-28 CRAN (R 3.4.2)                    
##  e1071        1.6-8       2017-02-02 CRAN (R 3.4.2)                    
##  evaluate     0.10.1      2017-06-24 CRAN (R 3.4.3)                    
##  forcats    * 0.2.0       2017-01-23 CRAN (R 3.4.3)                    
##  foreign      0.8-67      2016-09-13 CRAN (R 3.4.0)                    
##  ggplot2    * 2.2.1.9000  2017-12-02 Github (tidyverse/ggplot2@7b5c185)
##  glue         1.2.0.9000  2018-01-13 Github (tidyverse/glue@1592ee1)   
##  graphics   * 3.4.0       2017-04-21 local                             
##  grDevices  * 3.4.0       2017-04-21 local                             
##  grid         3.4.0       2017-04-21 local                             
##  gtable       0.2.0       2016-02-26 CRAN (R 3.4.2)                    
##  haven        1.1.0       2017-07-09 CRAN (R 3.4.2)                    
##  hms          0.4.0       2017-11-23 CRAN (R 3.4.3)                    
##  htmltools    0.3.6       2017-04-28 CRAN (R 3.4.0)                    
##  httr         1.3.1       2017-08-20 CRAN (R 3.4.2)                    
##  jsonlite     1.5         2017-06-01 CRAN (R 3.4.0)                    
##  knitr        1.18        2017-12-27 CRAN (R 3.4.3)                    
##  lattice      0.20-35     2017-03-25 CRAN (R 3.4.0)                    
##  lazyeval     0.2.1       2017-10-29 CRAN (R 3.4.2)                    
##  lubridate    1.7.1       2017-11-03 CRAN (R 3.4.2)                    
##  magrittr     1.5         2014-11-22 CRAN (R 3.4.0)                    
##  memoise      1.1.0       2017-04-21 CRAN (R 3.4.0)                    
##  methods    * 3.4.0       2017-04-21 local                             
##  mnormt       1.5-5       2016-10-15 CRAN (R 3.4.1)                    
##  modelr       0.1.1       2017-07-24 CRAN (R 3.4.2)                    
##  munsell      0.4.3       2016-02-13 CRAN (R 3.4.2)                    
##  nlme         3.1-131     2017-02-06 CRAN (R 3.4.0)                    
##  parallel     3.4.0       2017-04-21 local                             
##  pillar       1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)     
##  pkgconfig    2.0.1       2017-03-21 CRAN (R 3.4.2)                    
##  plyr         1.8.4       2016-06-08 CRAN (R 3.4.2)                    
##  psych        1.7.8       2017-09-09 CRAN (R 3.4.2)                    
##  purrr      * 0.2.4.9000  2017-12-05 Github (tidyverse/purrr@62b135a)  
##  R6           2.2.2       2017-06-17 CRAN (R 3.4.0)                    
##  rbenchmark * 1.0.0       2012-08-30 CRAN (R 3.4.1)                    
##  Rcpp         0.12.14     2017-11-23 CRAN (R 3.4.2)                    
##  readr      * 1.1.1       2017-05-16 CRAN (R 3.4.2)                    
##  readxl       1.0.0       2017-04-18 CRAN (R 3.4.2)                    
##  reshape2     1.4.2       2016-10-22 CRAN (R 3.4.2)                    
##  rlang        0.1.6       2017-12-21 CRAN (R 3.4.3)                    
##  rmarkdown    1.8         2017-11-17 CRAN (R 3.4.2)                    
##  rprojroot    1.3-2       2018-01-03 CRAN (R 3.4.3)                    
##  rvest        0.3.2       2016-06-17 CRAN (R 3.4.2)                    
##  scales       0.5.0.9000  2017-12-02 Github (hadley/scales@d767915)    
##  sf         * 0.6-1       2018-01-21 Github (r-spatial/sf@fb52b1e)     
##  stats      * 3.4.0       2017-04-21 local                             
##  stringi      1.1.6       2017-11-17 CRAN (R 3.4.2)                    
##  stringr    * 1.2.0       2017-02-18 CRAN (R 3.4.0)                    
##  tibble     * 1.4.1.9000  2018-01-18 Github (tidyverse/tibble@64fedbd) 
##  tidyr      * 0.7.2.9000  2018-01-13 Github (tidyverse/tidyr@74bd48f)  
##  tidyverse  * 1.2.1       2017-11-14 CRAN (R 3.4.3)                    
##  tools        3.4.0       2017-04-21 local                             
##  udunits2     0.13        2016-11-17 CRAN (R 3.4.1)                    
##  units        0.5-1       2018-01-08 CRAN (R 3.4.3)                    
##  utils      * 3.4.0       2017-04-21 local                             
##  withr        2.1.1.9000  2018-01-13 Github (jimhester/withr@df18523)  
##  xml2         1.1.1       2017-01-24 CRAN (R 3.4.2)                    
##  yaml         2.1.14      2016-11-12 CRAN (R 3.4.0)

All 5 comments

This is like the findings in an spdep vignette (HTML version).

This commit reduces the problem above from 200 to around 15 times as slow, by skipping area calculations for MULTIPOLYGON geometries with only a single outer ring. Not sure what more can be done; profiling didn't enlighten me.
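
The shortcut described above can be sketched in plain R. This is only an illustration of the idea, not the code from the commit: it assumes a multipolygon is held as a list of polygons, each polygon a list of ring matrices (the layout sf uses for MULTIPOLYGON), and the helper names ring_area, polygon_area, largest_polygon and square are hypothetical.

```r
# Shoelace area of one closed ring given as a two-column (x, y) matrix.
ring_area <- function(ring) {
  n <- nrow(ring)
  x <- ring[, 1]
  y <- ring[, 2]
  abs(sum(x[-n] * y[-1] - x[-1] * y[-n])) / 2
}

# Area of one polygon: outer ring minus any holes.
polygon_area <- function(rings) {
  areas <- vapply(rings, ring_area, numeric(1))
  areas[1] - sum(areas[-1])
}

# Return the largest polygon of a multipolygon (a list of polygons).
# Hypothetical helper: no area is computed when there is only one polygon.
largest_polygon <- function(mp) {
  if (length(mp) == 1L) return(mp[[1]])
  areas <- vapply(mp, polygon_area, numeric(1))
  mp[[which.max(areas)]]
}

# Closed square ring with lower-left corner (x0, y0), for demonstration only.
square <- function(x0, y0, side) {
  rbind(c(x0, y0), c(x0 + side, y0), c(x0 + side, y0 + side),
        c(x0, y0 + side), c(x0, y0))
}

single <- list(list(square(0, 0, 1)))                         # one outer ring
multi  <- list(list(square(0, 0, 1)), list(square(5, 5, 3)))  # two polygons

largest_single <- largest_polygon(single)  # taken as-is, no area computed
largest_multi  <- largest_polygon(multi)   # chosen by comparing areas
```

In the nc data most counties presumably consist of a single polygon, so an early return of this kind would avoid the area computation for the bulk of the rows.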

That was clearly premature.

I now see a factor 4.

library(sf)
# Linking to GEOS 3.5.1, GDAL 2.1.2, proj.4 4.9.3
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(tidyverse) 
# ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
# ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4     
# ✔ tibble  1.4.1          ✔ dplyr   0.7.4     
# ✔ tidyr   0.7.2          ✔ stringr 1.2.0     
# ✔ readr   1.1.1          ✔ forcats 0.2.0     
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()
library(rbenchmark) 

demo(nc, ask = FALSE, echo = FALSE)  
# Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.4/sf/gpkg/nc.gpkg' using driver `GPKG'
# Simple feature collection with 100 features and 14 fields
# Attribute-geometry relationship: 0 constant, 8 aggregate, 6 identity
# geometry type:  MULTIPOLYGON
# dimension:      XY
# bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# epsg (SRID):    4267
# proj4string:    +proj=longlat +datum=NAD27 +no_defs
nc <- st_transform(nc, 32119)
nc_1e4 <- list(nc) %>% rep(times = 1e2) %>% reduce(rbind)
cols <- c("elapsed", "user.self", "sys.self")

benchmark( {
    st_centroid(nc_1e4, of_largest_polygon = FALSE)
  }, replications = 5, columns = cols)
#   elapsed user.self sys.self
# 1   0.848      0.84    0.008

benchmark( {
    st_centroid(nc_1e4, of_largest_polygon = TRUE)
  }, replications = 5, columns = cols)
#   elapsed user.self sys.self
# 1   3.226     3.208    0.016

I get the same results on my machine 👍 Excellent work.

Closing the issue for now - as always, thanks for the quick response!
