st_centroid(..., of_largest_polygon = TRUE) becomes prohibitively slow on medium to large sf objects.
Here are the benchmark results for an object with 10K rows (5 replications each):
of_largest_polygon = FALSE:
## elapsed user.self sys.self
## 1 1.09 1.08 0.02
of_largest_polygon = TRUE:
## elapsed user.self sys.self
## 1 275.61 275.06 0.11
Can this function be refactored to reduce (or eliminate) this slowness?
Reprex + Session info
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(tidyverse)
library(rbenchmark)
demo(nc, ask = FALSE, echo = FALSE)
nc <- st_transform(nc, 32119)
nc_1e4 <- list(nc) %>% rep(times = 1e2) %>% reduce(rbind)
cols <- c("elapsed", "user.self", "sys.self")
# of_largest_polygon = FALSE
benchmark(
  {
    st_centroid(nc_1e4, of_largest_polygon = FALSE)
  },
  replications = 5, columns = cols
)
## elapsed user.self sys.self
## 1 1.09 1.08 0.02
# of_largest_polygon = TRUE
benchmark(
  {
    st_centroid(nc_1e4, of_largest_polygon = TRUE)
  },
  replications = 5, columns = cols
)
## elapsed user.self sys.self
## 1 275.61 275.06 0.11
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Los_Angeles
## date 2018-01-22
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.2)
## bindrcpp 0.2 2017-06-17 CRAN (R 3.4.2)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.2)
## class 7.3-14 2015-08-30 CRAN (R 3.4.0)
## classInt 0.1-24 2017-04-16 CRAN (R 3.4.2)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
## compiler 3.4.0 2017-04-21 local
## crayon 1.3.4 2017-10-30 Github (r-lib/crayon@b5221ab)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.2)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.13 2017-12-14 CRAN (R 3.4.3)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## e1071 1.6-8 2017-02-02 CRAN (R 3.4.2)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.3)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.3)
## foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
## ggplot2 * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)
## glue 1.2.0.9000 2018-01-13 Github (tidyverse/glue@1592ee1)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.2)
## hms 0.4.0 2017-11-23 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.2)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.18 2017-12-27 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.2)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.0 2017-04-21 local
## pillar 1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.2)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4.9000 2017-12-05 Github (tidyverse/purrr@62b135a)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## rbenchmark * 1.0.0 2012-08-30 CRAN (R 3.4.1)
## Rcpp 0.12.14 2017-11-23 CRAN (R 3.4.2)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.2)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.2)
## rlang 0.1.6 2017-12-21 CRAN (R 3.4.3)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.2)
## scales 0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)
## sf * 0.6-1 2018-01-21 Github (r-spatial/sf@fb52b1e)
## stats * 3.4.0 2017-04-21 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.4.1.9000 2018-01-18 Github (tidyverse/tibble@64fedbd)
## tidyr * 0.7.2.9000 2018-01-13 Github (tidyverse/tidyr@74bd48f)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.3)
## tools 3.4.0 2017-04-21 local
## udunits2 0.13 2016-11-17 CRAN (R 3.4.1)
## units 0.5-1 2018-01-08 CRAN (R 3.4.3)
## utils * 3.4.0 2017-04-21 local
## withr 2.1.1.9000 2018-01-13 Github (jimhester/withr@df18523)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.2)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
This is like the findings in an spdep vignette (HTML version).
This commit reduces the slowdown reported above from about 200 times to around 15 times, by skipping area calculations for MULTIPOLYGON geometries that have only a single outer ring. Not sure what more can be done; profiling didn't enlighten me.
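The shortcut described above can be sketched roughly as follows. This is not the code from the commit: `largest_part()` and `centroid_of_largest()` are hypothetical helper names used only for illustration, and the sketch assumes planar (projected) coordinates, as in the reprex after `st_transform(nc, 32119)`.

```r
library(sf)

# Hypothetical helper (not part of sf): return the largest polygon of one
# geometry. Areas are only computed when a MULTIPOLYGON has more than one part.
largest_part <- function(g) {
  if (!inherits(g, "MULTIPOLYGON") || length(g) <= 1L) {
    return(g)  # nothing to compare, so skip the area calculation entirely
  }
  # Each element of a MULTIPOLYGON is a list of rings, i.e. one polygon.
  parts <- st_sfc(lapply(g, st_polygon))
  # No CRS is set on `parts`, so st_area() returns plain planar areas.
  parts[[which.max(as.numeric(st_area(parts)))]]
}

# Hypothetical wrapper: centroid of the largest polygon of each feature.
centroid_of_largest <- function(x) {
  geom <- st_geometry(x)
  st_centroid(st_sfc(lapply(geom, largest_part), crs = st_crs(geom)))
}

# Small check: a unit square plus a 4 x 4 square.
small <- list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))
large <- list(rbind(c(10, 10), c(14, 10), c(14, 14), c(10, 14), c(10, 10)))
mp <- st_multipolygon(list(small, large))
res <- centroid_of_largest(st_sfc(mp))
```

Here the centroid should fall inside the larger square, since the smaller part is discarded before `st_centroid()` is called. A per-feature `lapply()` like this is only meant to show the logic; a faster implementation would compute all areas in one vectorised call.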
That conclusion was clearly premature.
I now see a slowdown of only about a factor of 4:
library(sf)
# Linking to GEOS 3.5.1, GDAL 2.1.2, proj.4 4.9.3
library(tidyverse)
# ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
# ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.4
# ✔ tibble  1.4.1          ✔ dplyr   0.7.4
# ✔ tidyr   0.7.2          ✔ stringr 1.2.0
# ✔ readr   1.1.1          ✔ forcats 0.2.0
# ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()
library(rbenchmark)
demo(nc, ask = FALSE, echo = FALSE)
# Reading layer `nc.gpkg' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.4/sf/gpkg/nc.gpkg' using driver `GPKG'
# Simple feature collection with 100 features and 14 fields
# Attribute-geometry relationship: 0 constant, 8 aggregate, 6 identity
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# epsg (SRID): 4267
# proj4string: +proj=longlat +datum=NAD27 +no_defs
nc <- st_transform(nc, 32119)
nc_1e4 <- list(nc) %>% rep(times = 1e2) %>% reduce(rbind)
cols <- c("elapsed", "user.self", "sys.self")
benchmark({
  st_centroid(nc_1e4, of_largest_polygon = FALSE)
}, replications = 5, columns = cols)
# elapsed user.self sys.self
# 1 0.848 0.84 0.008
benchmark({
  st_centroid(nc_1e4, of_largest_polygon = TRUE)
}, replications = 5, columns = cols)
# elapsed user.self sys.self
# 1 3.226 3.208 0.016
I get the same results on my machine. Excellent work.
Closing the issue for now - as always, thanks for the quick response!