I recently found myself scratching my head over how to do a pretty common GIS procedure using the sf package: join one set of multipolygon data to another based on the highest amount of spatial intersection.
A quick Stackoverflow search turned up a postgis solution, but reverse engineering it with sf took me quite a bit of time.
That made me wish that the sf documentation had an example showing how to apply a sequence of steps that combine verbs from sf and other tidyverse packages like dplyr, tidyr, and purrr.
After my project I threw together a reprex to illustrate the approach that I settled on - my hope is that it sparks a discussion that then informs a new section of the geometry manipulation article (or perhaps a separate article entirely).
Here's an image (reprex below) that uses the nc dataset to illustrate the result I was looking for:


Update:
Here's my approach wrapped up in a function:
st_intersects_most <- function(x, y){
ints <- st_intersects(x,y)
tibble(
ROW_ID = imap(ints, ~rep(.y, times = length(.x))) %>% flatten_chr(),
INTERSECT_DF = flatten_int(ints) %>% map(~y[.x,] %>% st_drop_geometry), # note: see the reprex for st_drop_geometry's definition
INTERSECT_AREA = st_intersection(x,y) %>% st_area %>% map_dbl(1)
) %>%
unnest %>%
mutate(geometry = st_geometry(x[ROW_ID,])) %>%
st_as_sf
}
It requires the user to follow up with group_by %>% top_n(1, INTERSECT_AREA) %>% ungroup - that could probably be improved.
note: the reprex below has been updated to show the above function in action
Reprex + Session Info
# SETUP ----
library(magrittr)
library(tidyverse)
library(ggplot2) # devtools::install_github('tidyverse/ggplot2) for `geom_sf()`
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(scales)
library(gridExtra)
# LOAD DATA ----
nc_sf <-
st_read(system.file("shape/nc.shp", package="sf")) %>%
st_transform(2264) # NC state plane, US foot
# CREATE GRID ----
grid_sf <-
tibble(geometry = st_make_grid(nc_sf)) %>%
st_sf %>%
bind_cols(crossing(X=1:10,Y=1:10)) %>%
mutate(CENTROID = map(geometry, st_centroid),
COORDS = map(CENTROID, st_coordinates),
COORDS_X = map_dbl(COORDS,1),
COORDS_Y = map_dbl(COORDS,2),
COORD_COMB = map2_dbl(COORDS_X,COORDS_Y,~ sum(abs(.x)*1e5,abs(.y)))) %>%
arrange(desc(COORD_COMB)) %>%
mutate(GRID_ID = purrr::cross2(LETTERS[1:10],str_pad(1:10,2,side = "left",pad = "0")) %>% map_chr(str_c, collapse = "")) %>%
as_tibble() %>%
st_as_sf()
# SPATIAL JOIN FUNCTION ----
st_drop_geometry <- function (x)
{
if (inherits(x, "sf")) {
x <- st_set_geometry(x, NULL)
class(x) <- "data.frame"
x <- as_tibble(x)
}
return(x)
}
st_intersects_most <- function(x, y){
ints <- st_intersects(x,y)
tibble(
ROW_ID = imap(ints, ~rep(.y, times = length(.x))) %>% flatten_chr(),
INTERSECT_DF = flatten_int(ints) %>% map(~y[.x,] %>% st_drop_geometry),
INTERSECT_AREA = st_intersection(x,y) %>% st_area %>% map_dbl(1)
) %>%
unnest %>%
mutate(geometry = st_geometry(x[ROW_ID,])) %>%
st_as_sf
}
# SPATIAL JOIN ----
nc_grid_sf <-
st_intersects_most(nc_sf, grid_sf) %>%
group_by(ROW_ID) %>%
top_n(1, INTERSECT_AREA) %>%
ungroup %>%
select(-CENTROID,-COORDS,-COORDS_X,-COORDS_Y,-COORD_COMB) %>%
mutate(CENTROID = map(geometry, st_centroid),
COORDS = map(CENTROID, st_coordinates),
COORDS_X = map_dbl(COORDS,1),
COORDS_Y = map_dbl(COORDS,2),
COORD_COMB = map2_dbl(COORDS_X,COORDS_Y,~ sum(abs(.x)*1e5,abs(.y))))
## Warning: attribute variables are assumed to be spatially constant
## throughout all geometries
# PLOT RESULTS ----
# Create a discrete color scale for each grid rectangle
scale_fill_grid <- function(...){
cols <- rep(brewer_pal(type = "qual", palette = 3)(12),times = 9)[1:100]
vals <- grid_sf$GRID_ID
ggplot2::scale_fill_manual(...,
values = setNames(cols,
vals))
}
# Grid beside counties
g1 <- ggplot()
g1 <- g1 + geom_sf(data = grid_sf, color = "black", fill = "transparent")
g1 <- g1 + geom_text(data = grid_sf, mapping = aes(COORDS_X, COORDS_Y, label = GRID_ID))
g1 <- g1 + coord_sf(crs = st_crs(grid_sf), datum = NA)
g1 <- g1 + theme_void()
g1 <- g1 + theme(legend.position = "none")
g2 <- ggplot()
g2 <- g2 + geom_sf(data = nc_sf, color = "white")
g2 <- g2 + coord_sf(crs = st_crs(nc_sf), datum = NA)
g2 <- g2 + theme_void()
g2 <- g2 + theme(legend.position = "none")
# Grid over counties
g3 <- ggplot()
g3 <- g3 + geom_sf(data = grid_sf, mapping = aes(fill = GRID_ID), alpha = .5, color = "transparent")
g3 <- g3 + geom_sf(data = nc_sf, color = "black", fill = "transparent")
g3 <- g3 + scale_fill_grid()
g3 <- g3 + geom_text(data = grid_sf, mapping = aes(COORDS_X, COORDS_Y, label = GRID_ID))
g3 <- g3 + coord_sf(crs = st_crs(grid_sf), datum = NA)
g3 <- g3 + theme_void()
g3 <- g3 + theme(legend.position = "none")
g4 <- ggplot()
g4 <- g4 + geom_sf(data = nc_grid_sf, aes(fill = GRID_ID), alpha = .5, color = "white")
g4 <- g4 + scale_fill_grid()
g4 <- g4 + geom_text(data = nc_grid_sf, mapping = aes(COORDS_X, COORDS_Y, label = GRID_ID), size = 2)
g4 <- g4 + geom_sf(data = grid_sf, fill = "transparent", color = "black")
g4 <- g4 + coord_sf(crs = st_crs(grid_sf), datum = NA)
g4 <- g4 + theme_void()
g4 <- g4 + theme(legend.position = "none")
grid.arrange(g2, g1, ncol=1)
grid.arrange(g3, g4, ncol=1)
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Los_Angeles
## date 2017-12-02
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.2)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.2)
## bitops 1.0-6 2013-08-17 CRAN (R 3.4.0)
## broom 0.4.2 2017-02-13 CRAN (R 3.4.0)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.2)
## class 7.3-14 2015-08-30 CRAN (R 3.4.0)
## classInt 0.1-24 2017-04-16 CRAN (R 3.4.2)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
## compiler 3.4.0 2017-04-21 local
## crayon 1.3.4 2017-10-30 Github (r-lib/crayon@b5221ab)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.2)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## e1071 1.6-8 2017-02-02 CRAN (R 3.4.2)
## evaluate 0.10 2016-10-11 CRAN (R 3.4.0)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.2)
## foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
## ggplot2 * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)
## glue 1.1.1.9000 2017-10-18 Github (tidyverse/glue@808054b)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gridExtra * 2.3 2017-09-09 CRAN (R 3.4.2)
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.2)
## hms 0.3 2016-11-22 CRAN (R 3.4.2)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.2)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.16 2017-05-18 CRAN (R 3.4.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2)
## magrittr * 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.2)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.0 2017-04-21 local
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.2)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4.9000 2017-10-24 Github (tidyverse/purrr@fc8bea1)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.1)
## Rcpp 0.12.14 2017-11-23 CRAN (R 3.4.2)
## RCurl 1.95-4.8 2016-03-01 CRAN (R 3.4.1)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.2)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.2)
## rlang 0.1.4 2017-11-05 CRAN (R 3.4.2)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.2)
## scales * 0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)
## sf * 0.5-5 2017-10-31 CRAN (R 3.4.0)
## stats * 3.4.0 2017-04-21 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.3.4 2017-08-22 CRAN (R 3.4.2)
## tidyr * 0.7.2 2017-10-16 CRAN (R 3.4.2)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.2)
## tools 3.4.0 2017-04-21 local
## udunits2 0.13 2016-11-17 CRAN (R 3.4.1)
## units 0.4-6 2017-08-27 CRAN (R 3.4.2)
## utils * 3.4.0 2017-04-21 local
## withr 2.1.0.9000 2017-12-02 Github (jimhester/withr@fe81c00)
## XML 3.98-1.9 2017-06-19 CRAN (R 3.4.1)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.2)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
That's a nice article, and IMO a candidate to be wrapped into something simpler: a new function or an option under an existing one. Does this activity have a name in GIS or spatial database world?
I wrote sf::st_interpolate_aw for area-weighted interpolation, which is what you'd do when the labels are continuous variables. For me, intuitively I would search for such functionality under st_join.
I revised the example to show how this might work as a function instead of a series of mutate steps.
I'm not sure if this activity has a formal name in GIS. I've been calling it an "area-weighted spatial join" but others have different names for it:
Sometimes called majority fill or maximum area overlay -- but I believe a standard (faster?) approach is to rasterize the base layer to an appropriate resolution and then use raster::extract() to return the mode/median raster value in each polygon.
@tiernanmartin something went wrong, it seems, e.g. when assigning F08 and H04, in your plot. I now integrated a somewhat different approach in st_join, honoring the rest of st_join's conventions wrt renaming duplicate columns, and left option. Output from ?st_join below:

@mbacou "maximum area overlay" sounds good to me and thanks for the the tip about using raster::extract()
@edzer You're right - the function version above contains a mistake. I'll see if I can sleuth it out...
I played around with st_join(..., largest = TRUE) and I'm really pleased with the results. The example in ?st_join is quite clear. Thanks for the quick turn-around with f16249d and I'll let you know if I run into any unexpected behavior.
I haven't tested it with large datasets yet, but I'm wondering if this may be another instance where st_intersection would benefit from vectorization (#575)?
"largest" parameter as is now in st_join would also add a lot to raster::extract(), as an additional option for "fun" https://www.rdocumentation.org/packages/raster/versions/3.1-5/topics/extract
But that is an issue for raster, right?
Most helpful comment
@tiernanmartin something went wrong, it seems, e.g. when assigning

F08andH04, in your plot. I now integrated a somewhat different approach inst_join, honoring the rest ofst_join's conventions wrt renaming duplicate columns, andleftoption. Output from?st_joinbelow: