I had a similar problem to the one in this SO question, and I started thinking that maybe sf could use a filtering function, e.g. st_filter().
Since the default error when one or both objects are not sf objects complains that the CRSs don't match, I've also added a more informative error message.
The default predicate is st_intersects, but one can specify another, e.g. st_filter(x, y, st_crosses).
What do you think?
library(sf)
library(tidyverse)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  # fail early with a clear message if either input lacks a geometry column
  geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  null_geoms <- sapply(geoms_attr, is.null)
  if (any(null_geoms)) {
    stop("`st_geometry` should not be `NULL`")
  }
  .x[f(.x, .y, sparse = FALSE), ]
}
nc %>% st_filter(mtcars)
#> Error in st_filter(., mtcars): `st_geometry` should not be `NULL`
my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>%
  st_as_sf(coords = c("x", "y")) %>%
  summarize() %>%
  st_cast("LINESTRING") %>%
  st_set_crs(4267)

ggplot(nc) +
  geom_sf() +
  geom_sf(data = my_line)

nc %>%
  st_filter(my_line) %>%
  ggplot() +
  geom_sf()
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar

Created on 2019-09-12 by the reprex package (v0.3.0)
I like this concept a lot. I grepped a few recent projects and found that more than half of my st_join calls were doing this operation via st_join(..., left = FALSE). An inner join has the same row-subsetting consequences as the proposed st_filter, but the downside is that it drags extra columns along. I often use st_join(..., y = dplyr::select(foo, geometry), left = FALSE) or similar.
A minor note: it's counterintuitive to me, but I've found that st_join(..., left = FALSE) is consistently faster for subsetting than using the geometry predicates like st_intersects directly. Here's an example building on the code above:
library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  # Leaving the error checks out for a direct comparison in function timings
  # geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  # null_geoms <- sapply(geoms_attr, is.null)
  # if (any(null_geoms)) {
  #   stop("`st_geometry` should not be `NULL`")
  # }
  .x[f(.x, .y, sparse = FALSE), ]
}
my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>%
  st_as_sf(coords = c("x", "y")) %>%
  summarize() %>%
  st_cast("LINESTRING") %>%
  st_set_crs(4267)
suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = nc, y = my_line,
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = nc, .y = my_line),
    times = 100
  )
}))
Results are pretty close here on the small dataset:
Unit: milliseconds
expr min lq mean median uq max neval
sf::st_join(x = nc, y = my_line, join = sf::st_intersects, left = FALSE) 3.475222 3.555523 3.813325 3.645118 3.894801 8.765784 100
st_filter(.x = nc, .y = my_line) 3.637339 3.689342 3.877169 3.722460 3.899177 9.239481 100
But the timing gap grows considerably over larger datasets -- here st_join's median time is roughly 10% faster:
guilford_roads <- tigris::roads(state = "37", county = "081", class = "sf") %>%
  sf::st_transform(4267)
suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = guilford_roads, y = my_line,
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = guilford_roads, .y = my_line),
    times = 25
  )
}))
Unit: milliseconds
expr min lq mean median uq max neval
sf::st_join(x = guilford_roads, y = my_line, join = sf::st_intersects, left = FALSE) 183.3485 187.7024 192.1303 190.482 197.0052 205.0862 25
st_filter(.x = guilford_roads, .y = my_line) 194.5699 200.8530 207.8659 207.605 210.8725 230.3960 25
When subsetting really large sf objects like US Census Block files (~11M detailed shapes), I have found nearly 20% improvements consistently. I can't explain this behavior offhand.
@pasipasi123 that approach does not hold when my_line (or .y) holds more than one feature:
ml = rbind(my_line, my_line)
nc %>% st_filter(ml)
...
Simple feature collection with 16 features and 14 fields (with 8 geometries empty)
...
Two approaches currently work:
nc[ml,] # which also has an op argument to control which predicate to use
and
nc %>% filter(lengths(st_intersects(., ml)) > 0)
Am I right that your proposal is to make st_filter work like the second, but written as
nc %>% st_filter(ml)
with a parameter to control the predicate, defaulting to st_intersects?
Yes, that's what I was after.
OK, that would then be
st_filter = function(.x, .y, .pred = st_intersects) {
  # check that dplyr is loaded, then
  filter(.x, lengths(.pred(.x, .y)) > 0)
}

nc %>% st_filter(ml)
@sheffe I think there is a substantial difference: when y has more than one feature, st_join may return more records than x has; filter never would.
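That difference can be shown without sf at all. Below is a minimal base-R sketch (using merge() as a stand-in for an inner st_join, with made-up data frames x and y, not from the thread) illustrating that join semantics can duplicate rows of x when several features of y match, while filter semantics keep each row of x at most once:

```r
# x: three features; y: two features that both "match" feature 1
x <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
y <- data.frame(id = c(1, 1), tag = c("first", "second"))

# Inner-join semantics (like st_join(..., left = FALSE)): a row of x is
# repeated once per matching row of y, so the result can have MORE rows
# than x.
joined <- merge(x, y, by = "id")
nrow(joined)  # 2: feature 1 appears twice

# Filter semantics (the proposed st_filter): each row of x is kept at
# most once, no matter how many rows of y it matches.
filtered <- x[x$id %in% y$id, ]
nrow(filtered)  # 1: feature 1 appears once
```

So even where the timings favor an inner join, the two operations only coincide when every feature of x matches at most one feature of y.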