I had a similar problem to the one in this SO question, and I started thinking that maybe sf could use a filtering function, e.g. st_filter().
Since the default error when one or both objects are not sf objects complains that the CRSs don't match, I've also added a more informative error message.
The default predicate is st_intersects, but one can specify another, e.g. st_filter(x, y, st_crosses).
What do you think?
library(sf)
library(tidyverse)

nc <- st_read(system.file("shape/nc.shp", package = "sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  # fail early with a clear message if either input lacks a geometry column
  geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  null_geoms <- sapply(geoms_attr, is.null)
  if (any(null_geoms)) {
    stop("`st_geometry` should not be `NULL`")
  }
  .x[f(.x, .y, sparse = FALSE), ]
}
nc %>% st_filter(mtcars)
#> Error in st_filter(., mtcars): `st_geometry` should not be `NULL`
my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>%
  st_as_sf(coords = c("x", "y")) %>%
  summarize() %>%
  st_cast("LINESTRING") %>%
  st_set_crs(4267)

ggplot(nc) +
  geom_sf() +
  geom_sf(data = my_line)

nc %>%
  st_filter(my_line) %>%
  ggplot() +
  geom_sf()
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar

Created on 2019-09-12 by the reprex package (v0.3.0)
I like this concept a lot. I grepped a few recent projects and found that more than half of my st_join calls were doing this operation via st_join(..., left = FALSE). An inner join has the same row-subsetting consequences as the proposed st_filter, but the downside is that it drags extra columns along. I often use st_join(..., y = dplyr::select(foo, geometry), left = FALSE) or similar.
A minor note: it's counterintuitive to me, but I've found that st_join(..., left = FALSE) is consistently faster for subsetting than using the geometry predicates like st_intersects directly. Here's an example building on the code above:
library(sf)
library(tidyverse)
nc <- st_read(system.file("shape/nc.shp", package = "sf"))

st_filter <- function(.x, .y, f = st_intersects) {
  # Leaving the error checks out for a direct comparison in function timings
  # geoms_attr <- sapply(list(.x, .y), attr, "sf_column")
  # null_geoms <- sapply(geoms_attr, is.null)
  # if (any(null_geoms)) {
  #   stop("`st_geometry` should not be `NULL`")
  # }
  .x[f(.x, .y, sparse = FALSE), ]
}
my_line <- tibble(name = "my line", x = c(-81, -79), y = c(34.5, 37)) %>%
  st_as_sf(coords = c("x", "y")) %>%
  summarize() %>%
  st_cast("LINESTRING") %>%
  st_set_crs(4267)
suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = nc, y = my_line,
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = nc, .y = my_line),
    times = 100
  )
}))
Results are pretty close here on the small dataset:
Unit: milliseconds
expr min lq mean median uq max neval
sf::st_join(x = nc, y = my_line, join = sf::st_intersects, left = FALSE) 3.475222 3.555523 3.813325 3.645118 3.894801 8.765784 100
st_filter(.x = nc, .y = my_line) 3.637339 3.689342 3.877169 3.722460 3.899177 9.239481 100
But the timing gap grows considerably over larger datasets -- here st_join's median time is roughly 10% faster:
guilford_roads <- tigris::roads(state = "37", county = "081", class = "sf") %>%
  sf::st_transform(4267)
suppressMessages(suppressWarnings({
  microbenchmark::microbenchmark(
    sf::st_join(x = guilford_roads, y = my_line,
                join = sf::st_intersects, left = FALSE),
    st_filter(.x = guilford_roads, .y = my_line),
    times = 25
  )
}))
Unit: milliseconds
expr min lq mean median uq max neval
sf::st_join(x = guilford_roads, y = my_line, join = sf::st_intersects, left = FALSE) 183.3485 187.7024 192.1303 190.482 197.0052 205.0862 25
st_filter(.x = guilford_roads, .y = my_line) 194.5699 200.8530 207.8659 207.605 210.8725 230.3960 25
When subsetting really large sf objects like US Census Block files (~11M detailed shapes), I have found nearly 20% improvements consistently. I can't explain this behavior offhand.
@pasipasi123 that approach does not hold when my_line (or .y) holds more than one feature:
ml = rbind(my_line, my_line)
nc %>% st_filter(ml)
...
Simple feature collection with 16 features and 14 fields (with 8 geometries empty)
...
Two approaches currently work:
nc[ml,] # which also has an op argument to control which predicate to use
and
nc %>% filter(lengths(st_intersects(., ml)) > 0)
Am I right that your proposal is to make st_filter work like the second, but written as
nc %>% st_filter(ml)
with a parameter to control the predicate, defaulting to st_intersects?
Yes, that's what I was after.
OK, that would then be
st_filter = function(.x, .y, .pred = st_intersects) {
  # check that dplyr is loaded, then
  filter(.x, lengths(.pred(.x, .y)) > 0)
}

nc %>% st_filter(ml)
@sheffe I think there is a substantial difference: when y has more than one feature, st_join may return more records than x has; filter never would.
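That difference can be shown without sf at all. Below is a minimal base-R sketch (using merge() as a stand-in for an inner st_join, with made-up data frames x and y, not from the thread) illustrating that join semantics can duplicate rows of x when several features of y match, while filter semantics keep each row of x at most once:

```r
# x: three features; y: two features that both "match" feature 1
x <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
y <- data.frame(id = c(1, 1), tag = c("first", "second"))

# Inner-join semantics (like st_join(..., left = FALSE)): a row of x is
# repeated once per matching row of y, so the result can have MORE rows
# than x.
joined <- merge(x, y, by = "id")
nrow(joined)  # 2: feature 1 appears twice

# Filter semantics (the proposed st_filter): each row of x is kept at
# most once, no matter how many rows of y it matches.
filtered <- x[x$id %in% y$id, ]
nrow(filtered)  # 1: feature 1 appears once
```

So even where the timings favor an inner join, the two operations only coincide when every feature of x matches at most one feature of y.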