Sf: Vectorize st_intersection

Created on 29 Nov 2017  路  30Comments  路  Source: r-spatial/sf

Currently st_intersection(x, y) intersects all geometries in x with all geometries in y. Would it be possible to have an option that can be vectorized to compute intersections between pairs of x and y?

Background for this is that I have a very large sf object and want to check for overlapping polygons. I am first using st_intersect to figure out which polygons have intersections, and want to take the output from that to make pairs of sf objects that I use as the parameters for st_intersection. Right now, I am combining st_intersection with map from purrr, but this has proven to be very computationally demanding.

Any insight would be greatly appreciated!

Most helpful comment

And the combination of both runs in less than a minute for the whole set, just like ArcGIS:

library(sf) 

# files <- download.file("http://www.geoinf.uni-jena.de/~bi28yuv/st_intersection_reprex/files.zip", "files.zip")
# unzip("files.zip")

points <- st_read("points.shp")
polygons <- st_read("poly.shp") %>% st_cast("POLYGON")

system.time({
  tst = st_join(points, polygons)
})

   user  system elapsed 
 32.698   0.012  32.708

head(tst)
Simple feature collection with 6 features and 2 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 532900 ymin: 4702500 xmax: 533500 ymax: 4702700
epsg (SRID):    32630
proj4string:    +proj=utm +zone=30 +datum=WGS84 +units=m +no_defs
  OBJECTID COD_LITOLO               geometry
1        1         01 POINT (533100 4702500)
2        2         08 POINT (533300 4702500)
3        3         01 POINT (533500 4702500)
4        4         01 POINT (532900 4702700)
5        5         01 POINT (533100 4702700)
6        6         08 POINT (533300 4702700)

I think we know what ArcGIS is doing in case of point-polygon intersection - use binary predicates to get intersection ids and join attributes based on these ids.

All 30 comments

Can you provide a reproducible example? I have no idea what you mean by 'very large' but I routinely use st_intersection on layers of all feature types with 200k+ rows and 30+ columns. I have found computing time to be generally acceptable (<10-15 minutes usually).

... and do give your sessionInfo()!

Sorry here's the sessionInfo

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] sf_0.5-6

loaded via a namespace (and not attached):
 [1] magrittr_1.5    class_7.3-14    DBI_0.7         tools_3.3.2     units_0.4-6     Rcpp_0.12.14    udunits2_0.13  
 [8] grid_3.3.2      e1071_1.6-8     texreg_1.36.23  classInt_0.1-24

I can't share the proprietary dataset, but it is has ~500,000 multipolygons/polygons that I want to intersect with itself. Below is an example of how I am currently doing pairwise intersection using 4 simple polygons

library(dplyr)
library(sf)
library(purrr)

# creating the polygons
df1 <- 
  list(matrix(c(2,4,4,1,2, 2,3,5,4,2), ncol = 2), 
       matrix(c(5,4,2,5, 2,3,2,2), ncol = 2)) %>%
  map(function(x) st_polygon(list(x))) %>%
  st_sfc

df2 <- 
  list(matrix(c(5, 3, 2, 5, 2, 3, 2, 2), ncol = 2),
       matrix(c(2, 3, 3, 2, 2, 3, 3, 5, 4, 3), ncol = 2)) %>%
  map(function(x) st_polygon(list(x))) %>%
  st_sfc

# Output
> st_intersection(df1, df2)
Geometry set for 3 features 
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 2 ymin: 2 xmax: 5 ymax: 4.666667
epsg (SRID):    NA
proj4string:    NA
POLYGON ((3.5 2.75, 2 2, 3 3, 3.5 2.75))
POLYGON ((2 2, 3.5 2.75, 5 2, 2 2))
POLYGON ((2.5 4.5, 3 4.66666666666667, 3 3, 2 3...

# What I would like to do for pairwise intersection
> map2(df1, df2, st_intersection)
[[1]]
POLYGON ((3.5 2.75, 2 2, 3 3, 3.5 2.75))

[[2]]
GEOMETRYCOLLECTION EMPTY

I meant the sessionInfo() after loading sf.

I edited the last post to reflect the sessionInfo

Is it right that the st_intersection takes about 20 times as long as st_intersects?

When I run it on the whole dataset, st_intersect takes about 10 min while st_intersection takes more than a day

How much RAM does the machine have?

16 gb. At this point I'm ok letting it just run on my computer overnight - but wanted to take a shot in the dark at seeing if it can be vectorized using st_intersection!

I think your own solution map is the way to go, now.

Here's an approach that uses data.table. I have no idea whether this is actually any faster than your solution using map2, but I would appreciate and be very interested in getting feedback how it performs in your case. We are still learning how to data.table with sf so any experiences with real-world cases are valuable.

library(dplyr)
library(sf)
library(mapview)
library(data.table)

nc = read_sf(system.file("gpkg/nc.gpkg", package="sf"))

sfc1 = st_geometry(nc)
sfc2 = st_jitter(st_geometry(nc), amount = 0.05)

mapview(list(sfc1, sfc2))

dt = data.table(ID = seq_along(sfc1),
                geom1 = sfc1,
                geom2 = sfc2)

dt[, int := list(st_intersection(geom1, geom2)), by = "ID"]

mapview(sfc1[56]) + 
  mapview(sfc2[56], col.regions = "green") + 
  mapview(dt$int[56], col.regions = "red")

Hi @edzer @tim-salabim,

continuing the Twitter discussion here.

Runtime

sf: 1.61 days (points, poly) and 1.63 days (poly, points) on a 2.1 GHz Debian 9 system.
ArcGIS : < 1 min
QGIS: Quit QGIS2 due to being unresponsive after 2h, QGIS3 unchecked yet

Reprex

library(sf)

files <- download.file("http://www.geoinf.uni-jena.de/~bi28yuv/st_intersection_reprex/files.zip", "files.zip")
unzip("files.zip")

points <- st_read("points.shp")
polygons <- st_read("poly.shp")

time <- Sys.time()
out <- st_intersection(points, polygons)
time <- Sys.time() - time
print(time)

Seems like you're trying to rasterize polygons, for which st_intersection might not be the most efficient approach. Maybe try fasterize?

Yes and no.
First, I want to have a point feature to use it for spatial prediction with information from all predictors.That's why I am doing the st_intersection; I essentially just want to extract the polygon information to the points. And for that, st_intersection should be the right tool?

This point feature is gridded because later I'll transform it into a raster with 200x200m so that I have a spatial prediction map.

st_join(points, polygons[1:10, ]) takes about 45 seconds compared to about 6 minutes for st_intersection(points, polygons[1:10, ]). Obviously this ratio may increase or decrease for the whole set. But I think it's worth a try.

The MULTIPOLYGONs are soil types or something like that, and largely overlap. Since the spatial index works from bounding boxes, polygons with mostly overlapping bounding boxes are not efficiently indexed (Tim's example of waterways). Casting object polygons to POLYGON and using that as first argument seems to speed up things quite dramatically:

> p <- polygons %>% st_cast("POLYGON")
Warning message:
In st_cast.sf(., "POLYGON") :
  repeating attributes for all sub-geometries for which they may not be constant
> system.time(out <- st_intersection(p, points[sample(nrow(points),500),]))
   user  system elapsed 
  3.094   0.000   3.094 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 
> system.time(out <- st_intersection(polygons, points[sample(nrow(points),500),]))
   user  system elapsed 
 64.804   0.003  64.808 
Warning message:
attribute variables are assumed to be spatially constant throughout all geometries 

And the combination of both runs in less than a minute for the whole set, just like ArcGIS:

library(sf) 

# files <- download.file("http://www.geoinf.uni-jena.de/~bi28yuv/st_intersection_reprex/files.zip", "files.zip")
# unzip("files.zip")

points <- st_read("points.shp")
polygons <- st_read("poly.shp") %>% st_cast("POLYGON")

system.time({
  tst = st_join(points, polygons)
})

   user  system elapsed 
 32.698   0.012  32.708

head(tst)
Simple feature collection with 6 features and 2 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 532900 ymin: 4702500 xmax: 533500 ymax: 4702700
epsg (SRID):    32630
proj4string:    +proj=utm +zone=30 +datum=WGS84 +units=m +no_defs
  OBJECTID COD_LITOLO               geometry
1        1         01 POINT (533100 4702500)
2        2         08 POINT (533300 4702500)
3        3         01 POINT (533500 4702500)
4        4         01 POINT (532900 4702700)
5        5         01 POINT (533100 4702700)
6        6         08 POINT (533300 4702700)

I think we know what ArcGIS is doing in case of point-polygon intersection - use binary predicates to get intersection ids and join attributes based on these ids.

Awesome!

So lesson learned: Convert MULTIPOLYGON to POLYGON and use st_join in favor of st_intersection if both features overlap (?).

If the runtime difference is that crucial and the outcome is equivalent, maybe a note in the help file for st_intersection would be helpful for users?

I usually use st_join when dealing with point-polygon intersections where geometries don't change. For poly-poly or poly-line etc. it depends on the intended outcome (i.e. if the intersected geometries are needed - st_intersection- or if we merely want to know which intersections exist - st_intersects). The latter is much faster and is used by default in st_join.

Good to know. Could be worth a blog post. Guess I'm not the first and last facing such a problem.
Especially because point-poly intersections are quite common and other users from apps like ArcGIS are used to the "intersection" keyword and therefore are naturally directed to st_intersection instead of st_join.

The st_cast("POLYGON") is nice to additionally speed it up but there should be more awareness of when to use which function (st_join/st_intersection).
This could be reached by a blog post + a note in the help file (or even a warning if a point-poly operation is detected).
I'm sure Edzer will find a good solution :smile:

How to return the output of st_bbox(st_geometry(polygons)[[1]]) as an sfg object, so that lapply() could create a picture of the overlapping envelopes - to make plainer why using precise small envelopes matters for point-in polygon? A side point, is Rcpp::checkUserInterrupt() feasible in geos.cpp? I don't think we did it in rgeos to avoid housekeeping costs.

@rsbivand you can get a sfc with st_as_sfc(st_bbox(polygons[1, ])).
Rcpp::checkUserInterrupt() would be great (I cannot count the times I had to kill the process).

Check geos.cpp: R_CheckUserInterrupt() is in practically every (outer) loop.

@edzer forgot about CAPI. @tim-salabim no progress - need an sf or sfc containing all聽the bbox-derived POLYGON objects. I get to:

library(sf)
mp <- st_read("poly.shp")
o <- lapply(1:nrow(mp), function(i) st_as_sfc(st_bbox(mp[i,])))

but cannot create a single sfc out of n single entity sfc's. Then I want to show how much overlaps for the input gerrymandered case and demonstrate why spatial indexing obviously breaks down when most of the points could be in most of the polygons when indexed by envelope/bbox. The was an idea of a blog or similar, so I thought it would be easy.

Then:

x = do.call(c, o)
i = st_intersection(st_sf(x))
plot(i["n.overlaps"])

ov

This simplification allows you to do

o <- lapply(st_geometry(mp), function(x) st_as_sfc(st_bbox(x)))

Thanks, my mistake was "rbind" not "c" in do.call(). In sp/rgeos, the median bbox overlaps for multipolygons was 47, for polygons was 3. I haven't got to the area proportions being searched multiple times, but the figure as it is shows the root of the problem when doing point in polygon with envelopes overlapping each other.

Right. For POLYGONs:

p <- st_cast(mp, "POLYGON")
op <- lapply(1:nrow(p), function(i) st_as_sfc(st_bbox(p[i,])))
xp = do.call(c, op)
i = st_intersection(st_sf(xp))
plot(i["n.overlaps"], border = NA)

pol

Thanks @rsbivand and @edzer for this very graphic insight into this problem. I think an rspatial blog post about this would be great, as it seems to be a rather common application for users.

Right, my image of the same matches. With the same class intervals as the multipolygon case it is:

image

Was this page helpful?
0 / 5 - 0 ratings

Related issues

thiagoveloso picture thiagoveloso  路  3Comments

dkyleward picture dkyleward  路  4Comments

kendonB picture kendonB  路  3Comments

ekarsten picture ekarsten  路  4Comments

Nowosad picture Nowosad  路  3Comments