Sf: How to remove duplicate geometries?

Created on 13 Mar 2018 · 17Comments · Source: r-spatial/sf

I'm trying to remove duplicate geometries, in this case points.
There are several ways to do so: my first idea was to use dplyr::distinct(), but it does not seem to work for geometry columns.

Some examples below:

#Create example dataset
library(sf)
library(dplyr)
d <- structure(list(layer = 274.146911621094, geometry = structure(list(
    `1` = structure(list(structure(c(-3162000, -3150000, -3150000, 
    -3162000, -3162000, 3162000, 3162000, 3150000, 3150000, 3162000
    ), .Dim = c(5L, 2L))), class = c("XY", "POLYGON", "sfg"))), .Names = "1", class = c("sfc_POLYGON", 
"sfc"), precision = 0, bbox = structure(c(-3162000, 3150000, 
-3150000, 3162000), .Names = c("xmin", "ymin", "xmax", "ymax"
), class = "bbox"), crs = structure(list(epsg = NA_integer_, 
    proj4string = "+proj=lcc +lat_1=30 +lat_2=65 +lat_0=48 +lon_0=9.75 +x_0=-6000 +y_0=-6000 +a=6371229 +b=6371229 +units=m +no_defs"), .Names = c("epsg", 
"proj4string"), class = "crs"), n_empty = 0L)), .Names = c("layer", 
"geometry"), row.names = 1L, class = c("sf", "data.frame"), sf_column = "geometry", agr = structure(NA_integer_, .Names = "layer", .Label = c("constant", 
"aggregate", "identity"), class = "factor"))
dpoint <- (st_cast(d, "POINT"))

#Now let's try to eliminate the duplicate point: 4 different ways come to mind
dpoint %>% distinct(geometry) #does nothing   <---- would be my preferred solution
st_intersection(dpoint) #Works, adds columns
st_cast(st_union(dpoint), "POINT") #Works
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) #Works

Now for some performance testing on a larger dataset which I do not attach (10k points, most of which duplicated):

library(microbenchmark)
mb <- microbenchmark(times=10,
st_intersection(dpoint),
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: milliseconds
                                                                                                        expr
                                                                                     st_intersection(dpoint)
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
        min         lq        mean     median          uq         max neval cld
 9683.35427 9777.54914 10036.55727 9975.24093 10205.31676 10667.48132    10   b
  106.32318  108.49353   115.39371  110.64525   111.65030   143.65393    10  a 
   37.90596   38.33749    38.86942   38.61979    39.36003    40.67645    10  a

And on a much larger dataset (1.4M points), for the two fastest methods:

mb <- microbenchmark(times=10,
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: seconds
                                                                                                        expr
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
      min        lq     mean    median        uq       max neval cld
 14.26685 14.531474 15.70820 15.675282 16.283850 18.349245    10   b
  5.06904  5.166566  5.98637  5.647217  6.737356  7.637938    10  a

Is there any faster, more elegant method? Can't dplyr::distinct(geometry) be made to work?

Source

adrfantini

👍2

Most helpful comment

@adrfantini distinct.sf currently drops all non geometry columns if ... isn't specified. This is inconsistent with the dplyr behavior which assumes ... is all variables if omitted.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc2 <- rbind(nc,nc)

nc2 %>% distinct()
#> Simple feature collection with 100 features and 0 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> First 10 features:
#>                          geometry
#> 1  MULTIPOLYGON (((-81.47276 3...
#> 2  MULTIPOLYGON (((-81.23989 3...
#> 3  MULTIPOLYGON (((-80.45634 3...
#> 4  MULTIPOLYGON (((-76.00897 3...
#> 5  MULTIPOLYGON (((-77.21767 3...
#> 6  MULTIPOLYGON (((-76.74506 3...
#> 7  MULTIPOLYGON (((-76.00897 3...
#> 8  MULTIPOLYGON (((-76.56251 3...
#> 9  MULTIPOLYGON (((-78.30876 3...
#> 10 MULTIPOLYGON (((-80.02567 3...

^{Created on 2019-02-27 by the reprex package (v0.2.1.9000)}

kendonB on 27 Feb 2019

👍6

All 17 comments

Yes, this can be made to work. It should (I guess) use st_equals to add a column with a label indicating which geometries are distinct, then use the normal distinct() to work on that column, then replace the labels with geometries.

edzer on 13 Mar 2018

I might be wrong, but I think st_difference(dpoint) drops duplicates.

tim-salabim on 13 Mar 2018

@edzer You mean this could be what a possible sf::distinct() could do? I tested on the 10k lines dataset and st_equals() takes about 500ms. st_difference() (@tim-salabim) also works, but takes ~7.5s.

Compare that to reducing the coordinates to data.frame, using distinct and recreating the sf: that takes ~40 milliseconds.

adrfantini on 13 Mar 2018

@adrfantini yes; I was thinking of a generic distinct.sf method that works for any geometry.

edzer on 13 Mar 2018

That'd be a nice feature to be had. For it to make sense it should be as fast as or faster than transforming to data.frame and distincting on that tho.

adrfantini on 13 Mar 2018

For _that_, I'd look forward to a PR.

edzer on 13 Mar 2018

It looks like unique.data.frame simply works for sf objects; any objections against using that?

edzer on 14 Mar 2018

Turned out there was a distinct.sf method that ignored geometries; I now added one that does the st_equals trick. Looking forward to your benchmarks, and do compare with unique.data.frame too!

edzer on 14 Mar 2018

I'll be glad to. Give me a few days.

adrfantini on 14 Mar 2018

library(raster)
library(spex)
library(sf)
library(stars)
library(dplyr)
library(microbenchmark)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
sfPolygons <- spex::polygonize(raster(tif)[[1]]) %>% select(geometry)
sfPoints <- st_cast(sfPolygons, "POINT")[1:10000,]
#sfPoints has 5 features for each feature in sfPolygon
(mb <- microbenchmark(times=1,
r1 <- unique.data.frame(sfPoints),
#r2 <- distinct.sf(sfPoints),
r3 <- st_difference(sfPoints),
r4 <- st_intersection(sfPoints),
r5 <- st_cast(st_union(sfPoints), "POINT"),
r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,Y), coords=1:2, crs=st_crs(sfPoints))
))

Result:

Unit: milliseconds
                                                                                                                  expr
                                                                                     r1 <- unique.data.frame(sfPoints)
                                                                                         r3 <- st_difference(sfPoints)
                                                                                       r4 <- st_intersection(sfPoints)
                                                                            r5 <- st_cast(st_union(sfPoints), "POINT")
 r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(sfPoints))
        min         lq       mean     median         uq        max neval
   57.00043   57.00043   57.00043   57.00043   57.00043   57.00043     1
 3944.29434 3944.29434 3944.29434 3944.29434 3944.29434 3944.29434     1
 4900.58573 4900.58573 4900.58573 4900.58573 4900.58573 4900.58573     1
   63.41992   63.41992   63.41992   63.41992   63.41992   63.41992     1
   22.08911   22.08911   22.08911   22.08911   22.08911   22.08911     1

I did not run distinct.sf because Error in NextMethod() : generic function not specified.

adrfantini on 18 Mar 2018

add r7 <- distinct(sfPoints) ?

edzer on 18 Mar 2018

distinct(sfPoints) does not remove duplicates as far as I can tell.

adrfantini on 19 Mar 2018

Embarrassing! Now it should work; note that geometries and attributes both need to be distinct.

edzer on 19 Mar 2018

Different system from before, expect different numbers. I replaced distinct.sf with distinct.

Unit: milliseconds
                                                                                                                  expr
                                                                                     r1 <- unique.data.frame(sfPoints)
                                                                                              r2 <- distinct(sfPoints)
                                                                                         r3 <- st_difference(sfPoints)
                                                                                       r4 <- st_intersection(sfPoints)
                                                                            r5 <- st_cast(st_union(sfPoints), "POINT")
 r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(sfPoints))
        min         lq       mean     median         uq        max neval
   76.70377   76.70377   76.70377   76.70377   76.70377   76.70377     1
  608.87942  608.87942  608.87942  608.87942  608.87942  608.87942     1
 5520.57342 5520.57342 5520.57342 5520.57342 5520.57342 5520.57342     1
 6429.61601 6429.61601 6429.61601 6429.61601 6429.61601 6429.61601     1
   92.78926   92.78926   92.78926   92.78926   92.78926   92.78926     1
   34.16400   34.16400   34.16400   34.16400   34.16400   34.16400     1

adrfantini on 19 Mar 2018

I think the original question of this issue is now suffiiently answered: use distinct (and remove all attributes if you only want to work with geometries).

edzer on 26 Mar 2018

@adrfantini distinct.sf currently drops all non geometry columns if ... isn't specified. This is inconsistent with the dplyr behavior which assumes ... is all variables if omitted.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc2 <- rbind(nc,nc)

nc2 %>% distinct()
#> Simple feature collection with 100 features and 0 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> First 10 features:
#>                          geometry
#> 1  MULTIPOLYGON (((-81.47276 3...
#> 2  MULTIPOLYGON (((-81.23989 3...
#> 3  MULTIPOLYGON (((-80.45634 3...
#> 4  MULTIPOLYGON (((-76.00897 3...
#> 5  MULTIPOLYGON (((-77.21767 3...
#> 6  MULTIPOLYGON (((-76.74506 3...
#> 7  MULTIPOLYGON (((-76.00897 3...
#> 8  MULTIPOLYGON (((-76.56251 3...
#> 9  MULTIPOLYGON (((-78.30876 3...
#> 10 MULTIPOLYGON (((-80.02567 3...

^{Created on 2019-02-27 by the reprex package (v0.2.1.9000)}

kendonB on 27 Feb 2019

👍6

Maybe then this could be reopened

adrfantini on 27 Feb 2019

👍1

Was this page helpful?

0 / 5 - 0 ratings