Sf: How to remove duplicate geometries?

Created on 13 Mar 2018  路  17Comments  路  Source: r-spatial/sf

I'm trying to remove duplicate geometries, in this case points.
There are several ways to do so: my first idea was to use dplyr::distinct(), but it does not seem to work for geometry columns.

Some examples below:

#Create example dataset
library(sf)
library(dplyr)
d <- structure(list(layer = 274.146911621094, geometry = structure(list(
    `1` = structure(list(structure(c(-3162000, -3150000, -3150000, 
    -3162000, -3162000, 3162000, 3162000, 3150000, 3150000, 3162000
    ), .Dim = c(5L, 2L))), class = c("XY", "POLYGON", "sfg"))), .Names = "1", class = c("sfc_POLYGON", 
"sfc"), precision = 0, bbox = structure(c(-3162000, 3150000, 
-3150000, 3162000), .Names = c("xmin", "ymin", "xmax", "ymax"
), class = "bbox"), crs = structure(list(epsg = NA_integer_, 
    proj4string = "+proj=lcc +lat_1=30 +lat_2=65 +lat_0=48 +lon_0=9.75 +x_0=-6000 +y_0=-6000 +a=6371229 +b=6371229 +units=m +no_defs"), .Names = c("epsg", 
"proj4string"), class = "crs"), n_empty = 0L)), .Names = c("layer", 
"geometry"), row.names = 1L, class = c("sf", "data.frame"), sf_column = "geometry", agr = structure(NA_integer_, .Names = "layer", .Label = c("constant", 
"aggregate", "identity"), class = "factor"))
dpoint <- (st_cast(d, "POINT"))

#Now let's try to eliminate the duplicate point: 4 different ways come to mind
dpoint %>% distinct(geometry) #does nothing   <---- would be my preferred solution
st_intersection(dpoint) #Works, adds columns
st_cast(st_union(dpoint), "POINT") #Works
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) #Works

Now for some performance testing on a larger dataset which I do not attach (10k points, most of which duplicated):

library(microbenchmark)
mb <- microbenchmark(times=10,
st_intersection(dpoint),
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: milliseconds
                                                                                                        expr
                                                                                     st_intersection(dpoint)
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
        min         lq        mean     median          uq         max neval cld
 9683.35427 9777.54914 10036.55727 9975.24093 10205.31676 10667.48132    10   b
  106.32318  108.49353   115.39371  110.64525   111.65030   143.65393    10  a 
   37.90596   38.33749    38.86942   38.61979    39.36003    40.67645    10  a

And on a much larger dataset (1.4M points), for the two fastest methods:

mb <- microbenchmark(times=10,
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) 
)

Result:

Unit: seconds
                                                                                                        expr
                                                                          st_cast(st_union(dpoint), "POINT")
 st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(dpoint))
      min        lq     mean    median        uq       max neval cld
 14.26685 14.531474 15.70820 15.675282 16.283850 18.349245    10   b
  5.06904  5.166566  5.98637  5.647217  6.737356  7.637938    10  a

Is there any faster, more elegant method? Can't dplyr::distinct(geometry) be made to work?

Most helpful comment

@adrfantini distinct.sf currently drops all non geometry columns if ... isn't specified. This is inconsistent with the dplyr behavior which assumes ... is all variables if omitted.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc2 <- rbind(nc,nc)

nc2 %>% distinct()
#> Simple feature collection with 100 features and 0 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> First 10 features:
#>                          geometry
#> 1  MULTIPOLYGON (((-81.47276 3...
#> 2  MULTIPOLYGON (((-81.23989 3...
#> 3  MULTIPOLYGON (((-80.45634 3...
#> 4  MULTIPOLYGON (((-76.00897 3...
#> 5  MULTIPOLYGON (((-77.21767 3...
#> 6  MULTIPOLYGON (((-76.74506 3...
#> 7  MULTIPOLYGON (((-76.00897 3...
#> 8  MULTIPOLYGON (((-76.56251 3...
#> 9  MULTIPOLYGON (((-78.30876 3...
#> 10 MULTIPOLYGON (((-80.02567 3...

Created on 2019-02-27 by the reprex package (v0.2.1.9000)

All 17 comments

Yes, this can be made to work. It should (I guess) use st_equals to add a column with a label indicating which geometries are distinct, then use the normal distinct() to work on that column, then replace the labels with geometries.

I might be wrong, but I think st_difference(dpoint) drops duplicates.

@edzer You mean this could be what a possible sf::distinct() could do? I tested on the 10k lines dataset and st_equals() takes about 500ms. st_difference() (@tim-salabim) also works, but takes ~7.5s.

Compare that to reducing the coordinates to data.frame, using distinct and recreating the sf: that takes ~40 milliseconds.

@adrfantini yes; I was thinking of a generic distinct.sf method that works for any geometry.

That'd be a nice feature to be had. For it to make sense it should be as fast as or faster than transforming to data.frame and distincting on that tho.

For _that_, I'd look forward to a PR.

It looks like unique.data.frame simply works for sf objects; any objections against using that?

Turned out there was a distinct.sf method that ignored geometries; I now added one that does the st_equals trick. Looking forward to your benchmarks, and do compare with unique.data.frame too!

I'll be glad to. Give me a few days.

library(raster)
library(spex)
library(sf)
library(stars)
library(dplyr)
library(microbenchmark)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
sfPolygons <- spex::polygonize(raster(tif)[[1]]) %>% select(geometry)
sfPoints <- st_cast(sfPolygons, "POINT")[1:10000,]
#sfPoints has 5 features for each feature in sfPolygon
(mb <- microbenchmark(times=1,
r1 <- unique.data.frame(sfPoints),
#r2 <- distinct.sf(sfPoints),
r3 <- st_difference(sfPoints),
r4 <- st_intersection(sfPoints),
r5 <- st_cast(st_union(sfPoints), "POINT"),
r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,Y), coords=1:2, crs=st_crs(sfPoints))
))

Result:

Unit: milliseconds
                                                                                                                  expr
                                                                                     r1 <- unique.data.frame(sfPoints)
                                                                                         r3 <- st_difference(sfPoints)
                                                                                       r4 <- st_intersection(sfPoints)
                                                                            r5 <- st_cast(st_union(sfPoints), "POINT")
 r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(sfPoints))
        min         lq       mean     median         uq        max neval
   57.00043   57.00043   57.00043   57.00043   57.00043   57.00043     1
 3944.29434 3944.29434 3944.29434 3944.29434 3944.29434 3944.29434     1
 4900.58573 4900.58573 4900.58573 4900.58573 4900.58573 4900.58573     1
   63.41992   63.41992   63.41992   63.41992   63.41992   63.41992     1
   22.08911   22.08911   22.08911   22.08911   22.08911   22.08911     1

I did not run distinct.sf because Error in NextMethod() : generic function not specified.

add r7 <- distinct(sfPoints) ?

distinct(sfPoints) does not remove duplicates as far as I can tell.

Embarrassing! Now it should work; note that geometries and attributes both need to be distinct.

Different system from before, expect different numbers. I replaced distinct.sf with distinct.

Unit: milliseconds
                                                                                                                  expr
                                                                                     r1 <- unique.data.frame(sfPoints)
                                                                                              r2 <- distinct(sfPoints)
                                                                                         r3 <- st_difference(sfPoints)
                                                                                       r4 <- st_intersection(sfPoints)
                                                                            r5 <- st_cast(st_union(sfPoints), "POINT")
 r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,      Y), coords = 1:2, crs = st_crs(sfPoints))
        min         lq       mean     median         uq        max neval
   76.70377   76.70377   76.70377   76.70377   76.70377   76.70377     1
  608.87942  608.87942  608.87942  608.87942  608.87942  608.87942     1
 5520.57342 5520.57342 5520.57342 5520.57342 5520.57342 5520.57342     1
 6429.61601 6429.61601 6429.61601 6429.61601 6429.61601 6429.61601     1
   92.78926   92.78926   92.78926   92.78926   92.78926   92.78926     1
   34.16400   34.16400   34.16400   34.16400   34.16400   34.16400     1

I think the original question of this issue is now suffiiently answered: use distinct (and remove all attributes if you only want to work with geometries).

@adrfantini distinct.sf currently drops all non geometry columns if ... isn't specified. This is inconsistent with the dplyr behavior which assumes ... is all variables if omitted.

library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
nc = read_sf(system.file("shape/nc.shp", package="sf"))

nc2 <- rbind(nc,nc)

nc2 %>% distinct()
#> Simple feature collection with 100 features and 0 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> First 10 features:
#>                          geometry
#> 1  MULTIPOLYGON (((-81.47276 3...
#> 2  MULTIPOLYGON (((-81.23989 3...
#> 3  MULTIPOLYGON (((-80.45634 3...
#> 4  MULTIPOLYGON (((-76.00897 3...
#> 5  MULTIPOLYGON (((-77.21767 3...
#> 6  MULTIPOLYGON (((-76.74506 3...
#> 7  MULTIPOLYGON (((-76.00897 3...
#> 8  MULTIPOLYGON (((-76.56251 3...
#> 9  MULTIPOLYGON (((-78.30876 3...
#> 10 MULTIPOLYGON (((-80.02567 3...

Created on 2019-02-27 by the reprex package (v0.2.1.9000)

Maybe then this could be reopened

Was this page helpful?
0 / 5 - 0 ratings

Related issues

gregmacfarlane picture gregmacfarlane  路  4Comments

jmsigner picture jmsigner  路  4Comments

dpprdan picture dpprdan  路  4Comments

faridcher picture faridcher  路  4Comments

kendonB picture kendonB  路  4Comments