I'm trying to remove duplicate geometries, in this case points.
There are several ways to do so: my first idea was to use dplyr::distinct(), but it does not seem to work for geometry columns.
Some examples below:
#Create example dataset
library(sf)
library(dplyr)
d <- structure(list(layer = 274.146911621094, geometry = structure(list(
`1` = structure(list(structure(c(-3162000, -3150000, -3150000,
-3162000, -3162000, 3162000, 3162000, 3150000, 3150000, 3162000
), .Dim = c(5L, 2L))), class = c("XY", "POLYGON", "sfg"))), .Names = "1", class = c("sfc_POLYGON",
"sfc"), precision = 0, bbox = structure(c(-3162000, 3150000,
-3150000, 3162000), .Names = c("xmin", "ymin", "xmax", "ymax"
), class = "bbox"), crs = structure(list(epsg = NA_integer_,
proj4string = "+proj=lcc +lat_1=30 +lat_2=65 +lat_0=48 +lon_0=9.75 +x_0=-6000 +y_0=-6000 +a=6371229 +b=6371229 +units=m +no_defs"), .Names = c("epsg",
"proj4string"), class = "crs"), n_empty = 0L)), .Names = c("layer",
"geometry"), row.names = 1L, class = c("sf", "data.frame"), sf_column = "geometry", agr = structure(NA_integer_, .Names = "layer", .Label = c("constant",
"aggregate", "identity"), class = "factor"))
dpoint <- (st_cast(d, "POINT"))
#Now let's try to eliminate the duplicate point: 4 different ways come to mind
dpoint %>% distinct(geometry) #does nothing <---- would be my preferred solution
st_intersection(dpoint) #Works, adds columns
st_cast(st_union(dpoint), "POINT") #Works
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint)) #Works
Now for some performance testing on a larger dataset which I do not attach (10k points, most of which duplicated):
library(microbenchmark)
mb <- microbenchmark(times=10,
st_intersection(dpoint),
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint))
)
Result:
Unit: milliseconds
expr
st_intersection(dpoint)
st_cast(st_union(dpoint), "POINT")
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(dpoint))
min lq mean median uq max neval cld
9683.35427 9777.54914 10036.55727 9975.24093 10205.31676 10667.48132 10 b
106.32318 108.49353 115.39371 110.64525 111.65030 143.65393 10 a
37.90596 38.33749 38.86942 38.61979 39.36003 40.67645 10 a
And on a much larger dataset (1.4M points), for the two fastest methods:
mb <- microbenchmark(times=10,
st_cast(st_union(dpoint), "POINT"),
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X,Y), coords=1:2, crs=st_crs(dpoint))
)
Result:
Unit: seconds
expr
st_cast(st_union(dpoint), "POINT")
st_as_sf(as.data.frame(st_coordinates(dpoint)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(dpoint))
min lq mean median uq max neval cld
14.26685 14.531474 15.70820 15.675282 16.283850 18.349245 10 b
5.06904 5.166566 5.98637 5.647217 6.737356 7.637938 10 a
Is there any faster, more elegant method? Can't dplyr::distinct(geometry) be made to work?
Yes, this can be made to work. It should (I guess) use st_equals to add a column with a label indicating which geometries are distinct, then use the normal distinct() to work on that column, then replace the labels with geometries.
I might be wrong, but I think st_difference(dpoint) drops duplicates.
@edzer You mean this could be what a possible sf::distinct() could do? I tested on the 10k lines dataset and st_equals() takes about 500ms. st_difference() (@tim-salabim) also works, but takes ~7.5s.
Compare that to reducing the coordinates to data.frame, using distinct and recreating the sf: that takes ~40 milliseconds.
@adrfantini yes; I was thinking of a generic distinct.sf method that works for any geometry.
That'd be a nice feature to be had. For it to make sense it should be as fast as or faster than transforming to data.frame and distincting on that tho.
For _that_, I'd look forward to a PR.
It looks like unique.data.frame simply works for sf objects; any objections against using that?
Turned out there was a distinct.sf method that ignored geometries; I now added one that does the st_equals trick. Looking forward to your benchmarks, and do compare with unique.data.frame too!
I'll be glad to. Give me a few days.
library(raster)
library(spex)
library(sf)
library(stars)
library(dplyr)
library(microbenchmark)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
sfPolygons <- spex::polygonize(raster(tif)[[1]]) %>% select(geometry)
sfPoints <- st_cast(sfPolygons, "POINT")[1:10000,]
#sfPoints has 5 features for each feature in sfPolygon
(mb <- microbenchmark(times=1,
r1 <- unique.data.frame(sfPoints),
#r2 <- distinct.sf(sfPoints),
r3 <- st_difference(sfPoints),
r4 <- st_intersection(sfPoints),
r5 <- st_cast(st_union(sfPoints), "POINT"),
r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X,Y), coords=1:2, crs=st_crs(sfPoints))
))
Result:
Unit: milliseconds
expr
r1 <- unique.data.frame(sfPoints)
r3 <- st_difference(sfPoints)
r4 <- st_intersection(sfPoints)
r5 <- st_cast(st_union(sfPoints), "POINT")
r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(sfPoints))
min lq mean median uq max neval
57.00043 57.00043 57.00043 57.00043 57.00043 57.00043 1
3944.29434 3944.29434 3944.29434 3944.29434 3944.29434 3944.29434 1
4900.58573 4900.58573 4900.58573 4900.58573 4900.58573 4900.58573 1
63.41992 63.41992 63.41992 63.41992 63.41992 63.41992 1
22.08911 22.08911 22.08911 22.08911 22.08911 22.08911 1
I did not run distinct.sf because Error in NextMethod() : generic function not specified.
add r7 <- distinct(sfPoints) ?
distinct(sfPoints) does not remove duplicates as far as I can tell.
Embarrassing! Now it should work; note that geometries and attributes both need to be distinct.
Different system from before, expect different numbers. I replaced distinct.sf with distinct.
Unit: milliseconds
expr
r1 <- unique.data.frame(sfPoints)
r2 <- distinct(sfPoints)
r3 <- st_difference(sfPoints)
r4 <- st_intersection(sfPoints)
r5 <- st_cast(st_union(sfPoints), "POINT")
r6 <- st_as_sf(as.data.frame(st_coordinates(sfPoints)) %>% distinct(X, Y), coords = 1:2, crs = st_crs(sfPoints))
min lq mean median uq max neval
76.70377 76.70377 76.70377 76.70377 76.70377 76.70377 1
608.87942 608.87942 608.87942 608.87942 608.87942 608.87942 1
5520.57342 5520.57342 5520.57342 5520.57342 5520.57342 5520.57342 1
6429.61601 6429.61601 6429.61601 6429.61601 6429.61601 6429.61601 1
92.78926 92.78926 92.78926 92.78926 92.78926 92.78926 1
34.16400 34.16400 34.16400 34.16400 34.16400 34.16400 1
I think the original question of this issue is now suffiiently answered: use distinct (and remove all attributes if you only want to work with geometries).
@adrfantini distinct.sf currently drops all non geometry columns if ... isn't specified. This is inconsistent with the dplyr behavior which assumes ... is all variables if omitted.
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
nc = read_sf(system.file("shape/nc.shp", package="sf"))
nc2 <- rbind(nc,nc)
nc2 %>% distinct()
#> Simple feature collection with 100 features and 0 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> First 10 features:
#> geometry
#> 1 MULTIPOLYGON (((-81.47276 3...
#> 2 MULTIPOLYGON (((-81.23989 3...
#> 3 MULTIPOLYGON (((-80.45634 3...
#> 4 MULTIPOLYGON (((-76.00897 3...
#> 5 MULTIPOLYGON (((-77.21767 3...
#> 6 MULTIPOLYGON (((-76.74506 3...
#> 7 MULTIPOLYGON (((-76.00897 3...
#> 8 MULTIPOLYGON (((-76.56251 3...
#> 9 MULTIPOLYGON (((-78.30876 3...
#> 10 MULTIPOLYGON (((-80.02567 3...
Created on 2019-02-27 by the reprex package (v0.2.1.9000)
Maybe then this could be reopened
Most helpful comment
@adrfantini
distinct.sfcurrently drops all non geometry columns if...isn't specified. This is inconsistent with the dplyr behavior which assumes...is all variables if omitted.Created on 2019-02-27 by the reprex package (v0.2.1.9000)