Since st_join handles spatial joins, can we remove the restriction on dplyr joins that requires the y argument to be a data.frame?
library(tidyverse)
library(sf)
demo(nc, ask = FALSE, verbose = FALSE)
nc_new <- nc %>% select(NAME) %>% mutate(VAR1 = sample(LETTERS[1:3], n(), replace = TRUE))
left_join(nc, nc_new)
#> Error in check_join(x, y): y should be a data.frame; no spatial joins supported yetFALSE
If that restriction is still important, then perhaps this provides further motivation for the request for a convenience function to access non-spatial columns within a sf (#229).
Regarding importance: can you come up with a use case where you'd want to join on both attributes _and_ geometries? What would the spatial predicate be to join on (intersects, equals)? We could integrate st_join in the dplyr *_join functions, but wouldn't that confuse users?
Regarding convenience:
We already have several other replacement functions in two forms, like
st_precision(x) = 1000
x %>% st_set_precision(1000)
So as a convenience function for dropping geometry I suggest either
nc_new <- nc %>% st_set_geometry(NULL)
or its alias
nc_new <- nc %>% st_drop_geometry()
Still, I feel only a marginal gain over
nc %>% data.frame() %>% select(-geom)
Any other ideas?
There may be scenarios where joining by attributes and geoms is desirable, but I haven't been able to come up with an example in my work.
My personal preference is for the *_join functions to not interact with the geometry information at all - let st_join handle those operations.
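For comparison, here is a minimal sketch of what a purely spatial join through st_join looks like, with the predicate passed explicitly via the `join` argument. The toy objects (`polys`, `pts`) and their columns (`zone`, `id`) are made up for illustration and are not part of the nc data.

```r
library(sf)

# Two adjacent unit squares (hypothetical toy data)
polys <- st_sf(
  zone = c("west", "east"),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
  )
)

# Three points: one inside each square, one outside both
pts <- st_sf(
  id = 1:3,
  geometry = st_sfc(
    st_point(c(0.5, 0.5)),
    st_point(c(1.5, 0.5)),
    st_point(c(5, 5))
  )
)

# Spatial left join: each point picks up the attributes of the polygon it intersects;
# points that match no polygon are kept with NA attributes
joined <- st_join(pts, polys, join = st_intersects)
```

Here the geometry alone decides which rows match, which is the kind of operation I'd leave to st_join rather than to the *_join verbs.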
Here's an example where I have a set of sf objects that I'd like to safely combine using an attribute:
library(tidyverse)
library(sf)
library(purrr)
demo(nc, ask = FALSE, verbose = FALSE)
set.seed(1)
nc_1 <- nc %>% sample_frac(0.75) %>% select(NAME, BIR74)
nc_2 <- nc %>% sample_frac(0.75) %>% select(NAME, SID74)
nc_3 <- nc %>% sample_frac(0.75) %>% select(NAME, BIR79)
nc_4 <- nc %>% sample_frac(0.75) %>% select(NAME, SID79)
list(nc_1, nc_2, nc_3, nc_4) %>% reduce(left_join, by = "NAME")
#> Error in check_join(x, y): y should be a data.frame; no spatial joins supported yetFALSE
Right now I would have to add a conversion step to drop the geometries, which is annoying but not a deal-breaker:
list(nc_2, nc_3, nc_4) %>%
map(~ select(data.frame(.x), -geom)) %>%
prepend(list(nc_1)) %>%
reduce(left_join, by = "NAME")
#> Simple feature collection with 75 features and 5 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 34.06203 xmax: -75.7637 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> First 20 features:
#> NAME BIR74 SID74 BIR79 SID79 geom
#> 1 Alamance 4672 13 NA 11 MULTIPOLYGON(((-79.24619293...
#> 2 Wake 14484 16 NA 31 MULTIPOLYGON(((-78.92107391...
#> 3 Beaufort 2692 7 NA 4 MULTIPOLYGON(((-77.10376739...
#> 4 Richmond 2756 4 NA 7 MULTIPOLYGON(((-79.68595886...
#> 5 Perquimans 484 1 676 NA MULTIPOLYGON(((-76.48052978...
#> 6 Hoke 1494 7 NA NA MULTIPOLYGON(((-79.34030151...
#> 7 Pender 1228 NA 1602 3 MULTIPOLYGON(((-78.02592468...
#> 8 Wayne 6638 18 8227 23 MULTIPOLYGON(((-78.16319274...
#> 9 Swain 675 3 883 2 MULTIPOLYGON(((-83.33181762...
#> 10 Hertford 1452 7 1838 5 MULTIPOLYGON(((-76.74506378...
#> 11 Watauga 1323 1 NA NA MULTIPOLYGON(((-81.80622100...
#> 12 Halifax 3608 NA 4463 17 MULTIPOLYGON(((-77.33220672...
#> 13 Rutherford 2992 12 3543 NA MULTIPOLYGON(((-81.97144317...
#> 14 Caldwell 3609 6 4249 9 MULTIPOLYGON(((-81.32813262...
#> 15 Moore 2648 5 3534 5 MULTIPOLYGON(((-79.60746765...
#> 16 Burke 3573 5 4314 15 MULTIPOLYGON(((-81.81628417...
#> 17 Duplin 2483 4 2777 7 MULTIPOLYGON(((-77.68983459...
#> 18 Jones 578 1 650 2 MULTIPOLYGON(((-77.04900360...
#> 19 Mitchell 671 0 919 NA MULTIPOLYGON(((-82.11885070...
#> 20 Harnett 3776 6 NA 10 MULTIPOLYGON(((-78.61273956...
st_drop_geometry would simplify that workflow slightly, but my preference would still be for *_join verbs to only join by attributes.
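As a rough sketch of that simplified workflow, assuming a version of sf in which st_set_geometry(NULL) is available as suggested above: the geometry is dropped from the right-hand table first, so left_join only sees a plain data.frame as `y`. Reading nc from the gpkg file shipped with sf (instead of `demo(nc)`) is just a convenience here.

```r
library(sf)
library(dplyr)

nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"), quiet = TRUE)

# Drop the geometry column, leaving an ordinary data.frame of attributes
nc_new <- nc %>%
  st_set_geometry(NULL) %>%
  select(NAME) %>%
  mutate(VAR1 = sample(LETTERS[1:3], n(), replace = TRUE))

# x is an sf object, y is a plain data.frame: an attribute-only join
joined <- left_join(nc, nc_new, by = "NAME")
```

The result keeps the geometry of `nc` and gains the `VAR1` column from `nc_new`.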
The problem is that after left_join by NAME, we end up with a data.frame with two geometries, geom.x and geom.y. What sf could do here is return a data.frame, and let you sort out the mess.
Or can we be certain that we want an sf with geometry geom.x, since this is a left_join? If so, what to do for all other *_joins?
I've given this some thought but haven't been able to come up with an approach that is reasonable for all *_join verbs. It'd be good for other sf users to weigh in on this.
How are sf objects with multiple sfc's currently handled? Return a data.frame (or tibble) with a warning that multiple spatial columns are present?
Regarding multiple geometry columns, see also #183; the sf api currently assumes single geometry columns: when you engineer your multiple geometry column objects, sf won't help you manage them.
I like your idea for an "active" sf column.
Applying that concept to the issue of *_joins between sfs, would an acceptable default be to set the first *_join argument's sfc as the "active" column and prompt the user with a message?
Then users could make adjustments down-pipe like so:
left_join(nc, nc_new) %>%
st_sf_column(geom.y)
My gut feeling: left_join should take the first, right_join the second, and for full_join you'd pray it doesn't matter?
Agreed on left_join and right_join!
For inner_join and full_join, a reasonable default might be to keep both sfcs, designate the first as "active", and alert the user with a message. If the user wants to activate another sfc they can do so with st_sf_column, or they could create a new column that appropriately combines the sfcs and activate that one.
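To make the "active" column idea concrete, here is a small sketch with an sf object that carries two geometry columns. It does not use the hypothetical st_sf_column() from my earlier comment; instead it assumes that st_set_geometry() accepts the name of an existing geometry column and makes it the active one. The object `pts` and the column names `geom_a` and `geom_b` are invented for the example.

```r
library(sf)

# An sf object with two sfc columns (hypothetical toy data);
# geom_a is named explicitly as the active geometry column
pts <- st_sf(
  id = 1:2,
  geom_a = st_sfc(st_point(c(0, 0)), st_point(c(1, 1))),
  geom_b = st_sfc(st_point(c(5, 5)), st_point(c(6, 6))),
  sf_column_name = "geom_a"
)

# Name of the currently active geometry column
active_before <- attr(pts, "sf_column")

# Switch the active geometry to the other column by name
switched <- st_set_geometry(pts, "geom_b")
active_after <- attr(switched, "sf_column")
```

Both geometry columns stay in the object; only the one that sf treats as the geometry changes.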
+1 for this feature. I am stuck with the full_join of two sf objects that are identical (same shape, column names and types). I have tried bind_rows and postgis queries, all without success so far.
@tiernanmartin Thanks for sharing the snippet above for the left_join. How would you adapt it to concatenate two identical sf objects (bind rows of object 1 into object 2)? Any help would be appreciated!
@aTnT rbind() seems to work:
library(tidyverse)
library(sf)
library(purrr)
nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"))
# make two sf objects with identical sfcs
nc_A <- nc %>% mutate(GRP = "A") %>% select(NAME, GRP) %>% slice(1:5)
nc_B <- nc %>% mutate(GRP = "B") %>% select(NAME, GRP) %>% slice(1:5)
# have a look
list(nc_A, nc_B)
#> [[1]]
#> Simple feature collection with 5 features and 2 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> NAME GRP geom
#> 1 Ashe A MULTIPOLYGON(((-81.47275543...
#> 2 Alleghany A MULTIPOLYGON(((-81.23989105...
#> 3 Surry A MULTIPOLYGON(((-80.45634460...
#> 4 Currituck A MULTIPOLYGON(((-76.00897216...
#> 5 Northampton A MULTIPOLYGON(((-77.21766662...
#>
#> [[2]]
#> Simple feature collection with 5 features and 2 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> NAME GRP geom
#> 1 Ashe B MULTIPOLYGON(((-81.47275543...
#> 2 Alleghany B MULTIPOLYGON(((-81.23989105...
#> 3 Surry B MULTIPOLYGON(((-80.45634460...
#> 4 Currituck B MULTIPOLYGON(((-76.00897216...
#> 5 Northampton B MULTIPOLYGON(((-77.21766662...
# rowbind
rbind(nc_A, nc_B) %>% arrange(NAME)
#> Simple feature collection with 10 features and 2 fields
#> geometry type: MULTIPOLYGON
#> dimension: XY
#> bbox: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
#> epsg (SRID): 4267
#> proj4string: +proj=longlat +datum=NAD27 +no_defs
#> NAME GRP geom
#> 1 Alleghany A MULTIPOLYGON(((-81.23989105...
#> 2 Alleghany B MULTIPOLYGON(((-81.23989105...
#> 3 Ashe A MULTIPOLYGON(((-81.47275543...
#> 4 Ashe B MULTIPOLYGON(((-81.47275543...
#> 5 Currituck A MULTIPOLYGON(((-76.00897216...
#> 6 Currituck B MULTIPOLYGON(((-76.00897216...
#> 7 Northampton A MULTIPOLYGON(((-77.21766662...
#> 8 Northampton B MULTIPOLYGON(((-77.21766662...
#> 9 Surry A MULTIPOLYGON(((-80.45634460...
#> 10 Surry B MULTIPOLYGON(((-80.45634460...
Using dplyr::bind_rows() instead of rbind() fails with "Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index", so it would be nice if that could be resolved. But in the meantime rbind seems adequate - or perhaps I'm misunderstanding your situation?
@tiernanmartin Thanks for helping, your solution with rbind works and is exactly what I was looking for! For the record, I also get the same error with bind_rows.
See #50