Sf: Non-spatial join for multiple sf objects

Created on 28 Feb 2017 · 13 Comments · Source: r-spatial/sf

Since st_join handles spatial joins, can we remove the restriction on dplyr joins that requires the y argument be a data.frame ?

library(tidyverse)
library(sf)
demo(nc, ask = FALSE, verbose = FALSE)

nc_new <- nc %>% select(NAME) %>% mutate(VAR1 = sample(LETTERS[1:3], n(), replace = TRUE))

left_join(nc, nc_new)
#> Error in check_join(x, y): y should be a data.frame; no spatial joins supported yetFALSE

If that restriction is still important, then perhaps this provides further motivation for the request for a convenience function to access non-spatial columns within a sf (#229).

All 13 comments

Regarding importance: can you come up with a use case where you would want to join on both attributes _and_ geometries? What would the spatial predicate be to join on (intersects, equals)? We could integrate st_join in the dplyr *_join functions, but wouldn't that confuse users?

Regarding convenience:
We already have several other replacement functions in two forms, like

st_precision(x) = 1000
x %>% st_set_precision(1000)

So as a convenience function for dropping geometry I suggest either

nc_new <- nc %>% st_set_geometry(NULL)

or its alias

nc_new <- nc %>% st_drop_geometry()

Still, I feel only a marginal gain over

nc %>% data.frame() %>% select(-geom)

any other ideas?

There may be scenarios where joining by attributes and geoms is desirable, but I haven't been able to come up with an example in my work.

My personal preference is for the *_join functions to not interact with the geometry information at all - let st_join handle those operations.
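For contrast, here is a minimal sketch of the two kinds of join being discussed, using the nc dataset shipped with sf. The projection (EPSG:32119) and the object names (pts, attrs, by_space, by_key) are illustrative assumptions, not something proposed in this thread.

```r
library(sf)
library(dplyr)

# Projected CRS (an assumption for this sketch) so planar predicates apply
nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"), quiet = TRUE) %>%
  st_transform(32119)

# Spatial join: rows are matched by a geometric predicate
pts <- st_sf(id = seq_len(nrow(nc)), geometry = st_centroid(st_geometry(nc)))
by_space <- st_join(pts, nc["NAME"], join = st_intersects)

# Attribute join: rows are matched by a key column; y carries no geometry
attrs <- st_drop_geometry(nc)[, c("NAME", "BIR74")]
by_key <- left_join(nc["NAME"], attrs, by = "NAME")
```

Here by_space keeps the point geometries of pts, while by_key keeps the polygons of the left-hand object.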

Here's an example where I have a set of sf objects that I'd like to safely combine using an attribute:

library(tidyverse)
library(sf)
library(purrr)

demo(nc, ask = FALSE, verbose = FALSE)

set.seed(1)

nc_1 <- nc %>% sample_frac(0.75) %>% select(NAME, BIR74)
nc_2 <- nc %>% sample_frac(0.75) %>% select(NAME, SID74)
nc_3 <- nc %>% sample_frac(0.75) %>% select(NAME, BIR79)
nc_4 <- nc %>% sample_frac(0.75) %>% select(NAME, SID79)

list(nc_1, nc_2, nc_3, nc_4) %>% reduce(left_join, by = "NAME")
#> Error in check_join(x, y): y should be a data.frame; no spatial joins supported yetFALSE

Right now I would have to add a conversion step to drop the geometries, which is annoying but not a deal-breaker:

list(nc_2, nc_3, nc_4) %>% 
  map(~ select(data.frame(.x), -geom)) %>% 
  prepend(list(nc_1)) %>% 
  reduce(left_join, by = "NAME")

#> Simple feature collection with 75 features and 5 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 34.06203 xmax: -75.7637 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> First 20 features:
#>          NAME BIR74 SID74 BIR79 SID79                           geom
#> 1    Alamance  4672    13    NA    11 MULTIPOLYGON(((-79.24619293...
#> 2        Wake 14484    16    NA    31 MULTIPOLYGON(((-78.92107391...
#> 3    Beaufort  2692     7    NA     4 MULTIPOLYGON(((-77.10376739...
#> 4    Richmond  2756     4    NA     7 MULTIPOLYGON(((-79.68595886...
#> 5  Perquimans   484     1   676    NA MULTIPOLYGON(((-76.48052978...
#> 6        Hoke  1494     7    NA    NA MULTIPOLYGON(((-79.34030151...
#> 7      Pender  1228    NA  1602     3 MULTIPOLYGON(((-78.02592468...
#> 8       Wayne  6638    18  8227    23 MULTIPOLYGON(((-78.16319274...
#> 9       Swain   675     3   883     2 MULTIPOLYGON(((-83.33181762...
#> 10   Hertford  1452     7  1838     5 MULTIPOLYGON(((-76.74506378...
#> 11    Watauga  1323     1    NA    NA MULTIPOLYGON(((-81.80622100...
#> 12    Halifax  3608    NA  4463    17 MULTIPOLYGON(((-77.33220672...
#> 13 Rutherford  2992    12  3543    NA MULTIPOLYGON(((-81.97144317...
#> 14   Caldwell  3609     6  4249     9 MULTIPOLYGON(((-81.32813262...
#> 15      Moore  2648     5  3534     5 MULTIPOLYGON(((-79.60746765...
#> 16      Burke  3573     5  4314    15 MULTIPOLYGON(((-81.81628417...
#> 17     Duplin  2483     4  2777     7 MULTIPOLYGON(((-77.68983459...
#> 18      Jones   578     1   650     2 MULTIPOLYGON(((-77.04900360...
#> 19   Mitchell   671     0   919    NA MULTIPOLYGON(((-82.11885070...
#> 20    Harnett  3776     6    NA    10 MULTIPOLYGON(((-78.61273956...

st_drop_geometry would simplify that workflow slightly, but my preference would still be for *_join verbs to only join by attributes.
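A sketch of that workflow with st_drop_geometry(), which is available in current sf releases. The row ranges used to build the pieces are an arbitrary stand-in for the random samples above, chosen so the example is deterministic.

```r
library(sf)
library(dplyr)
library(purrr)

nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"), quiet = TRUE)

# Overlapping subsets, each with one attribute of interest (illustrative)
nc_1 <- nc %>% slice(1:75)   %>% select(NAME, BIR74)
nc_2 <- nc %>% slice(20:90)  %>% select(NAME, SID74)
nc_3 <- nc %>% slice(30:100) %>% select(NAME, BIR79)

# Keep the geometry of the first object only; the rest become plain data frames
rest <- map(list(nc_2, nc_3), st_drop_geometry)

# left_join() on an sf x and a data.frame y returns an sf object
joined <- reduce(c(list(nc_1), rest), left_join, by = "NAME")
```

Counties missing from nc_2 or nc_3 end up with NA in the corresponding column, and the result keeps the geometry of nc_1.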

The problem is that after left_join by NAME, we end up with a data.frame with two geometries, geom.x and geom.y. What sf could do here is return a data.frame, and let you sort out the mess.

Or can we be certain that we want an sf with geometry geom.x, since this is a left_join? If so, what to do for all other *_joins?

I've given this some thought but haven't been able to come up with an approach that is reasonable for all *_join verbs. It'd be good for other sf users to weigh in on this.

How are sf objects with multiple sfc's currently handled? Return a data.frame (or tibble) with a warning that multiple spatial columns are present?

Regarding multiple geometry columns, see also #183; the sf api currently assumes single geometry columns: when you engineer your multiple geometry column objects, sf won't help you manage them.

I like your idea for an "active" sf column.

Applying that concept to the issue of *_joins between sfs, would an acceptable default be to set the first *_join argument's sfc as the "active" column and prompt the user with a message?

Then users could make adjustments down-pipe like so:

left_join(nc, nc_new) %>%
  st_sf_column(geom.y)
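Note that st_sf_column() above is a proposed name, not an existing sf function. In released sf, the active geometry column of an object with several sfc columns can be switched with st_set_geometry() (or st_geometry<-). A minimal sketch, in which the second column is built by hand rather than by a join and the name "centroid" is purely illustrative:

```r
library(sf)

nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"), quiet = TRUE)

# Build an object with two geometry columns, similar to what a join of two
# sf objects would leave behind ("centroid" is an illustrative column name)
nc_two <- nc["NAME"]
nc_two$centroid <- suppressWarnings(st_centroid(st_geometry(nc)))

# The original polygon column is still the active one
attr(nc_two, "sf_column")

# Make the second column active; the first stays as an ordinary list-column
by_centroid <- st_set_geometry(nc_two, "centroid")
attr(by_centroid, "sf_column")
```

Operations such as plot() or st_join() then work on whichever column is active.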

My gut feeling: left_join should take the first, right_join the second, and for full_join you'd pray it doesn't matter?

Agreed on left_join and right_join!

For inner_join and full_join, a reasonable default might be to keep both sfcs, designate the first as "active", and alert the user with a message. If the user wants to activate another sfc they can do so with st_sf_column, or they could create a new column that appropriately combines the sfcs and activate that one.

+1 for this feature. I am stuck with the full_join of two sf objects that are identical (same shape, column names and types). I have tried bind_rows and postgis queries, but all unsuccessful so far.

@tiernanmartin Thanks for sharing the snippet above for the left_join. How would you adapt it to concatenate two identical sf objects (bind rows of object 1 into object 2)? Any help would be appreciated!

@aTnT rbind() seems to work:

library(tidyverse) 
library(sf) 
library(purrr)

nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"))

# make two sf objects with identical sfcs
nc_A <- nc %>% mutate(GRP = "A") %>% select(NAME, GRP) %>% slice(1:5)
nc_B <- nc %>% mutate(GRP = "B") %>% select(NAME, GRP) %>% slice(1:5)

# have a look
list(nc_A, nc_B)
#> [[1]]
#> Simple feature collection with 5 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#>          NAME GRP                           geom
#> 1        Ashe   A MULTIPOLYGON(((-81.47275543...
#> 2   Alleghany   A MULTIPOLYGON(((-81.23989105...
#> 3       Surry   A MULTIPOLYGON(((-80.45634460...
#> 4   Currituck   A MULTIPOLYGON(((-76.00897216...
#> 5 Northampton   A MULTIPOLYGON(((-77.21766662...
#> 
#> [[2]]
#> Simple feature collection with 5 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#>          NAME GRP                           geom
#> 1        Ashe   B MULTIPOLYGON(((-81.47275543...
#> 2   Alleghany   B MULTIPOLYGON(((-81.23989105...
#> 3       Surry   B MULTIPOLYGON(((-80.45634460...
#> 4   Currituck   B MULTIPOLYGON(((-76.00897216...
#> 5 Northampton   B MULTIPOLYGON(((-77.21766662...

# rowbind
rbind(nc_A, nc_B) %>% arrange(NAME)
#> Simple feature collection with 10 features and 2 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#>           NAME GRP                           geom
#> 1    Alleghany   A MULTIPOLYGON(((-81.23989105...
#> 2    Alleghany   B MULTIPOLYGON(((-81.23989105...
#> 3         Ashe   A MULTIPOLYGON(((-81.47275543...
#> 4         Ashe   B MULTIPOLYGON(((-81.47275543...
#> 5    Currituck   A MULTIPOLYGON(((-76.00897216...
#> 6    Currituck   B MULTIPOLYGON(((-76.00897216...
#> 7  Northampton   A MULTIPOLYGON(((-77.21766662...
#> 8  Northampton   B MULTIPOLYGON(((-77.21766662...
#> 9        Surry   A MULTIPOLYGON(((-80.45634460...
#> 10       Surry   B MULTIPOLYGON(((-80.45634460...

Using dplyr::bind_rows() instead of rbind() fails with the error "Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index", so it would be nice if that could be resolved. In the meantime rbind() seems adequate, or perhaps I'm misunderstanding your situation?
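If there are more than two objects, the same approach extends to a list via do.call(). A small sketch, assuming all pieces share the same columns and CRS; make_piece() is a hypothetical helper written only for this example.

```r
library(sf)

nc <- st_read(system.file("gpkg/nc.gpkg", package = "sf"), quiet = TRUE)

# Hypothetical helper: take some rows of nc and tag them with a group label
make_piece <- function(rows, grp) {
  piece <- nc[rows, "NAME"]
  piece$GRP <- grp
  piece
}

pieces <- list(
  make_piece(1:5, "A"),
  make_piece(6:10, "B"),
  make_piece(11:15, "C")
)

# rbind() has an sf method, so the result is again an sf object
combined <- do.call(rbind, pieces)
```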

@tiernanmartin Thanks for helping, your solution with rbind works, and is exactly what I was looking for! For the record, I also get the same error with bind_rows.

See #50
