I used list-columns and map + unnest to merge multiple shapefiles into one sf object. But it recently doesn't work any longer (it worked in March 2019). I'm not sure if it's related to sf or tidyr.
Reprex :
library(tidyverse)
library(sf)
library(fs)
library(httr)
# some shapefiles from https://fr.actualitix.com/blog/shapefiles-des-departements-de-france.html
url <- c("https://fr.actualitix.com/blog/actgeoshap/01-Ain.zip",
"https://fr.actualitix.com/blog/actgeoshap/73-savoie.zip",
"https://fr.actualitix.com/blog/actgeoshap/74-haute-savoie.zip")
dep <- str_extract(url, "\\d{2}.*$")
list(url, dep) %>%
pwalk(~ GET(.x, write_disk(.y)))
walk(dep, unzip, junkpaths = TRUE, exdir = "shp")
res <- dir_ls("shp", glob = "*.shp") %>%
tibble(fname = .) %>%
mutate(data = map(fname, read_sf)) %>%
unnest(data) %>%
st_as_sf()
# Error: No common type for `..1$data$geometry` <sfc_POLYGON> and `..2$data$geometry` <sfc_MULTIPOLYGON>.
# Call `rlang::last_error()` to see a backtrace
rlang::last_error()
# <error>
# message: No common type for `..1$data$geometry` <sfc_ # Error in rbind.data.frame(...) :
# numbers of columns of arguments do not match> and `..2$data$geometry` <sfc_MULTIPOLYGON>.
# class: `vctrs_error_incompatible_type`
# backtrace:
# 1. fs::dir_ls("shp", glob = "*.shp")
# 16. tibble::tibble(fname = .)
# 17. dplyr::mutate(., data = map(fname, read_sf))
# 18. tidyr::unnest(., data)
# 19. vctrs::stop_incompatible_type(x, y, x_arg = x_arg, y_arg = y_arg)
# 1. vctrs:::stop_incompatible(...)
# 16. vctrs:::stop_vctrs(...)
I don't know the inner workings of vctrs... Should sf "advertise" in some way that POLYGON and MULTIPOLYGON are compatible (are they ?) ?
Any other method to easily merge multiple layers ?
Since the structures of the files are slightly different (column numbers) I can't use rbind :
res <- dir_ls("shp", glob = "*.shp") %>%
map(read_sf) %>%
do.call(rbind, .)
# Error in rbind.data.frame(...) :
# numbers of columns of arguments do not match
Thanks.
even if we force them to be all read as MULTIPOLYGON, using type = 6 in st_read(), we get
res <- dir_ls("shp", glob = "*.shp") %>%
tibble(fname = .) %>%
mutate(data = map(fname, read_sf, type = 6)) %>%
unnest(data)
# Error: No common type for `..1$data$geometry` <sfc_MULTIPOLYGON> and `..2$data$geometry` <sfc_MULTIPOLYGON>.
Removing the geometry column makes this work.
As this might indeed be a vctrs issue (or my lack of understanding it), do @lionel- or @DavisVaughan have any idea?
You need to write vec_ptype2() and vec_cast() methods for your classes, see the S3 vignettes on the vctrs website.
We're planning a drastic simplification of the way a common type is declared (ptype2 methods). In the mean time you'll need to write the double dispatch methods.
You might also need to define vec-proxy and vec-restore methods. The proxy is the raw data that vctrs will concatenate or subset, and then the restore method restores attributes based on the result. You might need to define your proxy to return a data frame if you have vectorised attributes (i.e. metadata in your attributes for each element of the S3 vector).
I've been meaning to look into sf-vctrs compatibility, sorry I didn't have time to get to it yet.
Thanks - I think I'll wait for your drastic simplification before diving into this!
Thanks for investigating...
I'll wait for an update.
In #1196, @lionel- seems to have fixed this - thanks Lionel!
It looks like it!
I'm now getting:
#> Simple feature collection with 1018 features and 19 fields
#> geometry type: GEOMETRY
#> dimension: XY
#> bbox: xmin: 834044 ymin: 6504749 xmax: 943513 ymax: 6604240
#> epsg (SRID): NA
#> proj4string: NA
With the legacy tidyr, I'm seeing these warnings indicating attributes were being lost:
#> Warning messages:
#> 1: In bind_rows_(x, .id) :
#> Vectorizing 'sfc_POLYGON' elements may not preserve their attributes
#> 2: In bind_rows_(x, .id) :
#> Vectorizing 'sfc_MULTIPOLYGON' elements may not preserve their attributes
#> 3: In bind_rows_(x, .id) :
#> Vectorizing 'sfc_POLYGON' elements may not preserve their attributes
The loss of these attributes is my working hypothesis for the different bbox value for the sf data frame produced by legacy tidyr:
#> bbox: xmin: 834044 ymin: 6445179 xmax: 1027185 ymax: 6604240
With vctrs-powered tidyr we now manipulate data frames (splitting and combining on rows) while correctly preserving attributes.
@dicorynia Note that as a stopgap you can use unnest_legacy() (with the warnings) until we get these fixes on CRAN.
Thank you both of you, on this issue and for your work on sf...
There are still two issues running this case:
POLYGON and MULTIPOLYGON we get GEOMETRY rather than the more friendly MULTIPOLYGON (sf:::rbind.sf also results in GEOMETRY)I think the first we should address, the second may be a somewhat deeper problem, related to usability.
Agreed that we should look into geometries later on, probably in several stages.
Regarding the CRS attributes, they are propagated from the first element without any check right? I see in c.sfc:
attributes(ret) = attributes(lst[[1]]) # crs
With this in vec_ptype2()
# Propagate CRS attributes from the LHS
ret = st_sfc(crs = st_crs(x))
We now have:
#> Simple feature collection with 1018 features and 19 fields
#> geometry type: GEOMETRY
#> dimension: XY
#> bbox: xmin: 834044 ymin: 6504749 xmax: 943513 ymax: 6604240
#> epsg (SRID): 2154
#> proj4string: +proj=lcc +lat_1=49 +lat_2=44 +lat_0=46.5 +lon_0=3 +x_0=700000 +y_0=6600000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
If it doesn't make sense to always propagate the CRS of the first element, other possible strategies are:
Regarding precision, what should be the combination strategy? Maybe take the largest precision?
Just a remark - all code relating to crs is in flux because of changes in external software that sf builds on. The assumptions made when the crs object was defined no longer hold with certainty https://github.com/r-spatial/sf/issues/1187. Experiments are progressing in sp/rgdal, but extension to sf will folllow. So crs handling will need to be marked and possibly revisited before March 2020. Probably clearing crs should clear sf too, because geometries without correct metadata are not much use to anyone, and as of now, we do not know how to test whether crs are identical in WKT2_2018 representation.
Thanks @rsbivand - this will not affect the work here, since we are only assigning / copying entire CRS objects, possibly checking they are identical, which is handled by other code (the == method for crs).
It was the == method I was worried about, as I fear we'll have to importFromWkt() both, and then do the comparison inside GDAL. I vaguely recall code in pyproj or elsewhere, but still experimental I think.
@lionel- I think vec_ptype2() should do this:
if (st_crs(x) != st_crs(y))
stop("coordinate reference systems not equal: use st_transform() first?")
# Propagate CRS attributes from the LHS
ret = st_sfc(crs = st_crs(x))
@rsbivand that == or != does the right thing here is important, indeed, but a different issue.
@edzer oops I now see your message, I'll update the error messages to use your wording.
by merging POLYGON and MULTIPOLYGON we get GEOMETRY rather than the more friendly MULTIPOLYGON (sf:::rbind.sf also results in GEOMETRY)
Is there a description or implementation of the combination strategy you're looking for? AFAICS GEOMETRY is consistently implemented as the common type for all combinations of incongruent geometries:
class(st_sfc(st_point(), st_multipoint()))
#> [1] "sfc_GEOMETRY" "sfc"
class(st_sfc(st_polygon(), st_multipolygon()))
#> [1] "sfc_GEOMETRY" "sfc"
Also, should the sfc geometry be set to MULTIPOLYGON even though it contains POLYGON geometries? Or should the individual geometries be upcoerced to MULTIPOLYGON? But surely changing the individual types is potentially problematic. Can a multipolygon of size 1 considered equal to the equivalent polygon (as I understand yes, through the notion of spatial equality)?
It is a more generic usability thing, although I brought it up I would also leave it for here/now.
Users can anticipate this by specifying type=6 to st_read, which causes POLYGON geometries to be read as (single polygon) MULTIPOLYGONs, this then would not lead to GEOMETRY at the end; or alternatively use st_cast(. ,"MULTIPOLYGON") at the end of the pipe.
Most helpful comment
Thanks - I think I'll wait for your drastic simplification before diving into this!