Filing this issue after a brief discussion in r-spatial/mapedit#46 and a bit more research.
Background: split/map/combine patterns are very useful for sf objects too, and I couldn't find a fast, officially recommended way of recombining a list of sf objects into a single sf data frame.
The three approaches I've found in other issues/repos were:
1. purrr::reduce(list_of_sf, sf:::rbind.sf), which is the slowest (for N elements it creates N-1 intermediate objects that are progressively rbind-ed together);
2. do.call(what = sf:::rbind.sf, args = list_of_sf), which is fast for small lists but slows down past a few thousand elements / moderate data sizes (details below); and
3. mapedit:::combine_list_of_sf(list_of_sf), which scales quite well by using dplyr::bind_rows under the hood and handling geometry and attributes separately.
I profiled the three approaches in this gist, confirming the above. A minimal sketch of the pattern in question follows.
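For context, here is a minimal sketch of the split/map/combine pattern this is about, assuming the nc demo data shipped with sf (the st_buffer step is just an arbitrary stand-in for per-group work):

library(sf)
library(dplyr)
library(purrr)

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))

# split by county name, apply an arbitrary per-group operation,
# then face the recombination problem this issue is about
list_of_sf <- nc %>%
  split(.$NAME) %>%
  map(~ st_buffer(.x, dist = 0.1))

combined <- do.call(what = sf:::rbind.sf, args = list_of_sf)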
Performance over 1,000 list elements:

(benchmark plot; see the linked gist)

Performance over 10,000 list elements (dropping the purrr::reduce strategy to run in finite time):

(benchmark plot; see the linked gist)
@edzer's comments in the other issue:
- iirc, bind_rows cannot be extended for sf objects because its first argument is ... and it is an S3 generic, unlike rbind and cbind which use a different method dispatch mechanism
- given that bind_rows is easily a factor 20 faster than rbind.sf, I think it makes sense to add a bind_rows_sf to sf.
As always, thanks for sf!
Note that bind_rows not working with sf objects causes things like purrr::map_dfr to fail, which can be quite annoying. I'm starting to love purrr::map functions and their parallel application via furrr.
For completeness, I have successfully used sf::st_as_sf(data.table::rbindlist(<list_of_sf>)) though I have not benchmarked it. Would be good to see how this approach compares to the ones benchmarked above.
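For illustration, a hedged sketch of the substitution I mean: purrr::map() plus the data.table recombination in place of purrr::map_dfr() (reusing list_of_sf from above; st_buffer is again just a stand-in):

library(purrr)

# map_dfr(list_of_sf, ...) fails because it calls dplyr::bind_rows();
# mapping and recombining explicitly works:
result <- list_of_sf %>%
  map(~ st_buffer(.x, dist = 0.1)) %>%
  data.table::rbindlist() %>%
  sf::st_as_sf()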
Ok, so the data.table approach is something else (as expected I guess):
> nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
> set.seed(1234)
> nc_list_1k <- nc %>%
+   dplyr::sample_n(size = 1000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_1k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_1k),
+                purrr::reduce(.x = nc_list_1k,
+                              .f = sf:::rbind.sf),
+                sf::st_as_sf(data.table::rbindlist(nc_list_1k)),
+                times = 25L)
Unit: milliseconds
                                                expr         min          lq        mean      median          uq        max neval cld
           mapedit:::combine_list_of_sf(nc_list_1k) 1175.212699 1190.186592 1219.459763 1201.873386 1234.360720 1440.05542    25   c
   do.call(what = sf:::rbind.sf, args = nc_list_1k)  894.177444  912.931392  928.037925  925.873504  938.368144  992.35103    25  b
 purrr::reduce(.x = nc_list_1k, .f = sf:::rbind.sf) 8783.179875 8981.547539 9215.896749 9182.200163 9371.552815 9919.67273    25    d
    sf::st_as_sf(data.table::rbindlist(nc_list_1k))    4.802647    5.438064    5.904802    5.678197    5.819243   10.58275    25 a
> set.seed(1234)
> nc_list_10k <- nc %>%
+   dplyr::sample_n(size = 10000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_10k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_10k),
+                sf::st_as_sf(data.table::rbindlist(nc_list_10k)),
+                times = 25L)
Unit: milliseconds
                                               expr        min          lq        mean      median         uq        max neval cld
         mapedit:::combine_list_of_sf(nc_list_10k) 12212.8645 12531.08956 13102.34104 12778.99254 13164.2622 15267.2328    25  b
  do.call(what = sf:::rbind.sf, args = nc_list_10k) 41520.9927 42003.92886 43894.39777 42703.18971 45595.1675 52469.4150    25   c
   sf::st_as_sf(data.table::rbindlist(nc_list_10k))    61.6442    63.85123    72.41966    66.14747    70.3542   112.2899    25 a
I think, for me personally, it is pretty clear how I will do such an operation in the future :-).
Promoting data.table, I like it! I still remember @mattdowle presenting his package at the EARL conference in London a few years ago, slightly confused by all the fuss being made about the pipe operator back then, saying almost exasperatedly, "but we have been doing chaining in data.table for years". Don't get me wrong, I also like dplyr; it's just a nice anecdote I wanted to share.
That's great! The data.table::rbindlist solution also offers an option for adding the .id column, which I use all the time with dplyr::bind_rows. (mapedit:::combine_list_of_sf doesn't expose .id, but then it isn't an exported function anyway.)
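For reference, a small sketch of that equivalent using rbindlist()'s idcol argument (the column name "origin" is arbitrary; nc_list_1k is from the benchmark above):

# analogous to dplyr::bind_rows(..., .id = "origin"): adds a column
# recording which list element each row came from
combined <- sf::st_as_sf(data.table::rbindlist(nc_list_1k, idcol = "origin"))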
On the original mapedit issue there were some good questions on crs handling where lists may have the same / mixed / no crs. I'm new-ish to r-spatial/sf and don't have clear thoughts on a "right way" to implement that, but wanted to highlight the questions -- I've definitely stubbed my toe trying to bind lists of different-crs data frames in the past. ("Past" = "3 times last week"...)
One other thought from the new-ish-to-sf perspective -- there's a bunch of demo text in vignettes and how-to blogs on working with sf and dplyr verbs for non-spatial columns, but relatively little on split/map/combine and common purrr workflows with sf. (Like @adrfantini I've started using this all the time.) If it's worth another smallish vignette after this direction is established, I'd be happy to take a crack at it.
@tim-salabim as an aside, I've run across several helpful tweets from you on blending sf and data.table to optimize pipelines, and always meant to dig in further. Your benchmark ^ is a nice kick to do that. Consider this an appreciative audience request for an encore, too...
I will. I've just revisited a long intended implementation for building quadtrees using Rcpp and data.table (after a gentle reminder). https://twitter.com/TimSalabim3/status/943194334333227008 Hopefully I can come up with a general function for this in the not too distant future... It's such a fun problem!
On the original mapedit issue there were some good questions on crs handling where lists may have the same / mixed / no crs. I'm new-ish to r-spatial/sf and don't have clear thoughts on a "right way" to implement that, but wanted to highlight the questions -- I've definitely stubbed my toe trying to bind lists of different-crs data frames in the past. ("Past" = "3 times last week"...)
I was recently dealing with State Plane Coordinate Systems (SPCSes) and ran into this issue. I had WGS84-projected data and wanted to get the area, in meters, of each of a few million polygons spread throughout the state of Washington. My options were:
1. Transform everything to a single SPCS zone and use st_area. This is accurate for the northern half of the state, and off for the southern half of the state.
2. Use st_area for both CRSes separately, convert them both back to WGS84, and combine the results. This would have been the most accurate, but would have involved a lot of reading documentation and writing split/apply/combine code.
I went with 1 due to time constraints. I would have gone with 2 if the split/apply/combine elements were handled automatically in sf (a sketch of option 2 follows below).
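For concreteness, a sketch of what option 2 could look like by hand. The inputs are hypothetical: polys stands for an sf data frame of WGS84 polygons with a zone column ("north"/"south"), and EPSG:32148/32149 (the metre-based NAD83 Washington SPCS zones) are used purely for illustration:

library(sf)
library(purrr)

zone_crs <- c(north = 32148, south = 32149)

areas <- polys %>%
  split(.$zone) %>%                                      # split by SPCS zone
  imap(~ st_area(st_transform(.x, zone_crs[[.y]]))) %>%  # apply per zone
  unlist()                                               # combine (drops units)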
As for the right way to handle this, my intuition is that we should operate on each data frame separately in a vectorized manner that takes the crs into account. It would work similarly to a dplyr group_by object, but with a different crs for each "group". For combining, I imagine an st_bind_rows function with a crs_transform argument (I'm agnostic about the actual names) that specifies which crs all of the data frames should be transformed to.
We might also want to allow a named vector argument of c(orig_crs1 = new_crs1, ..., orig_crs_i = orig_crs_j, ...) to reduce the original data.frame collection to a smaller subset, but not necessarily a single data.frame.
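To make the idea concrete, here is a minimal sketch of the combining side. st_bind_rows and crs_transform are hypothetical names, not existing sf API, and the data.table step is just the fast path from the benchmarks above:

# hypothetical: transform every list element to a common crs, then bind
st_bind_rows <- function(list_of_sf, crs_transform) {
  transformed <- lapply(list_of_sf, sf::st_transform, crs = crs_transform)
  sf::st_as_sf(data.table::rbindlist(transformed))
}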
It would help if you could provide a minimal reproducible data example, along with what you've tried.
I know this is a necro comment, but I just implemented the data.table 'melting' of a list of sf objects into a single sf object, and while it works and is fast, the newly created object's extent is set to the extent of the first element of the original list, without (to my knowledge) a simple way to reassign it.
This behavior does not occur when using do.call(rbind, ...).
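In case it helps the next person who hits this, one workaround is to rebuild the geometry column so the bbox is recomputed from all geometries. A sketch; I haven't verified it covers every case:

library(sf)
library(data.table)

combined <- st_as_sf(rbindlist(list_of_sf))

# rbindlist() carries over the first element's bbox attribute;
# re-creating the sfc makes sf recompute it from all geometries
st_geometry(combined) <- st_sfc(st_geometry(combined), crs = st_crs(combined))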
If anyone rocks up here wondering if bind_rows() should work on sf objects, it looks like it will be supported in the upcoming 0.9.0 version: https://github.com/tidyverse/dplyr/issues/2457#issuecomment-558962592
For the sake of completeness, here's an example and the error we currently get (with dplyr 0.8.3).
# dataframe of US states with coordinates
states <- data.frame(name = state.name,
                     x = state.center$x,
                     y = state.center$y,
                     this = "this",
                     that = "that")

# create sf subsets, with different columns
library(sf)
states1 <- st_as_sf(states[1:10, 1:4], coords = c("x", "y"))
states2 <- st_as_sf(states[21:30, c(1:3, 5)], coords = c("x", "y"))

# with dplyr
library(dplyr)
bind_rows(states1, states2)
Which results in the following error and warnings:
Error in .subset2(x, i, exact = exact) :
  attempt to select less than one element in get1index
In addition: Warning messages:
1: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
2: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
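For what it's worth, the data.table route from earlier in this thread handles this example today: fill = TRUE pads the non-shared columns (this, that) with NA. A sketch, assuming the dplyr 0.8-era setup above:

library(data.table)

# works where bind_rows() errors above
st_as_sf(rbindlist(list(states1, states2), fill = TRUE))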
Good news, dplyr v1 is coming, and if you install the most recent version from github now...
remotes::install_github("tidyverse/dplyr")
you'll get a version where a simple bind_rows() of sf objects just works!
NZ_fixed <- bind_rows(NZ_2020, NZ_1995)
If you make a basic mistake, like binding objects with different crs (who would do that :)), you get a neat informative error!
Error in common_crs(x, y) :
coordinate reference systems not equal: use st_transform() first?
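And if you do want to combine objects with different crs, the error's advice works. A sketch reusing the NZ objects from above:

# align the crs first, then bind
NZ_fixed <- bind_rows(NZ_2020, st_transform(NZ_1995, st_crs(NZ_2020)))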
Edit: Also solves this issue.