Sf: bind_rows_sf for combining lists of sf objects to single dataframe

Created on 16 Jul 2018 · 11 Comments · Source: r-spatial/sf

Filing this issue after brief discussion on #r-spatial/mapedit/issues/46 and a bit more research.

Background: split/map/combine patterns are very useful for sf objects too, and I couldn't find a fast and officially-recommended way of recombining a list of sf objects into a single data frame.

The three approaches I've found in other issues/repos were:

  • purrr::reduce(list_of_sf, sf:::rbind.sf), which is the slowest (for N elements it creates N-1 intermediate objects that are progressively rbind-ed together);
  • do.call(what = sf:::rbind.sf, args = list_of_sf), which is fast for small lists but slows down past a few thousand elements / moderate data sizes (details below); and
  • mapedit:::combine_list_of_sf(list_of_sf), which scales quite well by using dplyr::bind_rows under the hood and handling geometry and attributes separately.
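For concreteness, the three patterns can be sketched on sf's bundled nc dataset (a minimal example; the chunking into five pieces is arbitrary, and the mapedit call is guarded since it uses an unexported helper):

```r
library(sf)

# Build a small list of sf objects to recombine
nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
list_of_sf <- split(nc, cut(seq_len(nrow(nc)), 5, labels = FALSE))

# 1. purrr::reduce: pairwise rbind, creating N-1 intermediate objects
combined_reduce <- purrr::reduce(list_of_sf, sf:::rbind.sf)

# 2. do.call: a single rbind.sf call over all elements at once
combined_docall <- do.call(what = sf:::rbind.sf, args = list_of_sf)

# 3. mapedit's helper: bind_rows on attributes, geometry handled separately
if (requireNamespace("mapedit", quietly = TRUE)) {
  combined_mapedit <- mapedit:::combine_list_of_sf(list_of_sf)
}
```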

I profiled the three approaches in this gist, confirming the above.

Performance over 1,000 list elements:
[benchmark screenshot]

Performance over 10,000 list elements (dropping the purrr::reduce strategy so it runs in finite time):
[benchmark screenshot]

@edzer's comments in the other issue:

  • IIRC, bind_rows cannot be extended for sf objects because its first argument is ... and it is an S3 generic, unlike rbind and cbind, which use a different method dispatch mechanism
  • given that bind_rows is easily a factor of 20 faster than rbind.sf, I think it makes sense to add a bind_rows_sf to sf.

As always, thanks for sf!

All 11 comments

Note that bind_rows not working with sf objects causes things like purrr::map_dfr to fail, which can be quite annoying. I'm starting to love purrr::map functions and their parallel application via furrr.

For completeness, I have successfully used sf::st_as_sf(data.table::rbindlist(<list_of_sf>)), though I have not benchmarked it. It would be good to see how this approach compares to the ones benchmarked above.

Ok, so the data.table approach is something else (as expected I guess):

> nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))
> set.seed(1234)
> nc_list_1k <- nc %>%
+   dplyr::sample_n(size = 1000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_1k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_1k),
+                purrr::reduce(.x = nc_list_1k,
+                              .f = sf:::rbind.sf), 
+                sf::st_as_sf(data.table::rbindlist(nc_list_1k)),
+                times = 25L)
Unit: milliseconds
                                               expr         min          lq        mean      median          uq        max neval  cld
           mapedit:::combine_list_of_sf(nc_list_1k) 1175.212699 1190.186592 1219.459763 1201.873386 1234.360720 1440.05542    25   c 
   do.call(what = sf:::rbind.sf, args = nc_list_1k)  894.177444  912.931392  928.037925  925.873504  938.368144  992.35103    25  b  
 purrr::reduce(.x = nc_list_1k, .f = sf:::rbind.sf) 8783.179875 8981.547539 9215.896749 9182.200163 9371.552815 9919.67273    25    d
    sf::st_as_sf(data.table::rbindlist(nc_list_1k))    4.802647    5.438064    5.904802    5.678197    5.819243   10.58275    25 a   

> set.seed(1234)
> nc_list_10k <- nc %>%
+   dplyr::sample_n(size = 10000, replace = TRUE) %>%
+   mutate(sample_id = paste0("sample_", row_number())) %>%
+   split(.$sample_id)
> microbenchmark(mapedit:::combine_list_of_sf(nc_list_10k),
+                do.call(what = sf:::rbind.sf,
+                        args = nc_list_10k), 
+                sf::st_as_sf(data.table::rbindlist(nc_list_10k)),
+                times = 25L)
Unit: milliseconds
                                              expr        min          lq        mean      median         uq        max neval cld
         mapedit:::combine_list_of_sf(nc_list_10k) 12212.8645 12531.08956 13102.34104 12778.99254 13164.2622 15267.2328    25  b 
 do.call(what = sf:::rbind.sf, args = nc_list_10k) 41520.9927 42003.92886 43894.39777 42703.18971 45595.1675 52469.4150    25   c
  sf::st_as_sf(data.table::rbindlist(nc_list_10k))    61.6442    63.85123    72.41966    66.14747    70.3542   112.2899    25 a  

I think, for me personally, it is pretty clear how I will do such an operation in the future :-).

Promoting data.table, I like it! I still remember @mattdowle presenting his package at the EARL conference in London a few years ago, slightly confused about all the fuss that was made back then about the pipe operator, saying almost exasperatedly, "but we have been doing chaining in data.table for years". Don't get me wrong, I also like dplyr; it's just a nice anecdote I wanted to share.

That's great! The data.table::rbindlist solution also gives an option for adding the .id column, which I also use all the time in dplyr::bind_rows. (mapedit:::combine_list_of_sf doesn't expose the .id, but then it's not an exported function anyway.)
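For anyone landing here, rbindlist's counterpart to bind_rows' .id is its idcol argument (a minimal sketch; the column name "origin" is arbitrary):

```r
library(sf)
library(data.table)

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
nc_list <- split(nc, nc$NAME)  # a named list, one element per county

# idcol = "origin" records which list element each row came from,
# analogous to the .id argument of dplyr::bind_rows()
combined <- st_as_sf(rbindlist(nc_list, idcol = "origin"))
```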

On the original mapedit issue there were some good questions on CRS handling, where the list elements may have the same / mixed / no CRS. I'm new-ish to r-spatial/sf and don't have clear thoughts on a "right way" to implement that, but wanted to highlight the questions -- I've definitely stubbed my toe trying to bind lists of different-CRS data frames in the past. ("Past" = "3 times last week"...)

One other thought from the new-ish to sf perspective -- there's a bunch of demo text on working with sf and dplyr verbs for non-spatial columns in vignettes and how-to blogs, but relatively less on split/map/combine and common purrr workflows with sf. (Like @adrfantini I've started using this all the time.) If it's worth another smallish vignette after this direction is established, I'd be happy to take a crack at it.

@tim-salabim as an aside, I've run across several helpful tweets from you on blending sf and data.table to optimize pipelines, and always meant to dig in further. Your benchmark ^ is a nice kick to do that. Consider this an appreciative audience request for an encore, too...

I will. I've just revisited a long intended implementation for building quadtrees using Rcpp and data.table (after a gentle reminder). https://twitter.com/TimSalabim3/status/943194334333227008 Hopefully I can come up with a general function for this in the not too distant future... It's such a fun problem!

I was recently dealing with State Plane Coordinate Systems (SPCSes) and ran into this issue. I had WGS84-projected data and wanted to get the area, in meters, of each of a few million polygons spread throughout the state of Washington. My options were:

  1. Convert all of the polygons to Washington State Plane North and calculate st_area. This is accurate for the northern half of the state, and off for the southern half of the state.
  2. Find the zone polygons for Washington State Plane North and Washington State Plane South, spatially join to assign each polygon its CRS, calculate st_area for the two CRSes separately, convert both back to WGS84, and combine the results. This would have been the most accurate, but would have involved a lot of documentation reading and split/apply/combine code.

I went with 1 due to time constraints. I would have gone with 2 if the split/apply/combine elements were handled automatically in sf.
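A sketch of what option 2 could look like, assuming the zone membership has already been joined on as a column; the function name, the 'zone' column, and the EPSG codes (32148 = NAD83 / Washington North, 32149 = NAD83 / Washington South) are all illustrative assumptions, not an existing sf API:

```r
library(sf)

# 'polys' is assumed to be an sf object in WGS84 with a 'zone' column
# ("north"/"south") from a prior spatial join against the SPCS zone polygons
area_by_zone <- function(polys, crs_by_zone = c(north = 32148, south = 32149)) {
  chunks <- split(polys, polys$zone)
  chunks <- lapply(names(chunks), function(zone) {
    chunk <- chunks[[zone]]
    # project into the zone's own CRS only for the measurement
    chunk$area_m2 <- st_area(st_transform(chunk, crs_by_zone[[zone]]))
    chunk  # still in WGS84, so the pieces can be recombined directly
  })
  do.call(rbind, chunks)
}
```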

As for the right way to handle this, my intuition is that we should operate on each data frame separately in a vectorized manner that takes the CRS into account. It would work similarly to a dplyr grouped data frame, but with a different CRS for each "group". For combining, I imagine an st_bind_rows function with a crs_transform argument (I'm agnostic about the actual names) that specifies which CRS all of the data frames should be transformed to.

We might also want to allow a named vector argument of c(orig_crs1 = new_crs1, ..., orig_crs_i = orig_crs_j, ...) to reduce the original data.frame collection to a smaller subset, but not necessarily a single data.frame.
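In code, the single-target-CRS part of that proposal might look something like this (st_bind_rows and crs_transform are hypothetical names from this comment, not an existing sf function):

```r
library(sf)

# Hypothetical sketch: transform every element to one target CRS, then bind.
# Errors early if any element has no CRS at all.
st_bind_rows <- function(list_of_sf, crs_transform = st_crs(list_of_sf[[1]])) {
  stopifnot(!any(vapply(list_of_sf, function(x) is.na(st_crs(x)), logical(1))))
  aligned <- lapply(list_of_sf, st_transform, crs = crs_transform)
  do.call(rbind, aligned)
}
```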

It would help if you could provide a minimal reproducible data example, along with what you've tried.

I know this is a necro comment; however, I just implemented the data.table 'melting' of a list of sf objects into a single sf object, and while it works and is fast, the newly created object's extent (bbox) is set to the extent of the first element of the original list, without (to my knowledge) a simple way to reassign it.

This behavior is not reproduced when using do.call(rbind, ...).
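One possible workaround (a sketch, not an official recommendation): rebuild the geometry column after binding, since st_sfc() computes a fresh bbox from the geometries it is given:

```r
library(sf)
library(data.table)

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
nc_list <- split(nc, cut(seq_len(nrow(nc)), 4, labels = FALSE))

combined <- st_as_sf(rbindlist(nc_list))

# Reconstructing the sfc recomputes the bbox over all features,
# replacing whatever was carried over from the first list element
st_geometry(combined) <- st_sfc(unclass(st_geometry(combined)),
                                crs = st_crs(combined))
```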

If anyone rocks up here wondering if bind_rows() should work on sf objects, it looks like it will be supported in the future 0.9.0 version: https://github.com/tidyverse/dplyr/issues/2457#issuecomment-558962592

For the sake of completeness, here's an example and the error we currently get (with dplyr 0.8.3).

# dataframe of US states with coordinates
states <- data.frame(name = state.name,
                     x = state.center$x,
                     y = state.center$y,
                     this = "this",
                     that = "that")
# create sf subsets, with different columns
library(sf)
states1 <- st_as_sf(states[1:10,1:4], coords = c("x", "y"))
states2 <- st_as_sf(states[21:30,c(1:3,5)], coords = c("x", "y"))

# with dplyr
library(dplyr)
bind_rows(states1, states2)

Which results in the following error and warnings:

Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index
In addition: Warning messages:
1: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
2: In bind_rows_(x, .id) :
  Vectorizing 'sfc_POINT' elements may not preserve their attributes
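Until that lands, one possible workaround for this example is the data.table route from earlier in the thread; a sketch, assuming rbindlist's fill = TRUE handles the non-shared columns as it does for plain data frames:

```r
library(sf)
library(data.table)

# Same subsets as above: states1 has a 'this' column, states2 a 'that' column
states <- data.frame(name = state.name,
                     x = state.center$x, y = state.center$y,
                     this = "this", that = "that")
states1 <- st_as_sf(states[1:10, 1:4], coords = c("x", "y"))
states2 <- st_as_sf(states[21:30, c(1:3, 5)], coords = c("x", "y"))

# fill = TRUE pads the columns that only one input has ('this'/'that') with NA
combined <- st_as_sf(rbindlist(list(states1, states2), fill = TRUE))
```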

Good news, dplyr v1 is coming, and if you install the most recent version from github now...
remotes::install_github("tidyverse/dplyr")
you'll get a version where a simple bind_rows() of sf objects just works!
NZ_fixed <- bind_rows(NZ_2020, NZ_1995)
If you make a basic mistake, like binding objects with a different CRS (who would do that? :)), you get a neat, informative error!
Error in common_crs(x, y) : coordinate reference systems not equal: use st_transform() first?
Edit: also solves this issue.
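And when the CRSs genuinely differ, transforming one side first is exactly the fix the error message points at. A self-contained sketch with the nc data (the NZ_2020/NZ_1995 objects from above aren't bundled anywhere; EPSG 32119 is just an illustrative projected CRS for North Carolina):

```r
library(sf)
library(dplyr)

nc <- read_sf(system.file("shape/nc.shp", package = "sf"))
a <- nc[1:50, ]
b <- st_transform(nc[51:100, ], 32119)  # NAD83 / North Carolina, for illustration

# Align the CRSs before binding, as the error message suggests
combined <- bind_rows(a, st_transform(b, st_crs(a)))
```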
