What is the optimal approach for applying over the rows of an sf object?
Using the base apply does not seem to work out, since it does not like the geometry column.
Another option is to lapply or purrr::map over the output of st_geometry only, but of course this has limitations. One more convenient option is maybe to purrr::map over the output of split, however I think this might be slow for large objects, probably a lot of memory duplication.
See these SO questions for examples and use-cases:
https://stackoverflow.com/questions/43694187/row-wise-operations-on-sf-objects
https://stackoverflow.com/questions/48505551/use-apply-with-a-simple-features-sf-function
Maybe a specific method would be nice?
Have you seen this? https://www.rstudio.com/resources/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame/ (I haven't watched it yet - I don't see this issue isolated sf, and I don't think R has a best-solution yet either, though maybe that vid has more)
For both examples, it looks to me as if the authors want to work on the geometry column only, which is quite trivial, but you want to loop over the rows of an sf object, i.e. also involve non-geometry attributes associated with each geometry in each function call? What is the use case?
@edzer Yes. Example use case which happened to me recently:
read in a dataset with stars, and perform statistic on N different regions (defined in an sf object: 1 row = 1 region) by subsetting the stars object, while keeping information on the region. Some code will explain it better: (something like this, but of course using a much more efficient apply loop, this is just for clarity)
s = read_stars(fn_s)
sf = st_read(fn_sf)
d = NULL
for (i in 1:nrow(sf)) {
sfi = sf[i,]
result = data.frame(
statistic=somefunction(s[sfi], sfi$flag),
region=sfi$region,
area=sfi$area,
smth=sfi$smth
)
d = rbind(d, result)
}
Yes, this can be done using lapply, or map, or something similar, but if just feels like a workaround.
I think apply is the wrong thing. It's for arrays, and only works for data frames by coercing them to a matrix. The row-wise ops are not workarounds, it's essentially inefficient to be row-wise on something col-wise.
I see what you're saying now about stars, though - I see this as a group_by and I've been trying a few things in a NetCDF context. The semantics I've been trying (in quite different code) is
stars %>% group_by(sf, <space>) %>% mutate(or)summarize
It's not the usual way of group_by, but the sf object does uncontroversially define non-rectangular groups of cells in the stars grid. I don't know how this is seen in CDO contexts, we were discussing literally yesterday whether those ops can take in a grouped-mask, or is the task split with a mask for each polygon?
The <space> part is a way of nominating which pair of dimensions the sf object applies to (common default is x/y). So, nothing to help you but I appreciate you expanding on this - I'm thinking about it!
@edzer and I did discuss this at some point, I'll see if we did that online or by email.
Yes, it might be that apply is not what I actually need. Sorry, I should have provided an example use-case from the start.
It might be more appropriate to move this to a stars issue instead?
Well, depending on the use case of course, but stars is in any case full with apply potential (and have a look at stars::st_apply first)
I cannot see how to use st_apply in this case, as far as I understand that works only on stars dimensions. It looks to me that the apply (or similar) here needs to be done on the sf object, and the stars passed as argument.
In short, I need one way to combine one stars object with a multi-feature sf object, while keeping the information of each feature.
I've been using split and map but I feel it's not very efficient.
Do you want to simply group them, or summarize them as well by feature?
Also summarize, see the example above.
In general, I would like to be able to apply a function to each row of the sf object, as if I was doing split and map, but without the overhead this implies (or at least I think it does).
In this particular case the function implies subsetting a stars object. The function I'd like to apply could be something like this, to take the mean of each region and store it in a tidy data.frame with columns for additional info:
mean_of_region <- function(sfrow, stars_object) {
d = stars_object[sfrow] %>% st_apply(1:2, mean) %>% as_tibble
d$region = sfrow$region,
d$something = sfrow$something
d
}
I appreciate you humouring me, so I've created a dummy raster to check that we're on the same page.
This apply_fun is equivalent to what I meant above with raster %>% group_by(sf) %>% summarize(funs), (groups defined per polygon as the pixels covered by a row in the sf).
## example data, a polygon layer and a grid of pixel values
## using raster as a prototype
library(sf)
example(st_read)
library(fasterize)
rr <- raster::raster(raster::extent(st_bbox(nc)[c(1, 3, 2, 4)]), nrows = 8, ncols = 16)
rr[] <- seq(raster::ncell(rr))
rr <- raster::disaggregate(rr, fact = 4, method = "")
apply_fun <- function(ras, sfd, funs) {
sfd$sfd_row <- seq_len(nrow(sfd))
groupraster <- fasterize::fasterize(sfd, ras, field = "sfd_row")
summ <- tibble::tibble(value = raster::values(ras),
sfd_row = raster::values(groupraster)) %>%
dplyr::filter(!is.na(sfd_row)) %>% dplyr::group_by(sfd_row) %>%
dplyr::summarize_all(funs)
## use temp copy rather than entire pipeline to maintain sf class
sfd %>% dplyr::inner_join(summ, "sfd_row") %>%
dplyr::arrange(sfd_row) %>% dplyr::select(-sfd_row)
}
## column "value" is the mean of the raster in each polygon
apply_fun(rr, nc, mean)
## we can name the funs, so it's more obvious "mean" and "sd"
ncgridsumm <- apply_fun(rr, nc, list(mean = mean, sd = sd))
plot(ncgridsumm[c("mean", "sd")])
I don't think that kind of workflow is well-defined anywhere, but it's something I use routinely. It's now massively faster with fasterize (faster than raster::extract(ras, poly, fun = mean)) .
Is this equivalent to what you mean? I'm not au fait enought with stars to see how but I think it's worth having a concrete example so we have real numbers to compare. Can you come up with a real example? It's hard to contrive something interesting when you need two layers like this.
I'll take a look at it tomorrow. Thanks!
@edzer
For both examples, it looks to me as if the authors want to work on the geometry column only, which is quite trivial, but you want to loop over the rows of an sf object, i.e. also involve non-geometry attributes associated with each geometry in each function call? What is the use case?
This use case comes up in my work from time to time, mostly when I'm interested in running spatial operation on a subset of rows and/or I want to pass a variable to a parameter of a spatial function.
Here's a reprex showing how I go about it - please excuse the nonsensical nature of the example:
library(units)
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.3, proj.4 4.9.3
library(tidyverse)
demo(nc, ask = FALSE)
set.seed(314159)
make_buffer_dist <- function(x) {
# this is a bogus function, for illustration purposes only
st_area(x) %>% as.double() %>% sqrt() %>% round() %>% set_units("ft")
}
make_buffer <- function(geom_pt, dist) {
geom_pt %>%
st_geometry() %>%
st_centroid(of_largest_polygon = TRUE) %>%
st_buffer(dist = dist)
}
nc_buffer <- nc %>%
st_transform(2264) %>% # NC state plane, US foot
transmute(
NAME,
dist = make_buffer_dist(.),
geom_pt = st_centroid(st_geometry(.), of_largest_polygon = TRUE),
geom_buffer = make_buffer(., dist)
) %>%
st_set_geometry("geom_buffer")
## Warning: package 'bindrcpp' was built under R version 3.4.2
plot(st_geometry(nc_buffer))
conditional_buffer <- function(...) {
# browser()
l <- list(...)
if (l$buff_lgl) {
make_buffer(l$geom_pt, l$dist)
} else {
st_polygon() %>% st_sfc()
}
}
nc_buff_grp <- nc_buffer %>%
mutate(buff_lgl = sample(c(TRUE, FALSE), n(), replace = TRUE)) %>%
mutate(geom_buff_grp = pmap(., conditional_buffer) %>% reduce(c) %>% st_set_crs(st_crs(.))) %>%
st_set_geometry("geom_buff_grp")
plot(st_geometry(nc_buff_grp), border = "red", add = TRUE)

Session info
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Los_Angeles
## date 2018-04-17
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.2)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.2)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.2)
## class 7.3-14 2015-08-30 CRAN (R 3.4.0)
## classInt 0.1-24 2017-04-16 CRAN (R 3.4.2)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
## compiler 3.4.0 2017-04-21 local
## crayon 1.3.4 2018-02-12 Github (r-lib/crayon@95b3eae)
## curl 3.1 2017-12-12 CRAN (R 3.4.3)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.8 2018-03-02 CRAN (R 3.4.4)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.15 2018-01-28 CRAN (R 3.4.3)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## e1071 1.6-8 2017-02-02 CRAN (R 3.4.2)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.3)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.3)
## foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
## ggplot2 * 2.2.1.9000 2018-03-27 Github (thomasp85/ggplot2@f1ba983)
## glue 1.2.0.9000 2018-03-02 Github (tidyverse/glue@9d96cbf)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.2)
## hms 0.4.0 2017-11-23 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr 1.3.1 2017-08-20 CRAN (R 3.4.2)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.19 2018-01-29 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.2 2018-02-06 CRAN (R 3.4.3)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## mime 0.5 2016-07-07 CRAN (R 3.4.0)
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.2)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.0 2017-04-21 local
## pillar 1.1.0.9000 2018-02-12 Github (r-lib/pillar@595d1ac)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.2)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4.9000 2018-03-02 Github (tidyverse/purrr@84ce1ad)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## Rcpp 0.12.16 2018-03-13 CRAN (R 3.4.4)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.2)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.2)
## reshape2 1.4.3 2017-12-11 CRAN (R 3.4.4)
## rlang 0.2.0.9001 2018-03-27 Github (tidyverse/rlang@49d7a34)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## RPostgreSQL 0.6-2 2017-06-24 CRAN (R 3.4.4)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.2)
## scales 0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)
## sf * 0.6-2 2018-04-15 Github (r-spatial/sf@89eabc4)
## stats * 3.4.0 2017-04-21 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.3.0 2018-02-19 CRAN (R 3.4.3)
## tibble * 1.4.2 2018-02-12 Github (tidyverse/tibble@80ac1a3)
## tidyr * 0.8.0.9000 2018-03-02 Github (tidyverse/tidyr@96ded3f)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.3)
## tools 3.4.0 2017-04-21 local
## udunits2 0.13 2016-11-17 CRAN (R 3.4.1)
## units * 0.5-1 2018-01-08 CRAN (R 3.4.3)
## utils * 3.4.0 2017-04-21 local
## withr 2.1.2 2018-03-27 Github (jimhester/withr@79d7b0d)
## xml2 1.2.0 2018-01-24 CRAN (R 3.4.4)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
Following @mdsumner 's example, one could do
library(stars) # maybe after devtools::install_github("r-spatial/stars")
s = st_as_sf(st_as_stars(rr), as_points = TRUE)
a = aggregate(st_set_crs(s, st_crs(nc)), nc, mean)
plot(a)
where aggregate basically does the group_by %>% summarise pattern.
That works great for a single slice data set. My approach is motivated by applying this to tens of thousands of slices across separated file bricks - for climate model output. I can't see how current stars aggregate can be applied to that situation? Until there's abstraction over out of memory data sources, the index has to be kept separated. Here's the apply and index-creation functions separated, just fyi.
## workhorse function to apply to a brick, with index (polygon id per cell), the funs
apply_funs <- function(ras, index, funs) {
l <- vector("list", raster::nlayers(ras))
for (i in seq_len(raster::nlayers(ras))) {
## assuming ras is a 3D brick
index[["values"]] <- raster::values(ras[[i]])[index[["cell"]]]
## a bit slower
## index[["values"]] <- raster::extract(ras[[i]], index[["cell"]])
l[[i]] <- index %>% dplyr::group_by(sfd_row) %>%
dplyr::select(-cell) %>%
dplyr::summarize_all(funs)
}
dplyr::bind_rows(l, .id = "slice_in_brick")
}
## classify a raster cells by polygons (sf-data)
build_index <- function(ras, sfd) {
sfd$sfd_row <- seq_len(nrow(sfd))
groupraster <- fasterize::fasterize(sfd, ras[[1]], field = "sfd_row")
tibble::tibble(cell = seq_len(raster::ncell(ras)),
sfd_row = raster::values(groupraster)) %>%
dplyr::filter(!is.na(sfd_row))
}
library(sf)
example(st_read)
library(fasterize)
rr <- raster::raster(raster::extent(st_bbox(nc)[c(1, 3, 2, 4)]), nrows = 8, ncols = 16)
rr[] <- seq(raster::ncell(rr))
rr <- raster::disaggregate(rr, fact = 4, method = "")
index <- build_index(rr, nc)
l <- vector("list", 2)
for (i in 1:2) {
## pretend that rr is updated for each i by reading a brick from file
l[[i]] <- apply_funs(rr, index, list(mean = mean, sd = sd))
}
Most helpful comment
That works great for a single slice data set. My approach is motivated by applying this to tens of thousands of slices across separated file bricks - for climate model output. I can't see how current stars aggregate can be applied to that situation? Until there's abstraction over out of memory data sources, the index has to be kept separated. Here's the apply and index-creation functions separated, just fyi.