sf: Should select() keep geom?

Created on 20 Dec 2016 · 8 Comments · Source: r-spatial/sf

I just wanted to mention that I was surprised to see the class change when doing, e.g.:

library(sf)
library(dplyr)

st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE) %>%
  select(AREA) %>%
  plot()

This silently drops the geometry. Although that might be the most logical effect, it still changes the class of the object, which I believe is unexpected.

I see three possibilities :

  • stay silent (current situation)
  • keep geom cols (might also be unexpected)
  • warn the user that the object is being cast to a data.frame (might be annoying)

None of these options is ideal, but I think a warning would be my favourite.

All 8 comments

I am also in favor of select having sticky geometry - if you don't want this, you could simply coerce to a data.frame, or work with data.frame instead of sf in the first place. [.sf also has sticky geometry.
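A minimal sketch of the two behaviours mentioned above, assuming the sf package and its bundled nc.shp sample are installed. It shows that `[` on an sf object carries the geometry column along, while coercing to a data.frame first yields a plain table. The variable names are illustrative only.

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# `[.sf` is sticky: the geometry column comes along even though
# only AREA was requested
sticky <- nc[, "AREA"]
class(sticky)   # still includes "sf"
names(sticky)   # AREA plus the geometry column

# Coerce to a plain data.frame first if the geometry is not wanted
plain <- as.data.frame(nc)[, "AREA", drop = FALSE]
class(plain)    # no longer an sf object
```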

What do the dplyr experts present in https://github.com/edzer/sfr/issues/42 say on this?

We were talking a lot about type-stable functions. In my opinion, the best idea is to keep the geometry. In this case you will have an sf object as input, then (for example) select(AREA), and still have an sf object as output.

How about e.g. mutate(), if you create a buffer for instance? What should mutate(clearance = st_buffer(geometry)) return? Is it allowed to have more than one geometry column?

Well, this is S3, where anything is allowed. You can add a geometry list-column to a data.frame, or two of them, or five. And also to an sf object, unless we overwrite $<-, [<- etc to break on that condition, and then you might find a way around that.

But how do you expect the output of that to behave when plotted, or when it is used to compute intersections with another object? In any case, the sf_column attribute of sf objects points to "the" geometry column; that is what st_geometry() will look at.
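As a small illustration of that last point (a sketch only, assuming sf and its bundled nc.shp sample are installed), the active geometry column can be looked up through the sf_column attribute and compared with what st_geometry() returns:

```r
library(sf)

nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Name of the column that sf treats as "the" geometry
active <- attr(nc, "sf_column")
active

# st_geometry() looks at that same column
identical(st_geometry(nc), nc[[active]])
```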

Ah, S3 ❤! That's real freedom, and it comes with responsibilities! @edzer, it seems to imply a lot, I don't know. From what I know of PostGIS, as an example of multiple-column support, you have to specify the column you want. However, I believe QGIS will take a guess at the first column if none is specified (maybe it's more elaborate, but it guesses). It works most of the time.

Could mutate always operate on the geometry column if a geometry is returned, or politely force the user to operate on the column named by attr(x, "sf_column")? Multiple-column support could make sense (it's clearly useful), but maybe it should be added to the roadmap later?

I see that spdplyr dodged the question because geometry isn't exposed (only the @data slot is used). Lucky @mdsumner !

I think select should return the same class it was given, thus retaining the geom column. However, I have seen the following around the internet to pull out a vector from a tbl_df:

iris %>% 
  select(Sepal.Width) %>% 
  .[[1]]

So, to dummy proof against things like this, I would suggest always putting the geom column at the end if, and only if, the geom column is not specifically selected.

@etiennebr It was by design really, not exactly lucky :) but really there's no way to modify geometry in that spdplyr context - it's just a wrapper around a two-table model - spdplyr is a sp-user level device written using table-techniques available in spbabel. The modify-geometry stuff is done with sptable<- in replacement-mode, but still it's dev-stuff, "dragons lurk here" territory because there are so many ways to do it wrong.

A purist table view would say you cannot index columns as if it were a matrix, and then "sticky" would be less of a complication, but other than saying that I don't have any good ideas. I think the dplyr group_by/summarize case is an interesting one to support, since if you don't do an explicit summarize operation on the geometry, you need a policy on what happens in the "sticky" case.

x %>% group_by(col1) %>% summarize(col2 = mean(col2), st_union(geometry))  ## totally fine

x %>% group_by(col1) %>% summarize(col2 = mean(col2))  ## what should happen? 

GEOMETRYCOLLECTION is an obvious default, but union or intersect may be a more user-friendly idea.
