Dplyr: join_mutate(x = <sf>)

Created on 2 Mar 2020 · 14 Comments · Source: tidyverse/dplyr

library(sf)
#> Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(qualmap)
library(dplyr, warn.conflicts = FALSE)

sf <- mutate(stLouis, TRACTCE = as.numeric(TRACTCE))
df <- tibble(TRACTCE = 112100, x = 1)

left_join(sf, df)
#> Joining, by = "TRACTCE"
#> Error: `nm` must be `NULL` or a character vector the same length as `x`

Created on 2020-03-02 by the reprex package (v0.3.0.9000)

This happens here: https://github.com/tidyverse/dplyr/blob/master/R/join.r#L296

x_key <- set_names(x[vars$x$key], names(vars$x$key))

because x is an sf object and the [ method brings back the geometry column, so x[vars$x$key] does not have the same number of columns as the length of names(vars$x$key)
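
The mismatch can be reproduced without sf. The following is a minimal base R sketch, assuming a hypothetical `sticky_df` class whose `[` method always keeps a `geometry` column. It only imitates the behaviour described above and is not sf's actual implementation; `key` plays the role of `vars$x$key`.

```r
# Hypothetical stand-in for sf: a data frame subclass whose `[` method
# always keeps the "geometry" column. This is NOT sf's real code.
new_sticky <- function(df) {
  class(df) <- c("sticky_df", "data.frame")
  df
}

`[.sticky_df` <- function(x, i, ...) {
  cols <- union(i, "geometry")   # re-attach the sticky column
  cls <- class(x)
  class(x) <- "data.frame"       # fall back to plain data frame subsetting
  out <- x[cols]
  class(out) <- cls
  out
}

x <- new_sticky(data.frame(
  TRACTCE = c(112100, 112200),
  geometry = c("POLYGON A", "POLYGON B")
))

key <- c(TRACTCE = "TRACTCE")    # plays the role of vars$x$key
x_key <- x[key]

length(names(key))  # one name is available
ncol(x_key)         # two columns come back, so the names cannot line up
```

Any code that assumes `x[key]` has exactly `length(key)` columns, as the `set_names()` call does, fails at this point.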

bug join

All 14 comments

A number of packages exhibit that issue as part of the latest revdep checks:

`nm` must be `NULL` or a character vector the same length as `x`
===============================================================

areal, chilemapas, foieGras, GADMTools, nhdplusTools, qualmap, raceland, 
sabre, spatialrisk, stplanr

I'm not sure all these are related to sf though.

I guess this is part of new rlang set_names() behavior. I wasn't aware that this behavior wasn't already an error in rlang before I moved set_names() to C, so it seemed reasonable to enforce it.

I guess we could convert x and y to tibbles earlier with as_tibble() to ensure that we don't have any of these "sticky" columns when subsetting? I think @lionel- has seen these sticky columns a lot, so maybe he has an idea too.

If we did that, I think we'd need to replace x_out with x in dplyr_reconstruct(out, x_out)?
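
The idea in the two comments above, converting early and reconstructing against the original input at the end, might look roughly like this. It is a base R sketch only: `sticky_df`, `restore_class()` and `left_join_sketch()` are hypothetical names, `restore_class()` merely stands in for what `dplyr_reconstruct()` would do, and the real dplyr internals differ.

```r
# Hypothetical sticky class, standing in for sf.
`[.sticky_df` <- function(x, i, ...) {
  cls <- class(x)
  class(x) <- "data.frame"
  out <- x[union(i, "geometry")]
  class(out) <- cls
  out
}

# Hypothetical stand-in for dplyr_reconstruct(): copy the template's class.
restore_class <- function(out, template) {
  class(out) <- class(template)
  out
}

# Single-key left join that never subsets the classed input directly.
left_join_sketch <- function(x, y, by) {
  x_plain <- x
  class(x_plain) <- "data.frame"   # convert early: no sticky columns from here on
  y_plain <- y
  class(y_plain) <- "data.frame"

  # x_plain[by] now has exactly length(by) columns.
  stopifnot(ncol(x_plain[by]) == length(by))

  idx <- match(x_plain[[by]], y_plain[[by]])
  y_extra <- y_plain[idx, setdiff(names(y_plain), by), drop = FALSE]
  rownames(y_extra) <- NULL

  out <- cbind(x_plain, y_extra)
  restore_class(out, x)            # reconstruct against the original x
}

x <- data.frame(
  TRACTCE = c(112100, 112200),
  geometry = c("POLYGON A", "POLYGON B")
)
class(x) <- c("sticky_df", "data.frame")
y <- data.frame(TRACTCE = 112100, x = 1)

out <- left_join_sketch(x, y, by = "TRACTCE")
class(out)
names(out)
```

The trade-off raised in the next comment still applies: once the input is a plain data frame, its class cannot check invariants or update metadata during the intermediate steps.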

The sf data frames are hard to program with because x[idx] isn't guaranteed to be length(idx). Transforming inputs to a tibble in the data.frame method might be reasonable to avoid surprises. The problem is that this prevents input classes from checking invariants / updating metadata at each manipulation of the data frame, which seems wrong at some level. Maybe it's better to keep this error, and let sf maintainers override nest_join() to make their class compatible with dplyr.

@lionel- I thought I added new behavior that closed https://github.com/r-lib/rlang/issues/886, but it turns out that this was actually fixed in (at least) rlang 0.4.4, so I didn't actually add anything new

@edzer, I wanted to alert you to this issue. The basic problem is that sf breaks an implicit contract that we expect data.frame subclasses to fulfil. It's not totally clear to me what we need to do to fix it. sf is an important class, so we could change dplyr to make it work (or at least figure out how to make this bit of code generic). But I wanted to check with you that you were committed to this behaviour before we did a significant amount of work.

One potential compromise would be to allow 1-parameter [ to lose the geometry columns, and keep the sticky behaviour for the 2-parameters [. This seems to make sense if we consider the former to be programmer-oriented and the latter to be user-oriented.
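
As a rough illustration of that compromise, a `[` method can tell the two call forms apart with `nargs()`, which is the same mechanism base R's `[.data.frame` uses. The `sticky_df` class below is a hypothetical stand-in for sf, and this sketch takes the "bare data frame" option for the one-parameter form; it is not a proposal for sf's actual code.

```r
# Hypothetical sticky class: x[i] drops to a plain data frame,
# x[i, j] keeps the geometry column attached.
`[.sticky_df` <- function(x, i, j, ...) {
  one_param <- nargs() == 2L       # x[i] has 2 arguments, x[i, j] has 3
  cls <- class(x)
  class(x) <- "data.frame"

  if (one_param) {
    # Programmer-oriented form: exactly the requested columns, no class.
    return(x[i])
  }

  # User-oriented form: geometry stays sticky.
  cols <- if (missing(j)) names(x) else union(j, "geometry")
  out <- if (missing(i)) x[, cols, drop = FALSE] else x[i, cols, drop = FALSE]
  class(out) <- cls
  out
}

x <- data.frame(
  TRACTCE = c(112100, 112200),
  geometry = c("POLYGON A", "POLYGON B")
)
class(x) <- c("sticky_df", "data.frame")

one <- x["TRACTCE"]     # plain data frame, 1 column
two <- x[, "TRACTCE"]   # sticky_df, 2 columns (TRACTCE and geometry)
```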

The sf data frame could devolve either into a partially instantiated state (waiting for a geometry column to be added back later on), or to a bare data frame.

Thanks for notifying me! @lionel- on the sf side I don't think that is an option - a typical user pattern is to call plot(nc["BIR74"]) and expect a map with nc county polygons colored according to the BIR74 attribute.

What I now did is strip the sf class label and put it back after dplyr did its job; many of the dplyr methods are implemented that way in sf.

@edzer are you happy to continue with that approach? It sounds like you'll now need to add methods for all the join functions.

Thanks @hadley - let's put it this way: I would be the last person who'd be unhappy if dplyr handled sf objects the "right" way out of the box, but I also understand that you want x[idx] to have length length(idx), which sf breaks. As long as dplyr handles sfc geometry list columns correctly (thanks to @lionel- for the vctrs wrappers!), and if changes in behaviour like stripping attributes are announced early, the additional work to be done in sf is quite manageable.

sf has provided dplyr compatible *_join methods for three years now.

@edzer you can consider this an informal notification :smile:. We'll be sending out the formal revdep emails later in the week. We now have a somewhat more principled approach described in ?dplyr_extending, although unfortunately it's not going to be super helpful for you because of the aforementioned [ behaviour.

This is going to hit all users now using xxx_join(x, y) with y an sf object. The error message now given by dplyr,

Error: `nm` must be `NULL` or a character vector the same length as `x`

is not helpful, and I've already seen several users turn to the sf developers for help because of it. Would it be possible to emit a more helpful error message, e.g. one pointing to https://github.com/r-spatial/sf/issues/1314, to be removed a couple of releases later?
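
One way such a message could be produced is to check the column count before the names are set. This is a base R sketch under assumptions: `check_key_columns()` is a hypothetical helper, and the wording is illustrative, not what dplyr emits.

```r
# Hypothetical helper: check that subsetting by the join keys returned
# exactly one column per key, and explain the likely cause if it did not.
check_key_columns <- function(data_key, key_names, arg) {
  if (ncol(data_key) != length(key_names)) {
    stop(
      sprintf(
        paste0(
          "Subsetting `%s` by %d join key(s) returned %d column(s).\n",
          "If `%s` is an sf object, see ",
          "https://github.com/r-spatial/sf/issues/1314"
        ),
        arg, length(key_names), ncol(data_key), arg
      ),
      call. = FALSE
    )
  }
  invisible(data_key)
}

# Simulate the sticky result: one key requested, two columns returned.
sticky_result <- data.frame(TRACTCE = 112100, geometry = "POLYGON A")
msg <- tryCatch(
  check_key_columns(sticky_result, "TRACTCE", arg = "y"),
  error = function(e) conditionMessage(e)
)
cat(msg, "\n")
```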

Ah, I didn't think about the y case; that's definitely worth fixing. I'll think about it and see what I can come up with.

Ok, I'm pretty happy with the current fix — we'll need to keep thinking about what the optimal behaviour is here, but at least sf objects work once more.

Great, thanks a lot!!
