Sf: left_join of sf objects converted by as.data.frame()

Created on 12 Oct 2019 · 23 Comments · Source: r-spatial/sf

I have two left_join calls:
df <- dplyr::left_join(routes.sf %>% st_drop_geometry(), shapes.sf %>% st_drop_geometry(), by=c('shape_id'='shape_id')) %>%
glimpse()

df <- dplyr::left_join(routes.sf %>% as.data.frame(), shapes.sf %>% as.data.frame(), by=c('shape_id'='shape_id')) %>%
glimpse()

The first one works nicely, but the second one gives an error:

Error: Evaluation error: 'names' attribute [180] must be the same length as the vector [167].

I tried to make a reprex, but I got this message: Error: callr subprocess failed: input string 1 is invalid UTF-8

sf_as.data.frame.txt

All 23 comments

st_drop_geometry drops (removes) the geometry list-column, as.data.frame keeps it, but coerces the object to a data.frame. You are aware that sf has a left_join method for sf objects?
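A minimal sketch of that difference, using a small made-up object (ls.sf and its columns are illustrative, not the reporter's data):

```r
library(sf)

# Toy sf object with two identical linestrings (illustrative data only)
ls <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))
ls.sf <- st_sf(a = 1:2, geom = st_sfc(ls, ls))

dropped <- st_drop_geometry(ls.sf)  # geometry column removed
coerced <- as.data.frame(ls.sf)     # geometry column kept, sf class removed

names(dropped)                 # "a"
names(coerced)                 # "a" "geom"
inherits(coerced, "sf")        # FALSE
inherits(coerced$geom, "sfc")  # TRUE
```

So after as.data.frame() the geometry list-column still takes part in whatever happens next, while after st_drop_geometry() it is gone.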

When I try

nc <- left_join(routes.sf, shapes.sf, by=c('shape_id'='shape_id'))

I get an error:
Error: y should be a data.frame; for spatial joins, use st_join

Yes, left_join.sf requires that the second argument is not an sf object; try

nc <- left_join(routes.sf, as.data.frame(shapes.sf), by=c('shape_id'='shape_id'))

I get the same error:

nc <- left_join(routes.sf, as.data.frame(shapes.sf), by=c('shape_id'='shape_id'))
Error: Evaluation error: 'names' attribute [180] must be the same length as the vector [167].

I did the same thing with the bus stops: the geometry is POINT and the left_join returns a result.

The geometry of the routes is MULTILINESTRING; that is one difference.

I tried to make a simple example:

ls <- st_linestring(rbind(c(0,0),c(1,1),c(2,1)))
mls <- st_multilinestring(list(rbind(c(2,2),c(1,3)), rbind(c(0,0),c(1,1),c(2,1))))
ls.sf <- st_sf(a = 1:2, geom = st_sfc(ls, ls))
mls.sf <- st_sf(a = 2:3, geom = st_sfc(mls, mls))
df <- dplyr::left_join(ls.sf %>% as.data.frame(), mls.sf %>% as.data.frame(), by=c('a'='a')) %>%
glimpse()

but it works.

One difference is the output of as.data.frame:

mls.sf %>% as.data.frame()
a geom
1 2 MULTILINESTRING ((2 2, 1 3)...
2 3 MULTILINESTRING ((2 2, 1 3)...

routes.sf %>% as.data.frame()
shape_id geometry
299439-(no role) 0064-A-4001-1167 MULTILINESTRING ((-1.52638 ...
319321-(no role) 0009-A-1372-1390 MULTILINESTRING ((-1.65907 ...
324474-(no role) 0050-A-3301-1563 MULTILINESTRING ((-1.573188...
324904-(no role) 0041-B-1663-2127 MULTILINESTRING ((-1.670629...

Yes, and

df <- left_join(ls.sf, mls.sf %>% as.data.frame(), by=c('a'='a'))

also works. Strange?

The problem is that in your original example, the geometry list columns have a names attribute. If you remove this, by

names(routes.sf$geometry)=NULL
names(shapes.sf$geometry)=NULL

then things work.

Is it possible that these objects were created by an old version of sf? In that case, please recreate them. If they are created by new versions of packages, please let me know how they were created so we can find out who adds names to geometries (this has caused more trouble in other places, so was removed).
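As a rough sketch of the workaround above, the check and removal could be wrapped in a small helper. The name strip_geometry_names is hypothetical (it is not part of sf), and the toy data only stands in for routes.sf and shapes.sf:

```r
library(sf)

# Hypothetical helper (not part of sf): remove names from the geometry
# list-column of an sf object, if there are any.
strip_geometry_names <- function(x) {
  stopifnot(inherits(x, "sf"))
  geom <- st_geometry(x)
  if (!is.null(names(geom))) {
    names(geom) <- NULL
    st_geometry(x) <- geom
  }
  x
}

# Toy data: put names on the geometry column, then strip them again
ls <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))
geom <- st_sfc(ls, ls)
names(geom) <- c("id-1", "id-2")
named.sf <- st_sf(a = 1:2, geom = geom)

clean.sf <- strip_geometry_names(named.sf)
is.null(names(st_geometry(clean.sf)))
```

Objects without names on the geometry column pass through unchanged, so the helper could be applied to both join inputs before calling left_join.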

gtfs.sf

library(tidyverse)
library(tidytransit)
tt <- tidytransit::read_gtfs(dsn)
shapes.sf <- shapes_as_sf(tt$shapes) %>%
glimpse()

# taken directly from https://www.natedayta.com/2018/06/02/extending-gtfs-capabilities-with-parsing-into-simple-features/
shapes_as_sf <- function(df) {
  sfc <- df %>% # long data frame
    arrange(shape_pt_sequence) %>% # essential!
    split(.$shape_id) %>% # list of shorter data frames, one per shape
    map(~ select(., shape_pt_lon, shape_pt_lat) %>% # order matters: lon first, lat second
      as.matrix %>% # coerce to keep st_linestring happy
      st_linestring) %>%
    st_sfc(crs = 4326) # bundle all shapes into a collection

  nc <- unique(df$shape_id) %>%
    sort() %>% # sort to match names(sfc); split()'s factor coercion sorts alphabetically
    st_sf(shape_id = ., geometry = sfc) %>%
    glimpse()
}

routes.sf

library(osmdata)
library(tidyverse)
q <- opq(bbox = c(51.1, 0.1, 51.2, 0.2))
osm.sf <- osmdata_sf(q, f_osm) %>%
glimpse()
routes.sf <- osm.sf$osm_multilines

Please provide working code:

> tt <- tidytransit::read_gtfs(dsn)
Error: object 'dsn' not found

I'm sorry, I forgot to copy a line:
library(sf)
library(tidyverse)
library(tidytransit)
dsn <- 'https://framagit.org/mgageo/osmdata/raw/master/OSMDATA/gtfs.zip'
tt <- tidytransit::read_gtfs(dsn)
shapes.sf <- shapes_as_sf(tt$shapes) %>%
glimpse()

Many thanks for sf, it's a very useful package

> osm.sf <- osmdata_sf(q, f_osm) %>%
+ glimpse()
Error in fill_overpass_data(obj, doc, quiet = quiet) : 
  object 'f_osm' not found

library(sf)
library(tidyverse)
library(osmdata)
dsn <- 'https://framagit.org/mgageo/osmdata/raw/master/OSMDATA/relations_route_bus.osm'
fic <- tempfile('osmdata', fileext='.osm')
download.file(dsn, fic, mode = "wb")
q <- opq(bbox = c(51.1, 0.1, 51.2, 0.2))
osm.sf <- osmdata_sf(q, fic)
routes.sf <- osm.sf$osm_multilines %>%
glimpse()

Several of the objects returned by osmdata_sf have a geometry list column with a names attribute. That is the cause of your downstream trouble. You may want to open an issue with package osmdata. @mpadge we had names on feature geometries once, then abandoned it because (IIRC) it messes up everything in mapview (@tim-salabim : am I right?). This issue is another case where it messes up downstream handling.

Yes, @edzer, you're right that it messes up mapview, because of leaflet. These names are nevertheless an inviolable part of osmdata, and leaflet is supposed to be implementing changes to obviate this problem (issue here). That was in May this year, so still coming I guess ... In the meantime, best solution is for affected code to insert a one-liner to check if names exist, and remove (names(.)<-NULL) if so.

The issue here suggests that named geometries are not only a problem for leaflet or mapview, but also for left_join, so will not go away when leaflet is modified. As named geometries are not part of the sf API, I cannot recommend your "in the meantime" solution, but recommend to not set them in osmdata.

The issue also came up when we wanted to read the feature ID in GDAL. Since names on list columns cause trouble, we now (optionally) put them in a column of an sf object; see parameter fid_column_name. Why don't you do the same thing with the IDs you create in osmdata?
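A sketch of what that could look like on the user side: copy any names on the geometry list-column into an ordinary column, then drop them. The helper name geometry_names_to_column and its id_column argument are hypothetical (they belong neither to sf nor to osmdata), and the toy data is illustrative only:

```r
library(sf)

# Hypothetical helper (not part of sf or osmdata): copy geometry names into
# an ordinary column, then drop them from the geometry list-column.
geometry_names_to_column <- function(x, id_column = "feature_id") {
  stopifnot(inherits(x, "sf"))
  geom <- st_geometry(x)
  ids <- names(geom)
  if (!is.null(ids)) {
    names(geom) <- NULL
    st_geometry(x) <- geom
    x[[id_column]] <- ids  # IDs now live in a regular attribute column
  }
  x
}

# Toy data with names on the geometry column
ls <- st_linestring(rbind(c(0, 0), c(1, 1), c(2, 1)))
geom <- st_sfc(ls, ls)
names(geom) <- c("id-1", "id-2")
named.sf <- st_sf(a = 1:2, geom = geom)

out <- geometry_names_to_column(named.sf)
is.null(names(st_geometry(out)))
```

Keeping the IDs in a column means they survive joins and subsetting like any other attribute, instead of depending on how downstream code treats names on a list-column.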

I could remove the names of the geometries (and may do as per the issue referenced above), but for leaflet at least, the underlying problem is actually that the coordinate matrices within the sf geometries have rownames. These are, however, utterly essential, because OSM itself has a very strict policy that defines intersections as ID-based only, and never based on coordinates. This is in turn related to OSM not including any z component (and not having any plans to do so any time in the future). Without z values, the only way to identify vertically-mismatched yet horizontally coincident intersection points is through ID values. So within OSM, it is essential that all LINESTRING objects, for example, retain the unique ID values of all of their constituent points. The only way to include these within an sf data.frame is to store them as rownames of the coordinate matrix, and it is these rownames that are the real problem here.

What I probably should do is implement a keep_rownames = FALSE parameter that strips these problematic rownames by default, but enables them to be kept when needed, along with sufficient explanation of the potential consequences of doing so (like the fact that the returned object ought not be considered a "true" sf object; rather something that can be used by many, but not all sf routines). Would that be acceptable @edzer?

I feel that this is something that should be handled on the leaflet side, possibly even in JS not R. They are requiring that valuable information is discarded. Named vectors/list may also be untidy, but are justified when we need to make fish from fish soup. Recall that in PROJ it is being considered whether all coordinates should also have a timestamped CRS too, so matrix rownames would be needed for keying one to another. I don't think that leaflet is a robust standard.

@mpadge you are raising an entirely new issue. The issue here is that names on sfc objects break things downstream, not only in leaflet but also in tidyverse, and that osmdata sets them.

Please open a new issue if you want to discuss row.names on coordinate matrices; sf never sets these, so right now they are not considered to be part of "the sf api".

Why do names on list column components cause trouble and for whom? Shouldn't the downstream users of properly configured lists attend to this themselves? Where is the need to drop list column names specified? Base R supports/expects lists to have named components to permit $ etc., to work for access.

See this issue, and the other one related to leaflet. Anyone should feel free to _make_ them work, by submitting a PR.

The issue in leaflet is caused by rownames, not by named items in the geometry list

OK, I understand.
