Sf: st_read and requiring layer name

Created on 28 Oct 2016 · 44Comments · Source: r-spatial/sf

This is a small pet peeve of mine with regard to rgdal - is there a compelling reason to continue to force the user to provide the layer name when reading in data?

There isn't a technical reason within gdal to require this and in my experience most geometry files I've worked with only ever have one layer. It seems to me a more sensible approach is to make the layer name an optional argument and in the case of a single layer to just automatically return that object.

If more than one layer are present then the current error could be thrown, or alternatively throw a warning and return a list containing all layers.

Source

rundel

Most helpful comment

Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases.

My only small quibble is including return_layers as an argument with st_read makes the function a bit monolithic and breaks type stability. Why not include an st_info function with ogrinfo like functionality which can then be used for retrieving layer informatio?

rundel on 1 Nov 2016

👍3

All 44 comments

Nice idea, would improve usability! What would be a good approach to try to derive layer name from a single string: remove the file name extension and possibly path from the file name?

edzer on 29 Oct 2016

I'm not sure, my use cases have almost always been geojson and shapefiles so I don't have direct experience with the more exotic vector types and their usage of layer names.

GDAL peculiarities mean that both of the following work without issue

nc <- st_read(system.file("shape/nc.shp", package="sf"), "nc", crs = 4267)
nc <- st_read(system.file("shape/", package="sf"), "nc", crs = 4267)

so I'm not sure if just looking at the path will be helpful.

The example I have the most issue with has always been with reading geojson where the user is expected to know that the layer is called OGRGeoJSON - that has always been a pain point when my students are first learning this stuff.

rundel on 29 Oct 2016

I agree with the changes suggested by @rundel. That will increase usability lots. It was a godsend when I found out about raster::shapefile() and I think st_read() would benefit from working in the same or a similar way.

I think it's easiest to just point it at the file automatically deciding the driver/layer based on the extension. This is what rio does and I think sf should too: https://cran.r-project.org/web/packages/rio/vignettes/rio.html

Robinlovelace on 30 Oct 2016

So these might be some heuristics then: if st_read is called with a single argument, it must be an existing file, and

if the file name ends in .shp, we assume it is a shapefile (with full path), and call the layer the basename (i.e. extension and path stripped) of the file name
Geopackages end in .gpkg (see req 3, so use the same trick as for shapefiles;
Other file types with fixed extension insert here
other file names, we assume will be GeoJSON, and layer will be called OGRGeoJSON (see here)

edzer on 30 Oct 2016

👍3

Some other common extensions that I think would be good for sf to recognise:

.geojson
.osm
.gml
.kml
.kmz

I think raster already does a good job of handling the multiple raster formats.

Robinlovelace on 30 Oct 2016

Examples of using st_read on .geojson .osm, .gml, .kml and .kmz would be helpful!

edzer on 30 Oct 2016

Please be careful to separate out st_read for dummies and the real st_read, which shouldn't make assumptions. A typical problem is with the shapefile driver which falls over in a dsn=directory with a given layer=, but where the directory contains a *.dbf without being part of a shapefile. Robert's use of shapefile as a general term for vector data is very unhelpful. The border between sensible simplification and going too far is very real. Something not (yet) in the frame is encoding, which is often very unpleasant (even CP1252 is a problem), and far from everyone is UTF-8 yet. So either do a wrapper for st_read which handles frequent cases, and explain that more often than we like, real data is messy and needs the underlying complications in the bare interface, or just provide the bare interface and enough guidance. Fixing GeoJSON for the current odd OGR naming may change as soon as the driver gets updated - there are reasons for exposing users to the actual complications, because it makes derailments their responsibility.

rsbivand on 30 Oct 2016

Example of a 5 Mb file I created - cycling potential on major roads in London:

u = "https://github.com/npct/pct-data/raw/master/london/rnet.geojson"
download.file(url = u, destfile = "/tmp/rnet.geojson")
rnet = sf::st_read("/tmp/rnet.geojson") # fails
rnet = sf::st_read(dsn = "/tmp/rnet.geojson", layer = "OGRGeoJSON") # works

I take Roger's point about precision and robustness vs ease of use and like the idea of having a separate wrapper function for magically guessing, most of the time correctly, how to read in a file. For that I would propose something with a name like st_import(), akin to rio::import().

Would that satisfy all parties?

rnet <- sf::st_import("/tmp/rnet.geojson")

Would work for me and think it's a function I can even help writing!

Robinlovelace on 30 Oct 2016

I will not provide two functions. We can give layer a default name where this makes sense, and an error message when it is missing otherwise (e.g. where dsn is a directory, or an extension is missing). Like rgdal, the function responds with printing which layer is being read from which data source, on success, so that is there. Even ogr2ogr doesn't force you to specify layers if they're obvious.

edzer on 30 Oct 2016

Sounds reasonable and can understand desire to minimise number of functions. So, to clarify, that would mean st_read("file.geojson") and st_read("file.shp") would work? I think that would make life easier for lots of people, myself included, especially if I end up teaching sf which I hope will happen based on the impressive rate of development.

Robinlovelace on 30 Oct 2016

no, because that doesn't point to a file with an extension that tells us what the layer name should be. I just pushed a commit that accepts

 st_read(system.file("shape/nc.shp", package="sf"))

though -- please test (now supports gpkg, geojson, shp).

To be discussed: what will it return as default?

edzer on 30 Oct 2016

here's an example xml (.osm) file. Not sure if this can be read with rgdal or sf at present. Btw - fast work Edzer! Will aim to test during the week, maybe Weds.
doc_txt.osm.zip

Robinlovelace on 30 Oct 2016

Could you clarify what you mean by "that doesn't point to a file with an extension that tells us what the layer name should be" - not understanding 100%.

Robinlovelace on 30 Oct 2016

I meant: literally the command you typed. It has no file extension since the "/" comes after the ".". If you mean, either

st_read(file.geojson")

st_read("file.shp")

then yes, but please don't imply meaning in github issues -- I can only read what you write.

Which layer in your .osm? ogrinfo reports 5:

edzer@gin-edzer:/tmp$ ogrinfo doc_txt.osm
Had to open data source read-only.
INFO: Open of `doc_txt.osm'
      using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)

edzer on 30 Oct 2016

library(sf)
o1 = st_read("doc_txt.osm", "points")
o2 = st_read("doc_txt.osm", "lines")
o3 = st_read("doc_txt.osm", "multilinestrings")
o4 = st_read("doc_txt.osm", "multipolygons")
o5 = st_read("doc_txt.osm", "other_relations")

plot(o1)
plot(o2, add = TRUE, col = 'red')
plot(o3, add = TRUE, col = 'blue')
plot(o4, add = TRUE, col = 'green')
plot(o5, add = TRUE, col = 'orange')

(the last two are empty)

edzer on 30 Oct 2016

Many thanks.

Robinlovelace on 31 Oct 2016

I changed strategy somewhat. In case no layer is specified, st_read now checks with GDAL: if there is just one layer, that layer is read; if there is more than one, the list with layers is returned. This avoids the step of having to guess from extensions etc:

> st_read("PG:dbname=postgis")
Driver: PostgreSQL 
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point

edzer on 31 Oct 2016

👍1

Very cool - works a treat, as tested on the osm data:

# devtools::install_github("edzer/sfr") # latest version
library(sf)

# user: now I want to read in an osm file...
osm_data = st_read(dsn = "/tmp/doc_txt.osm")
# user: great - now I'll plot it...
# plot(osm_data) # fail
# user: what's in it?
osm_data

## Driver: OSM 
##         layer_name       geometry_type
## 1           points               Point
## 2            lines         Line String
## 3 multilinestrings   Multi Line String
## 4    multipolygons       Multi Polygon
## 5  other_relations Geometry Collection

# user: ah ok, now I'll read-in the points
osm_points = st_read(dsn = "/tmp/doc_txt.osm", layer = "points")

## Reading layer points from data source /tmp/doc_txt.osm using driver "OSM"
## features:       932
## fields:         10
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

plot(osm_points)

I think there should be a message telling the user that no data has been read in when this happens though, e.g. There are multiple layers in this data source. Set the layer argument in st_read to read the desired layer., in the same way that there is a useful message when the data does get read in.

Robinlovelace on 31 Oct 2016

Yes, I think there should be a message, either:

No layer (what then?)
One layer (read that layer)
Many layers (print a message - should it be a normal message/warning/error?)

Nowosad on 31 Oct 2016

And the message could also say something like

Available layers are:
## Driver: OSM 
##         layer_name       geometry_type
## 1           points               Point
## 2            lines         Line String
## 3 multilinestrings   Multi Line String
## 4    multipolygons       Multi Polygon
## 5  other_relations Geometry Collection

Robinlovelace on 31 Oct 2016

OK, I now have:

> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> x = st_read("PG:dbname=empty")   # no layers
No layers are present in data source PG:dbname=empty.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL 
Available layers:
<none>
> x = st_read("x.geojson")         # single layer
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features:       3
fields:         2
proj4string:    +proj=longlat +datum=WGS84 +no_defs 
> x = st_read("PG:dbname=postgis") # multiple layers
Multiple layers are present in data source PG:dbname=postgis.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL 
Available layers:
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point
> x = st_read("PG:dbname=empty", quiet = TRUE)   # no layers, quiet
>

edzer on 31 Oct 2016

Looks like you've cracked it - can't think of any way to make that better. Your work will make loading geographic data into R more accessible for thousands of people when this takes off!

Robinlovelace on 1 Nov 2016

Although from a grammar perspective I don't think the semi colon works, hence https://github.com/edzer/sfr/pull/43/commits/ecea519a9262b5ef18c91e930d6f08e469958bed

(I know, I'm painting bike sheds again but someone's got to do it!)

Robinlovelace on 1 Nov 2016

I'd say job done and this issue can be closed. This issue has led to an improvement in usability and, although you still need to provide a layer name, it's now clear where to find those names. Solved from your perspective @rundel?

Robinlovelace on 1 Nov 2016

I'm late to this, but concerned that you can now get a data set from st_read, or an error, or a list with layer names, which will present strange errors downstream when you try to do stuff as if it were a data set. To me it should be an error, or a valid data set (or NULL, maybe if the data source can execute queries).

Hadley calls this "type-stability": https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ (and I'm a massive offender in my own packages, but it's important I think. )

Is there a function to get the layer names without using st_read, like ogrListLayers?

If you consider multiple NetCDF variables in a single file to be analogous to vector layers: I've always liked raster's behaviour to read the first variable in a NetCDF file as the one to read if the "varname" is not specified, with a message about that decision including a list of the available variable names. That way you get a data set without specifying a variable, but also an indication that a choice for the "varname" argument was made for you.

The behaviour would be to have an implicit default value of 1L for "layer" to act as an index into the list of available layers, and to trigger a message with the details in that case. Then an explicit integer for layer could stand in place of a name - sometimes you want to read everything and push it around somewhere else, without knowing what it is.

I'd meant to suggest this behaviour, since it's such a huge obstacle for newcomers who start by knowing nothing about multiple layers, and then (from what I've seen), turn from readOGR to the more obvious tools in maptools and raster to read their vector data.

(All that said, the current behaviour is a huge user-friendly improvement over readOGR, so I'm very glad to see it.)

mdsumner on 1 Nov 2016

Interesting points @mdsumner. I like the option of providing a NULL output if there are many layers and print the layer names. Then a separate function, e.g. st_listlayers() could actually provide the output. Can see benefit of this in terms of making function 'modular'.

Certainly agree with the general principle of type stability.

Robinlovelace on 1 Nov 2016

I like type stability. How about this (in case no layer is specified):

reading an empty dataset generates an error
reading a dataset with a single layer simply reads this layer, as above
reading a dataset with multiple layers reads the first layer, prints all layers, and emits a warning (unless quiet = TRUE)
add an option return_layers that, when TRUE returns the list of layers in cases 1, 2 and 3.

edzer on 1 Nov 2016

👍2

This looks at least reasonable, assuming one places more importance on user-friendliness than I would (user-friendliness as the opposite of the very beneficial steep learning curve). Should readOGR move to this structure too?

rsbivand on 1 Nov 2016

Sound like good suggestions @edzer. I think it would do no harm for readOGR to shift to this behavior from @rsbivand, can't see any unintended consequences and would be beneficial for people new to spatial data files.

Robinlovelace on 1 Nov 2016

👍2

Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases.

rundel on 1 Nov 2016

👍3

@edzer @rsbivand - what's are the future plans for readOGR? Will st_read and readOGR be used by users at the same time or readOGR will be replaced by st_read? If the second is true - I think is better to leave readOGR as it is - many scripts/tutorials/books are based on the dsn and layer behaviour of readOGR.

Nowosad on 1 Nov 2016

They are separate. rgdal will go to maintenance mode for vector, but just as maptools hasn't been deprecated, I guess rgdal vector will not be either. I'd just change a missing layer= argument from:
if (missing(layer)) stop("missing layer") to the suggested approach, and similar for ogrInfo().

rsbivand on 1 Nov 2016

OK, I currently see:

> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> try(x <- st_read("PG:dbname=empty")) # no layers: error
Data source PG:dbname=empty contains no layers
Error in eval(substitute(expr), envir, enclos) : 
  Error: no layers in datasource.

> x = st_read("x.geojson")         # single layer: silent
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features:       3
fields:         2
proj4string:    +proj=longlat +datum=WGS84 +no_defs 
> x = st_read("PG:dbname=postgis") # multiple layers: message + warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features:       155
fields:         12
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs 
Warning message:
In eval(substitute(expr), envir, enclos) :
  automatically selected the first layer in a data source containing more than one
> x = st_read("PG:dbname=postgis", quiet = TRUE) # multiple layers, suppress message
Warning message:
In eval(substitute(expr), envir, enclos) :
  automatically selected the first layer in a data source containing more than one
> suppressWarnings(x <- st_read("PG:dbname=postgis")) # multiple layers, suppress warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features:       155
fields:         12
proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs 
>

and

> st_list("PG:dbname=postgis")
Driver: PostgreSQL 
Available layers:
  layer_name geometry_type
1      meuse         Point
2   meuse_sf         Point
3       sids Multi Polygon
4  meuse_tbl         Point
5 meuse_tbl2         Point

so that should settle this issue.

edzer on 1 Nov 2016

Very comprehensive solution and demonstration of the new functionality, that works for me and it seems @rundel who usefully opened this issue is happy. Job done I'd say!

Robinlovelace on 1 Nov 2016

Yes, very good indeed. Thanks all (and especially @edzer for the work!)

mdsumner on 1 Nov 2016

Looks great to me, thanks for all the hard work @edzer

rundel on 1 Nov 2016

Perhaps this is a local issue or my own ignorance (likely), but as I start to develop on mapedit, I have had a very difficult time easily converting geojson as a list to sf. Here is an example using a drawn polygon returned from Leaflet.Draw.

drawn <- list(structure(list(type = "Feature", properties = structure(list(
    `_leaflet_id` = 102L, feature_type = "rectangle"), .Names = c("_leaflet_id", 
"feature_type")), geometry = structure(list(type = "Polygon", 
    coordinates = list(list(list(10.0854730652645, 48.8285498294904), 
        list(10.0854730652645, 49.393084372626), list(11.123681073077, 
            49.393084372626), list(11.123681073077, 48.8285498294904), 
        list(10.0854730652645, 48.8285498294904)))), .Names = c("type", 
"coordinates"))), .Names = c("type", "properties", "geometry"
)))

library(leaflet)
# draw it with leaflet
leaflet() %>% addTiles() %>% addGeoJSON(drawn)

What would be the correct way to convert drawn to sf? The only thing I can get to work is

library(sf)

st_polygon(
    list(
        matrix(
            unlist(drawn[[1]]$geometry$coordinates),
            byrow=TRUE,
            ncol=2
        )
    )
)

Thanks so much for sf. It is already amazing!

timelyportfolio on 19 Jan 2017

Yes, this is how it works. I'd have opted for

do.call(rbind, unlist(drawn[[1]]$geometry$coordinates))

but that really doesn't matter. The extra list nesting however suggests that coordinates may have more rings, in which case you need to do more - you'd need to know whether they are holes or other outer rings. @mdsumner might know this area better?

edzer on 19 Jan 2017

Ok good to know, thanks! As far as I am aware, Leaflet.Draw does not support rings or holes, so I think I am safe in this case. However, I would always appreciate input from @mdsumner. For further reference, the Shiny gadget proof of concept is here.

timelyportfolio on 19 Jan 2017

@timelyportfolio I've implemented sf -> geojson (as a list) in geojsonio: https://github.com/ropensci/geojsonio/blob/master/R/geojson_list.R#L294. That may be of help to go backwards. Also, I'd love to have the same method you're working on in geojsonio... 😉

ateucher on 19 Jan 2017

@timelyportfolio it will show holes correctly by virtue of the evenodd rule

drawn <-
  list(structure(
    list(
      type = "Feature",
      properties = structure(
        list(`_leaflet_id` = 102L, feature_type = "rectangle"),
        .Names = c("_leaflet_id",
                   "feature_type")
      ),
      geometry = structure(
        list(type = "Polygon",
             coordinates = list(
               list(
               list(10.0854730652645, 48.8285498294904),
               list(10.0854730652645, 49.393084372626),
               list(11.123681073077,
                    49.393084372626),
               list(11.123681073077, 48.8285498294904),
               list(10.0854730652645, 48.8285498294904)
             ), 
             list(
               list(10.3, 48.9), list(10.3, 49.1), list(10.7, 49.1), list(10.7, 48.9), 
               list(10.3, 48.9)))),
        .Names = c("type",
                   "coordinates")
      )
    ),
    .Names = c("type", "properties", "geometry")
  ))

Note that leaflet the R package will get support for true leaflet-MultiPolygon, but it doesn't have it yet.

I'm confused about why addGeoJSON takes an R list rather than a string though, we really need some clarity in this area, and some semantics for transferring between raw formats (like a string, or binary format), their analogous representation within R, and the form needed to drive another tool. Sometimes there are too many steps in the chain.
(I often encounter the idea that "simple features" is the right central form but it's clearly not able to represent everything we need, TopoJSON for example. I am working on the design of that central form but it's still a little early). We don't necessarily need a universal form all the time, but there is a need for guidance on what chained steps make sense in terms of efficiency and preserving all information with minimal redundancy.

And clearly we need better tools for GeoJSON and TopoJSON, rather than hacking these internal forms, we are half-ish way through that with some tools here - I'm not sure how what we are doing meshes with geojsonio, yet. TopoJSON requires partial indexing between shared tables (or structures), so it's a good bridge for what the broader central scheme I've been working towards can provide.

mdsumner on 20 Jan 2017

Maybe this, and linking to @ateucher helps?

edzer on 20 Jan 2017

Sorry, I now realize that my previous comment probably doesn't make sense in this context.

edzer on 20 Jan 2017

Thanks for all the input. I will promote to a new issue for geojson as R list to sf.

timelyportfolio on 20 Jan 2017

Was this page helpful?

0 / 5 - 0 ratings