This is a small pet peeve of mine with regard to rgdal - is there a compelling reason to continue to force the user to provide the layer name when reading in data?
There isn't a technical reason within gdal to require this and in my experience most geometry files I've worked with only ever have one layer. It seems to me a more sensible approach is to make the layer name an optional argument and in the case of a single layer to just automatically return that object.
If more than one layer are present then the current error could be thrown, or alternatively throw a warning and return a list containing all layers.
Nice idea, would improve usability! What would be a good approach to try to derive layer name from a single string: remove the file name extension and possibly path from the file name?
I'm not sure, my use cases have almost always been geojson and shapefiles so I don't have direct experience with the more exotic vector types and their usage of layer names.
GDAL peculiarities mean that both of the following work without issue
nc <- st_read(system.file("shape/nc.shp", package="sf"), "nc", crs = 4267)
nc <- st_read(system.file("shape/", package="sf"), "nc", crs = 4267)
so I'm not sure if just looking at the path will be helpful.
The example I have the most issue with has always been with reading geojson where the user is expected to know that the layer is called OGRGeoJSON - that has always been a pain point when my students are first learning this stuff.
I agree with the changes suggested by @rundel. That will increase usability lots. It was a godsend when I found out about raster::shapefile() and I think st_read() would benefit from working in the same or a similar way.
I think it's easiest to just point it at the file automatically deciding the driver/layer based on the extension. This is what rio does and I think sf should too: https://cran.r-project.org/web/packages/rio/vignettes/rio.html
So these might be some heuristics then: if st_read is called with a single argument, it must be an existing file, and
.shp, we assume it is a shapefile (with full path), and call the layer the basename (i.e. extension and path stripped) of the file name.gpkg (see req 3, so use the same trick as for shapefiles;OGRGeoJSON (see here)Some other common extensions that I think would be good for sf to recognise:
.geojson
.osm
.gml
.kml
.kmz
I think raster already does a good job of handling the multiple raster formats.
Examples of using st_read on .geojson .osm, .gml, .kml and .kmz would be helpful!
Please be careful to separate out st_read for dummies and the real st_read, which shouldn't make assumptions. A typical problem is with the shapefile driver which falls over in a dsn=directory with a given layer=, but where the directory contains a *.dbf without being part of a shapefile. Robert's use of shapefile as a general term for vector data is very unhelpful. The border between sensible simplification and going too far is very real. Something not (yet) in the frame is encoding, which is often very unpleasant (even CP1252 is a problem), and far from everyone is UTF-8 yet. So either do a wrapper for st_read which handles frequent cases, and explain that more often than we like, real data is messy and needs the underlying complications in the bare interface, or just provide the bare interface and enough guidance. Fixing GeoJSON for the current odd OGR naming may change as soon as the driver gets updated - there are reasons for exposing users to the actual complications, because it makes derailments their responsibility.
Example of a 5 Mb file I created - cycling potential on major roads in London:
u = "https://github.com/npct/pct-data/raw/master/london/rnet.geojson"
download.file(url = u, destfile = "/tmp/rnet.geojson")
rnet = sf::st_read("/tmp/rnet.geojson") # fails
rnet = sf::st_read(dsn = "/tmp/rnet.geojson", layer = "OGRGeoJSON") # works
I take Roger's point about precision and robustness vs ease of use and like the idea of having a separate wrapper function for magically guessing, most of the time correctly, how to read in a file. For that I would propose something with a name like st_import(), akin to rio::import().
Would that satisfy all parties?
rnet <- sf::st_import("/tmp/rnet.geojson")
Would work for me and think it's a function I can even help writing!
I will not provide two functions. We can give layer a default name where this makes sense, and an error message when it is missing otherwise (e.g. where dsn is a directory, or an extension is missing). Like rgdal, the function responds with printing which layer is being read from which data source, on success, so that is there. Even ogr2ogr doesn't force you to specify layers if they're obvious.
Sounds reasonable and can understand desire to minimise number of functions. So, to clarify, that would mean st_read("file.geojson") and st_read("file.shp") would work? I think that would make life easier for lots of people, myself included, especially if I end up teaching sf which I hope will happen based on the impressive rate of development.
no, because that doesn't point to a file with an extension that tells us what the layer name should be. I just pushed a commit that accepts
st_read(system.file("shape/nc.shp", package="sf"))
though -- please test (now supports gpkg, geojson, shp).
To be discussed: what will it return as default?
Could you clarify what you mean by "that doesn't point to a file with an extension that tells us what the layer name should be" - not understanding 100%.
I meant: literally the command you typed. It has no file extension since the "/" comes after the ".". If you mean, either
st_read(file.geojson")
or
st_read("file.shp")
then yes, but please don't imply meaning in github issues -- I can only read what you write.
Which layer in your .osm? ogrinfo reports 5:
edzer@gin-edzer:/tmp$ ogrinfo doc_txt.osm
Had to open data source read-only.
INFO: Open of `doc_txt.osm'
using driver `OSM' successful.
1: points (Point)
2: lines (Line String)
3: multilinestrings (Multi Line String)
4: multipolygons (Multi Polygon)
5: other_relations (Geometry Collection)
library(sf)
o1 = st_read("doc_txt.osm", "points")
o2 = st_read("doc_txt.osm", "lines")
o3 = st_read("doc_txt.osm", "multilinestrings")
o4 = st_read("doc_txt.osm", "multipolygons")
o5 = st_read("doc_txt.osm", "other_relations")
plot(o1)
plot(o2, add = TRUE, col = 'red')
plot(o3, add = TRUE, col = 'blue')
plot(o4, add = TRUE, col = 'green')
plot(o5, add = TRUE, col = 'orange')
(the last two are empty)
Many thanks.
I changed strategy somewhat. In case no layer is specified, st_read now checks with GDAL: if there is just one layer, that layer is read; if there is more than one, the list with layers is returned. This avoids the step of having to guess from extensions etc:
> st_read("PG:dbname=postgis")
Driver: PostgreSQL
layer_name geometry_type
1 meuse Point
2 meuse_sf Point
3 sids Multi Polygon
4 meuse_tbl Point
5 meuse_tbl2 Point
Very cool - works a treat, as tested on the osm data:
# devtools::install_github("edzer/sfr") # latest version
library(sf)
# user: now I want to read in an osm file...
osm_data = st_read(dsn = "/tmp/doc_txt.osm")
# user: great - now I'll plot it...
# plot(osm_data) # fail
# user: what's in it?
osm_data
## Driver: OSM
## layer_name geometry_type
## 1 points Point
## 2 lines Line String
## 3 multilinestrings Multi Line String
## 4 multipolygons Multi Polygon
## 5 other_relations Geometry Collection
# user: ah ok, now I'll read-in the points
osm_points = st_read(dsn = "/tmp/doc_txt.osm", layer = "points")
## Reading layer points from data source /tmp/doc_txt.osm using driver "OSM"
## features: 932
## fields: 10
## proj4string: +proj=longlat +datum=WGS84 +no_defs
plot(osm_points)
I think there should be a message telling the user that no data has been read in when this happens though, e.g. There are multiple layers in this data source. Set the layer argument in st_read to read the desired layer., in the same way that there is a useful message when the data does get read in.
Yes, I think there should be a message, either:
And the message could also say something like
Available layers are:
## Driver: OSM
## layer_name geometry_type
## 1 points Point
## 2 lines Line String
## 3 multilinestrings Multi Line String
## 4 multipolygons Multi Polygon
## 5 other_relations Geometry Collection
OK, I now have:
> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> x = st_read("PG:dbname=empty") # no layers
No layers are present in data source PG:dbname=empty.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL
Available layers:
<none>
> x = st_read("x.geojson") # single layer
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features: 3
fields: 2
proj4string: +proj=longlat +datum=WGS84 +no_defs
> x = st_read("PG:dbname=postgis") # multiple layers
Multiple layers are present in data source PG:dbname=postgis.
Returning a list of layer names and their type;
set the `layer' argument in `st_read' to read a particular layer.
> x
Driver: PostgreSQL
Available layers:
layer_name geometry_type
1 meuse Point
2 meuse_sf Point
3 sids Multi Polygon
4 meuse_tbl Point
5 meuse_tbl2 Point
> x = st_read("PG:dbname=empty", quiet = TRUE) # no layers, quiet
>
Looks like you've cracked it - can't think of any way to make that better. Your work will make loading geographic data into R more accessible for thousands of people when this takes off!
Although from a grammar perspective I don't think the semi colon works, hence https://github.com/edzer/sfr/pull/43/commits/ecea519a9262b5ef18c91e930d6f08e469958bed
(I know, I'm painting bike sheds again but someone's got to do it!)
I'd say job done and this issue can be closed. This issue has led to an improvement in usability and, although you still need to provide a layer name, it's now clear where to find those names. Solved from your perspective @rundel?
I'm late to this, but concerned that you can now get a data set from st_read, or an error, or a list with layer names, which will present strange errors downstream when you try to do stuff as if it were a data set. To me it should be an error, or a valid data set (or NULL, maybe if the data source can execute queries).
Hadley calls this "type-stability": https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ (and I'm a massive offender in my own packages, but it's important I think. )
Is there a function to get the layer names without using st_read, like ogrListLayers?
If you consider multiple NetCDF variables in a single file to be analogous to vector layers: I've always liked raster's behaviour to read the first variable in a NetCDF file as the one to read if the "varname" is not specified, with a message about that decision including a list of the available variable names. That way you get a data set without specifying a variable, but also an indication that a choice for the "varname" argument was made for you.
The behaviour would be to have an implicit default value of 1L for "layer" to act as an index into the list of available layers, and to trigger a message with the details in that case. Then an explicit integer for layer could stand in place of a name - sometimes you want to read everything and push it around somewhere else, without knowing what it is.
I'd meant to suggest this behaviour, since it's such a huge obstacle for newcomers who start by knowing nothing about multiple layers, and then (from what I've seen), turn from readOGR to the more obvious tools in maptools and raster to read their vector data.
(All that said, the current behaviour is a huge user-friendly improvement over readOGR, so I'm very glad to see it.)
Interesting points @mdsumner. I like the option of providing a NULL output if there are many layers and print the layer names. Then a separate function, e.g. st_listlayers() could actually provide the output. Can see benefit of this in terms of making function 'modular'.
Certainly agree with the general principle of type stability.
I like type stability. How about this (in case no layer is specified):
return_layers that, when TRUE returns the list of layers in cases 1, 2 and 3.This looks at least reasonable, assuming one places more importance on user-friendliness than I would (user-friendliness as the opposite of the very beneficial steep learning curve). Should readOGR move to this structure too?
Sound like good suggestions @edzer. I think it would do no harm for readOGR to shift to this behavior from @rsbivand, can't see any unintended consequences and would be beneficial for people new to spatial data files.
Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases.
My only small quibble is including return_layers as an argument with st_read makes the function a bit monolithic and breaks type stability. Why not include an st_info function with ogrinfo like functionality which can then be used for retrieving layer informatio?
@edzer @rsbivand - what's are the future plans for readOGR? Will st_read and readOGR be used by users at the same time or readOGR will be replaced by st_read? If the second is true - I think is better to leave readOGR as it is - many scripts/tutorials/books are based on the dsn and layer behaviour of readOGR.
They are separate. rgdal will go to maintenance mode for vector, but just as maptools hasn't been deprecated, I guess rgdal vector will not be either. I'd just change a missing layer= argument from:
if (missing(layer)) stop("missing layer") to the suggested approach, and similar for ogrInfo().
OK, I currently see:
> library(sf)
Linking to GEOS 3.5.0, GDAL 2.1.0
> try(x <- st_read("PG:dbname=empty")) # no layers: error
Data source PG:dbname=empty contains no layers
Error in eval(substitute(expr), envir, enclos) :
Error: no layers in datasource.
> x = st_read("x.geojson") # single layer: silent
Reading layer OGRGeoJSON from data source x.geojson using driver "GeoJSON"
features: 3
fields: 2
proj4string: +proj=longlat +datum=WGS84 +no_defs
> x = st_read("PG:dbname=postgis") # multiple layers: message + warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features: 155
fields: 12
proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs
Warning message:
In eval(substitute(expr), envir, enclos) :
automatically selected the first layer in a data source containing more than one
> x = st_read("PG:dbname=postgis", quiet = TRUE) # multiple layers, suppress message
Warning message:
In eval(substitute(expr), envir, enclos) :
automatically selected the first layer in a data source containing more than one
> suppressWarnings(x <- st_read("PG:dbname=postgis")) # multiple layers, suppress warning
Multiple layers are present in data source PG:dbname=postgis.
Reading layer `meuse'.
Use `st_list' to list all layer names and their type.
Set the `layer' argument in `st_read' to read a particular layer.
Reading layer meuse from data source PG:dbname=postgis using driver "PostgreSQL"
features: 155
fields: 12
proj4string: +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957388243134,0.343987817378283,-1.87740163998045,4.0725 +units=m +no_defs
>
and
> st_list("PG:dbname=postgis")
Driver: PostgreSQL
Available layers:
layer_name geometry_type
1 meuse Point
2 meuse_sf Point
3 sids Multi Polygon
4 meuse_tbl Point
5 meuse_tbl2 Point
so that should settle this issue.
Very comprehensive solution and demonstration of the new functionality, that works for me and it seems @rundel who usefully opened this issue is happy. Job done I'd say!
Yes, very good indeed. Thanks all (and especially @edzer for the work!)
Looks great to me, thanks for all the hard work @edzer
Perhaps this is a local issue or my own ignorance (likely), but as I start to develop on mapedit, I have had a very difficult time easily converting geojson as a list to sf. Here is an example using a drawn polygon returned from Leaflet.Draw.
drawn <- list(structure(list(type = "Feature", properties = structure(list(
`_leaflet_id` = 102L, feature_type = "rectangle"), .Names = c("_leaflet_id",
"feature_type")), geometry = structure(list(type = "Polygon",
coordinates = list(list(list(10.0854730652645, 48.8285498294904),
list(10.0854730652645, 49.393084372626), list(11.123681073077,
49.393084372626), list(11.123681073077, 48.8285498294904),
list(10.0854730652645, 48.8285498294904)))), .Names = c("type",
"coordinates"))), .Names = c("type", "properties", "geometry"
)))
library(leaflet)
# draw it with leaflet
leaflet() %>% addTiles() %>% addGeoJSON(drawn)
What would be the correct way to convert drawn to sf? The only thing I can get to work is
library(sf)
st_polygon(
list(
matrix(
unlist(drawn[[1]]$geometry$coordinates),
byrow=TRUE,
ncol=2
)
)
)
Thanks so much for sf. It is already amazing!
Yes, this is how it works. I'd have opted for
do.call(rbind, unlist(drawn[[1]]$geometry$coordinates))
but that really doesn't matter. The extra list nesting however suggests that coordinates may have more rings, in which case you need to do more - you'd need to know whether they are holes or other outer rings. @mdsumner might know this area better?
Ok good to know, thanks! As far as I am aware, Leaflet.Draw does not support rings or holes, so I think I am safe in this case. However, I would always appreciate input from @mdsumner. For further reference, the Shiny gadget proof of concept is here.
@timelyportfolio I've implemented sf -> geojson (as a list) in geojsonio: https://github.com/ropensci/geojsonio/blob/master/R/geojson_list.R#L294. That may be of help to go backwards. Also, I'd love to have the same method you're working on in geojsonio... 馃槈
@timelyportfolio it will show holes correctly by virtue of the evenodd rule
drawn <-
list(structure(
list(
type = "Feature",
properties = structure(
list(`_leaflet_id` = 102L, feature_type = "rectangle"),
.Names = c("_leaflet_id",
"feature_type")
),
geometry = structure(
list(type = "Polygon",
coordinates = list(
list(
list(10.0854730652645, 48.8285498294904),
list(10.0854730652645, 49.393084372626),
list(11.123681073077,
49.393084372626),
list(11.123681073077, 48.8285498294904),
list(10.0854730652645, 48.8285498294904)
),
list(
list(10.3, 48.9), list(10.3, 49.1), list(10.7, 49.1), list(10.7, 48.9),
list(10.3, 48.9)))),
.Names = c("type",
"coordinates")
)
),
.Names = c("type", "properties", "geometry")
))
Note that leaflet the R package will get support for true leaflet-MultiPolygon, but it doesn't have it yet.
I'm confused about why addGeoJSON takes an R list rather than a string though, we really need some clarity in this area, and some semantics for transferring between raw formats (like a string, or binary format), their analogous representation within R, and the form needed to drive another tool. Sometimes there are too many steps in the chain.
(I often encounter the idea that "simple features" is the right central form but it's clearly not able to represent everything we need, TopoJSON for example. I am working on the design of that central form but it's still a little early). We don't necessarily need a universal form all the time, but there is a need for guidance on what chained steps make sense in terms of efficiency and preserving all information with minimal redundancy.
And clearly we need better tools for GeoJSON and TopoJSON, rather than hacking these internal forms, we are half-ish way through that with some tools here - I'm not sure how what we are doing meshes with geojsonio, yet. TopoJSON requires partial indexing between shared tables (or structures), so it's a good bridge for what the broader central scheme I've been working towards can provide.
Maybe this, and linking to @ateucher helps?
Sorry, I now realize that my previous comment probably doesn't make sense in this context.
Thanks for all the input. I will promote to a new issue for geojson as R list to sf.
Most helpful comment
Wow, lots of discussion since I last had a chance to look at this. I am in favor of @edzer's proposed cases.
My only small quibble is including
return_layersas an argument withst_readmakes the function a bit monolithic and breaks type stability. Why not include anst_infofunction withogrinfolike functionality which can then be used for retrieving layer informatio?