Hi! I'm working on a project on spatial networks in Greater London and I'm downloading the network data from Geofabrik.
When I read the data I get a warning
london_highways <- sf::st_read(
dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf",
layer = "lines",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62788 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
Created on 2019-12-09 by the reprex package (v0.3.0)
so I checked on gdal website on the meaning of that error and, after reading your answers in #1157, I tried to that set parameter. The problem is that none of these options work:
london_highways <- sf::st_read(
dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf",
layer = "lines",
options = c("OGR_INTERLEAVED_READING=YES"),
stringsAsFactors = FALSE
)
#> options: OGR_INTERLEAVED_READING=YES
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62788 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
london_highways
#> Simple feature collection with 62788 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> First 10 features:
#> osm_id name highway waterway aerialway barrier man_made
#> 1 74 Ballards Lane primary <NA> <NA> <NA> <NA>
#> 2 75 High Road primary <NA> <NA> <NA> <NA>
#> 3 79 East End Road primary <NA> <NA> <NA> <NA>
#> 4 482 Cockfosters Road primary <NA> <NA> <NA> <NA>
#> 5 488 Western Way residential <NA> <NA> <NA> <NA>
#> 6 489 The Linkway residential <NA> <NA> <NA> <NA>
#> 7 490 Sherrards Way residential <NA> <NA> <NA> <NA>
#> 8 491 Grasvenor Avenue residential <NA> <NA> <NA> <NA>
#> 9 1200 Hanger Lane trunk <NA> <NA> <NA> <NA>
#> 10 1201 Hanger Lane Gyratory trunk_link <NA> <NA> <NA> <NA>
#> geometry
#> 1 LINESTRING (-0.193124 51.60...
#> 2 LINESTRING (-0.1767883 51.6...
#> 3 LINESTRING (-0.1979855 51.5...
#> 4 LINESTRING (-0.1607372 51.6...
#> 5 LINESTRING (-0.1886468 51.6...
#> 6 LINESTRING (-0.1883204 51.6...
#> 7 LINESTRING (-0.1896293 51.6...
#> 8 LINESTRING (-0.1892364 51.6...
#> 9 LINESTRING (-0.291608 51.51...
#> 10 LINESTRING (-0.2925582 51.5...
london_highways <- sf::st_read(
dsn = "https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf",
layer = "lines",
options = c("INTERLEAVED_READING=YES"),
stringsAsFactors = FALSE
)
#> options: INTERLEAVED_READING=YES
#> Reading layer `lines' from data source `https://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf' using driver `OSM'
#> Simple feature collection with 0 features and 9 fields
#> bbox: xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
london_highways
#> Simple feature collection with 0 features and 9 fields
#> bbox: xmin: NA ymin: NA xmax: NA ymax: NA
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
#> [1] osm_id name highway waterway aerialway barrier
#> [7] man_made z_order other_tags geometry
#> <0 rows> (or 0-length row.names)
Created on 2019-12-09 by the reprex package (v0.3.0)
How can I set that parameter? Do I have to modify the .ini config file?
This problem is related to https://github.com/ITSLeeds/geofabric/issues/12 so apologies for "crossposting".
The PBF file is approximately 55MB of data and the same problem/warning disapper if I consider a smaller PBF.
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language EN
#> collate English_United Kingdom.1252
#> ctype English_United Kingdom.1252
#> tz Europe/Berlin
#> date 2019-12-09
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.3.2 2019-09-22 [1] CRAN (R 3.6.1)
#> class 7.3-15 2019-01-01 [2] CRAN (R 3.6.1)
#> classInt 0.4-2 2019-10-17 [1] CRAN (R 3.6.1)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
#> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
#> e1071 1.7-2 2019-06-05 [1] CRAN (R 3.6.0)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> KernSmooth 2.23-15 2015-06-29 [2] CRAN (R 3.6.1)
#> knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
#> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
#> rmarkdown 1.17 2019-11-13 [1] CRAN (R 3.6.1)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> sf 0.8-0 2019-09-17 [1] CRAN (R 3.6.1)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.3.0 2019-11-05 [1] CRAN (R 3.6.1)
#> units 0.6-5 2019-10-08 [1] CRAN (R 3.6.1)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
#>
#> [1] C:/Users/Utente/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library
I see the same warning, but why do you think it is an error? Do you have an indication that not all the features are being read?
I never checked that (and I'm not sure how) but this morning I tested that hypothesis with the following expertiment (following the same ideas as in #1157). I think it shows some evidence that some features are missing with the OGR_INTERLEAVED_READING warning.
# packages
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# download some data
download.file("http://download.geofabrik.de/europe/isle-of-man-latest.osm.pbf", "isle-of-man.osm.pbf", mode = "wb")
download.file("http://download.geofabrik.de/europe/andorra-latest.osm.pbf", "andorra.osm.pbf", mode = "wb")
download.file("http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", "greater-london.osm.pbf", mode = "wb")
# isle of man example
isle_of_man <- st_read(
"isle-of-man.osm.pbf",
layer = "lines",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\isle-of-man.osm.pbf' using driver `OSM'
#> Simple feature collection with 12943 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -20.05167 ymin: 48.18019 xmax: -2.913484 ymax: 56.7
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(isle_of_man %>% filter(highway == "primary"))
#> [1] 462
isle_of_man_query <- st_read(
"isle-of-man.osm.pbf",
layer = "lines",
query = "SELECT * FROM lines WHERE (highway = 'primary')",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\isle-of-man.osm.pbf' using driver `OSM'
#> Simple feature collection with 462 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -4.792454 ymin: 54.06365 xmax: -4.31848 ymax: 54.41379
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(isle_of_man_query)
#> [1] 462
# andorra example
andorra <- st_read(
"andorra.osm.pbf",
layer = "lines",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\andorra.osm.pbf' using driver `OSM'
#> Simple feature collection with 5439 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: 0.975577 ymin: 42.32422 xmax: 1.824654 ymax: 42.78834
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(andorra %>% filter(highway == "primary"))
#> [1] 460
andorra_query <- st_read(
"andorra.osm.pbf",
layer = "lines",
query = "SELECT * FROM lines WHERE (highway = 'primary')",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\andorra.osm.pbf' using driver `OSM'
#> Simple feature collection with 460 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: 1.419741 ymin: 42.43526 xmax: 1.737154 ymax: 42.63297
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(andorra_query)
#> [1] 460
# greater london example
greater_london <- st_read(
"greater-london.osm.pbf",
layer = "lines",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\greater-london.osm.pbf' using driver `OSM'
#> Warning in CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
#> GDAL Error 1: Too many features have accumulated in points layer. Use
#> OGR_INTERLEAVED_READING=YES mode
#> Simple feature collection with 62785 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.6124681 ymin: 51.2822 xmax: 0.3760979 ymax: 51.78736
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(greater_london %>% filter(highway == "primary"))
#> [1] 2094
greater_london_query <- st_read(
"greater-london.osm.pbf",
layer = "lines",
query = "SELECT * FROM lines WHERE (highway = 'primary')",
stringsAsFactors = FALSE
)
#> Reading layer `lines' from data source `C:\Users\Utente\AppData\Local\Temp\Rtmp2FNMdi\reprex1bd45b8b2596\greater-london.osm.pbf' using driver `OSM'
#> Simple feature collection with 12600 features and 9 fields
#> geometry type: LINESTRING
#> dimension: XY
#> bbox: xmin: -0.5022575 ymin: 51.28324 xmax: 0.2661656 ymax: 51.68595
#> epsg (SRID): 4326
#> proj4string: +proj=longlat +datum=WGS84 +no_defs
nrow(greater_london_query)
#> [1] 12600
Created on 2019-12-10 by the reprex package (v0.3.0)
I applied the same idea in all the examples: in the first part of each example I read the data using st_read and then I applied a filter (i.e. highway == "primary") while in the second part of each example I applied the same query before reading in the data (i.e. using the query parameter of st_read). In the first two examples I get the same number of rows using both methods while in the third example I get a OGR_INTERLEAVED_READING warning and less features in the first method wrt the second method.
Just to confirm: I can reproduce this issue with GDAL 3.0.2 and have found a work-around, convert to another file format with ogr2ogr first:
# Aim: test gdal's interleaved settings
gdal-config --version
wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf
ogr2ogr \
-f GPKG lnd.gpkg \
greater-london-latest.osm.pbf \
"lines"
Then back in R:
greater_london_gpkg = read_sf("lnd.gpkg")
nrow(greater_london_gpkg %>% filter(highway == "primary")) # 12600
Yes, or all in R:
> gdal_utils("vectortranslate", "greater-london-latest.osm.pbf", "greater-london-latest.gpkg")
> st_layers("greater-london-latest.gpkg")
Driver: GPKG
Available layers:
layer_name geometry_type features fields
1 points Point 326046 10
2 lines Line String 352014 9
3 multilinestrings Multi Line String 4332 4
4 multipolygons Multi Polygon 544434 25
5 other_relations Geometry Collection 15859 4
> st_layers("greater-london-latest.osm.pbf", do_count = TRUE)
Driver: OSM
Available layers:
layer_name geometry_type features fields
1 points Point 326046 10
2 lines Line String 62788 9
3 multilinestrings Multi Line String 0 4
4 multipolygons Multi Polygon 0 25
5 other_relations Geometry Collection 0 4
Warning messages:
1: In CPL_get_layers(dsn, options, do_count) :
GDAL Error 1: Too many features have accumulated in lines layer. Use OGR_INTERLEAVED_READING=YES mode
2: In CPL_get_layers(dsn, options, do_count) :
GDAL Error 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode
Yes, it is not part of the GDAL C API for GDAL utilities.
Ah I see, thanks for clarifying!
I just tested that code and it doesn't work on my laptop since I get the following error:
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
> gdal_utils("vectortranslate", "greater-london-latest.osm.pbf", "greater-london-latest.gpkg")
#> Error in gdal_utils("vectortranslate", "greater-london-latest.osm.pbf", :
#> gdal_utils vectortranslate: an error occured
#> In addition: Warning messages:
#> 1: In CPL_gdalvectortranslate(source, destination, options) :
#> GDAL Message 6: Normalized/laundered field name: 'admin_level' to 'admin_le_3'
#> 2: In CPL_gdalvectortranslate(source, destination, options) :
#> GDAL Error 6: Geometry type of `Geometry Collection' not supported in shapefiles. Type can be #> overridden with a layer creation option of SHPT=POINT/ARC/POLYGON/MULTIPOINT/POINTZ/ARCZ/POLYGONZ/MULTIPOINTZ/MULTIPATCH.
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 3.6.1 (2019-07-05)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language EN
#> collate English_United Kingdom.1252
#> ctype English_United Kingdom.1252
#> tz Europe/Berlin
#> date 2019-12-10
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
#> backports 1.1.5 2019-10-02 [1] CRAN (R 3.6.1)
#> callr 3.4.0 2019-12-09 [1] CRAN (R 3.6.1)
#> class 7.3-15 2019-01-01 [2] CRAN (R 3.6.1)
#> classInt 0.4-2 2019-10-17 [1] CRAN (R 3.6.1)
#> cli 1.1.0 2019-03-19 [1] CRAN (R 3.6.0)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 3.6.0)
#> DBI 1.0.0 2018-05-02 [1] CRAN (R 3.6.0)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
#> devtools 2.2.1 2019-09-24 [1] CRAN (R 3.6.1)
#> digest 0.6.23 2019-11-23 [1] CRAN (R 3.6.1)
#> e1071 1.7-3 2019-11-26 [1] CRAN (R 3.6.1)
#> ellipsis 0.3.0 2019-09-20 [1] CRAN (R 3.6.1)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
#> fs 1.3.1 2019-05-06 [1] CRAN (R 3.6.1)
#> glue 1.3.1 2019-03-12 [1] CRAN (R 3.6.0)
#> highr 0.8 2019-03-20 [1] CRAN (R 3.6.0)
#> htmltools 0.4.0 2019-10-04 [1] CRAN (R 3.6.1)
#> KernSmooth 2.23-15 2015-06-29 [2] CRAN (R 3.6.1)
#> knitr 1.26 2019-11-12 [1] CRAN (R 3.6.1)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 3.6.0)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 3.6.0)
#> pkgbuild 1.0.6 2019-10-09 [1] CRAN (R 3.6.1)
#> pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.6.0)
#> prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.6.0)
#> processx 3.4.1 2019-07-18 [1] CRAN (R 3.6.0)
#> ps 1.3.0 2018-12-21 [1] CRAN (R 3.6.0)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 3.6.1)
#> Rcpp 1.0.3 2019-11-08 [1] CRAN (R 3.6.1)
#> remotes 2.1.0 2019-06-24 [1] CRAN (R 3.6.1)
#> rlang 0.4.2 2019-11-23 [1] CRAN (R 3.6.1)
#> rmarkdown 1.18 2019-11-27 [1] CRAN (R 3.6.1)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.6.0)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
#> sf * 0.8-1 2019-12-10 [1] Github (r-spatial/sf@60208b7)
#> stringi 1.4.3 2019-03-12 [1] CRAN (R 3.6.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
#> testthat 2.3.1 2019-12-01 [1] CRAN (R 3.6.1)
#> units 0.6-5 2019-10-08 [1] CRAN (R 3.6.1)
#> usethis 1.5.1 2019-07-04 [1] CRAN (R 3.6.1)
#> withr 2.1.2 2018-03-15 [1] CRAN (R 3.6.0)
#> xfun 0.11 2019-11-12 [1] CRAN (R 3.6.1)
#> yaml 2.2.0 2018-07-25 [1] CRAN (R 3.6.0)
#>
#> [1] C:/Users/Utente/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.1/library
so I think that it's a problem related to my GDAL version. I followed the instructions here for updating the GDAL distribution on windows using Conda and I completed all the steps but I still get the same "old" GDAL version linked to sf package. Sorry for the stupid question, but how can I update that?
Switch to Linux ; )
Support for auto-recognizing target driver from file name extension (i.e. .gpkg means geopackage) was apparently added to GDAL after 2.2.3; you'd have to specify the output format, which you can do. Anyway, this is getting of topic, so closing now.
Thanks, and just for future reference, I solved that problem as follows:
gdal_utils(
util = "vectortranslate",
source = "greater-london-latest.osm.pbf",
destination = "greater-london-latest.gpkg",
options = c("-f", "GPKG")
)
If you want I could create a PR and add this example to gdal_utils help page. I've never used GDAL (so maybe it's an obvious example) but it took me a lot of time to understand that options = "-f GPKG" was not the correct syntax.
PR welcome!
I'm not sure the original issue is solved. @edzer do you think the best solution in this situation, building on the reprex above, is to first convert the object and then read it in, e.g.:
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
download.file("http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf", "greater-london-latest.osm.pbf", mode = "wb")
gdal_utils(
util = "vectortranslate",
source = "greater-london-latest.osm.pbf",
destination = "greater-london-latest.gpkg",
options = c("-f", "GPKG")
)
greater_london_gpkg = read_sf("greater-london-latest.gpkg", layer = "lines")
nrow(greater_london_gpkg)
#> [1] 352158
Created on 2019-12-10 by the reprex package (v0.3.0)
I imagine there could be a faster way, e.g. selecting the layer in the translate stage, but not sure.
Out of interest I did a benchmark. An interesting question is 'at what point is interleaving/translate needed? For context, @agila5 and I are working on a package that aims to make access to large OSM datasets quicker and easier for R users (comments/suggestions on this welcome): https://github.com/ITSLeeds/geofabric
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
setwd("~/atfutures/who-data/")
get_highways1 = function(x, highway_type) {
x_gpkg = paste0(x, ".gpkg")
gdal_utils(
util = "vectortranslate",
source = x,
destination = x_gpkg,
options = c("-f", "GPKG")
)
full_res = read_sf(x_gpkg, layer = "lines")
full_res[full_res$highway == highway_type, ]
}
get_highways2 = function(x, highway_type) {
q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
read_sf(x, layer = "lines", query = q)
}
h1 = get_highways1("greater-london-latest.osm.pbf", "primary")
h2 = get_highways2("greater-london-latest.osm.pbf", "primary")
plot(st_geometry(h1))
plot(st_geometry(h2))

bench::mark(iterations = 1, check = F,
get_highways1("greater-london-latest.osm.pbf", "primary"),
get_highways2("greater-london-latest.osm.pbf", "primary")
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median
#> <bch:expr> <bch> <bch:>
#> 1 get_highways1("greater-london-latest.osm.pbf", "primary") 30.1s 30.1s
#> 2 get_highways2("greater-london-latest.osm.pbf", "primary") 2.7s 2.7s
#> # … with 3 more variables: `itr/sec` <dbl>, mem_alloc <bch:byt>, `gc/sec` <dbl>
Created on 2019-12-11 by the reprex package (v0.3.0)
I'm not sure the original issue is solved.
IMO the issue is not solved from the geofabric package pov and for pbf much bigger than my original example. For example, the PBF file of the region where I live is here and it's almost 400MB. The corresponding GPKG file is 2.2GB and the conversion time is more than 15 minutes.
> system.time({
+ gdal_utils(
+ util = "vectortranslate",
+ source = "nord-ovest-latest.osm.pbf",
+ destination = "nord-ovest-latest.gpkg",
+ options = c("-f", "GPKG")
+ )
+ })
user system elapsed
964.39 425.41 1464.15
> file.size("nord-ovest-latest.gpkg") / 1e9
[1] 2.404442
Still I'm not sure if it's possible to read the PBF file without the conversion e.g. by properly setting the INTERLEAVED_READING option (and check the timings) since I don't know how to do that.
An interesting question is 'at what point is interleaving/translate needed?
I read the GDAL documentation of the PBF format but they did not mention anything about the "maximum size" (and I'm not sure it does make sense to talk about a max size). They explain the problem in the interleaved reading section.
Interesting, does that section mean that OGR_INTERLEAVED_READING=YES is outdated on recent versions of GDAL?
Starting with GDAL 2.2, applications should use the GDALDataset::GetNextFeature() API to iterate over features in the order they are produced.
@edzer this issue does not seem resolved and am not sure why it was closed, other than it briefly going off topic.
It seems OGR_... is, but INTERLEAVED_READING=YES is not outdated:
edzer@gin-edzer:/tmp$ ogrinfo -oo INTERLEAVED_READING=YES *pbf points > x2
edzer@gin-edzer:/tmp$ ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1
Warning 6: driver OSM does not support open option OGR_INTERLEAVED_READING
ERROR 1: Too many features have accumulated in lines layer. Use OGR_INTERLEAVED_READING=YES mode
but that the effect is the same (x1 and x2 are identical). My idea behind closing this was that I felt it was not an sf issue, but if anything, a GDAL issue. What do you think?
I think it may be a GDAL issue and was considering raising an issue. If you're up for mentoring us through the process and help us develop a decent reprex (e.g. based on the rich examples above) I would happily submit one. Note, thought, that I've commented on a closed related issue here: https://github.com/OSGeo/gdal/issues/1785
From what I can gather a good starting point for a GDAL reprex could be something like:
wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf
ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1
Although I must confess I'm not 100% sure what the second line does and suspect we may get a better response if you raise the issue on our (and sf's) behalf.
I think the warning & error message sufficiently illustrate the problem, and that the first command
ogrinfo -oo INTERLEAVED_READING=YES *pbf points > x2
suggests the docs are correct, but that the error message gives the wrong hint.
OK, shall I open a ticket on the GitHub Issue tracker?
And will the solution to that issue resolve this issue sf side (or do we just need to add some incantation like options = c("INTERLEAVED_READING", "YES") (I've not tested this)?
And will the solution to that issue resolve this issue sf side (or do we just need to add some incantation like
options = c("INTERLEAVED_READING", "YES")(I've not tested this)?
What makes you think so?
Tested it and no it doesn't work:
greater_london <- st_read(
+ "greater-london-latest.osm.pbf",
+ layer = "lines",
+ stringsAsFactors = FALSE,
+ options = c("INTERLEAVED_READING", "YES")
+ )
options: INTERLEAVED_READING YES
Reading layer `lines' from data source `/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/atfutures/creds2/od-data/greater-london-latest.osm.pbf' using driver `OSM'
Simple feature collection with 0 features and 9 fields
bbox: xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Warning messages:
1: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
GDAL Message 6: open option 'INTERLEAVED_READING' is not formatted with the key=value format
2: In CPL_read_ogr(dsn, layer, query, as.character(options), quiet, :
GDAL Message 6: open option 'YES' is not formatted with the key=value format
This one did though, issue solved I think :tada:
greater_london <- st_read(
+ "greater-london-latest.osm.pbf",
+ layer = "lines",
+ stringsAsFactors = FALSE,
+ options = "INTERLEAVED_READING=YES")
options: INTERLEAVED_READING=YES
Reading layer `lines' from data source `/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/atfutures/creds2/od-data/greater-london-latest.osm.pbf' using driver `OSM'
Simple feature collection with 0 features and 9 fields
bbox: xmin: NA ymin: NA xmax: NA ymax: NA
epsg (SRID): 4326
proj4string: +proj=longlat +datum=WGS84 +no_defs
Can you reproduce this @agila5 ?
I think it's the same example as in my first comment and the problem is that with that option it reads 0 features. It doesn't make sense, right?
I agree, but I get the same when I try
edzer@gin-edzer:/tmp$ ogrinfo -oo INTERLEAVED_READING=YES *pbf lines > l2
edzer@gin-edzer:/tmp$ ogrinfo *pbf lines > l1
ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode
edzer@gin-edzer:/tmp$ wc l1 l2
470818 1744967 19993830 l1
30 74 831 l2
470848 1745041 19994661 total
suggesting it is a GDAL issue, not an sf issue.
Correct, it returns 0 features, sorry for going round in circles. Looks like a GDAL issue that could solve this issue when closed (it may be worth keeping this issue open as an issue that requires changes in GDAL to be closed). Final question: how to I demonstrate that
ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf points > x1
yields 0 features?
Open x1 in a text editor?
Please proof-read this issue:
Dear GDAL dev team, I think there is an issue with the osm driver. The error message it produces does not seem to be useful (it says Use OGR_INTERLEAVED_READING=YES when that option is specified) and when using INTERLEAVED_READING=YES, which I think is correctly documented here https://gdal.org/drivers/vector/osm.html#interleaved-reading , the results seem to have no features.
Please could you try to reproduce the (hopefully reproducible) example below?
# Aim: demonstrate issues with the OSM driver
gdalinfo --version
# $ GDAL 2.4.2, released 2019/06/28
# get data
wget http://download.geofabrik.de/europe/great-britain/england/greater-london-latest.osm.pbf
ogrinfo -oo OGR_INTERLEAVED_READING=YES *pbf lines > x2
# $ Warning 6: driver OSM does not support open option OGR_INTERLEAVED_READING
# $ ERROR 1: Too many features have accumulated in points layer. Use OGR_INTERLEAVED_READING=YES mode
ogrinfo -oo INTERLEAVED_READING=YES *pbf lines > x3
wc x3
# $ 27 65 740 x3 # contains no features
This may be related to https://github.com/OSGeo/gdal/issues/1785.
See here for use case and some additional tests from within R: https://github.com/r-spatial/sf/issues/1213
(If you post that comment as a github issue I would also cite the other related github issue, i.e. https://github.com/OSGeo/gdal/issues/1785, maybe helps with the context.)
Thanks @agila5, I've added the link to that closed issue. @edzer good to go from your perspective?
I think we have a solution. In https://github.com/OSGeo/gdal/issues/2100 Even Rouault suggests converting only the layer we need to a GeoPackage before reading. I was sceptical about this 2 stage process but it seems it is indeed the quickest way, as benchmarked below (with conversion only needing to run once, hence 2 get3() calls):
get1 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
x_gpkg = paste0(x, ".gpkg")
gdal_utils(
util = "vectortranslate",
source = x,
destination = x_gpkg,
options = c("-f", "GPKG")
)
full_res = read_sf(x_gpkg, layer = "lines")
full_res[full_res$highway == highway_type, ]
}
get2 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
read_sf(x, layer = "lines", query = q)
}
get3 = function(x = "greater-london-latest.osm.pbf", highway_type = "residential") {
q = paste0("SELECT * FROM lines WHERE (highway = '", highway_type, "')")
dest_filename = paste0(x, ".gpkg")
if(!file.exists(dest_filename)) {
sf::gdal_utils(
util = "vectortranslate",
source = "greater-london-latest.osm.pbf",
destination = dest_filename,
options = c("-f", "GPKG", "lines")
)
}
st_read(dest_filename, query = q)
}
h1 = get1()
h2 = get2()
h3 = get3()
nrow(h1)
nrow(h2)
nrow(h3)
file.remove("greater-london-latest.osm.pbf.gpkg")
bench::mark(iterations = 1, check = FALSE, get1(), get2(), get3(), get3())
Could you try to reproduce this @agila5 to check it works on Windows? If so I suggest we go with this solution in geofabric.
I get slighlty different results between h1 and h2 and h3 because full_res[full_res$highway == highway_type, ] includes only the rows in which highway is NA. In fact,
> table(h1$highway, useNA = "ifany")
residential <NA>
70406 74234
Another thing that I noticed is that the geometry column in h2 isn't called geom or geometry
> colnames(h2)
[1] "osm_id" "name" "highway" "waterway" "aerialway"
[6] "barrier" "man_made" "z_order" "other_tags" "_ogr_geometry_"
If I correct those "problems", then I get:
> bench::mark(iterations = 10, get1(), get2(), get3())
# A tibble: 3 x 13
expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time
<bch:expr> <bch:> <bch:> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <lis>
1 get1() 1.61m 1.67m 0.0100 193.4MB 0.0631 10 63 16.6m <tibb~ <df[,~ <bch~
2 get2() 20.96s 21.38s 0.0463 26.5MB 0.0370 10 8 3.6m <tibb~ <df[,~ <bch~
3 get3() 1.45s 1.5s 0.658 26.5MB 0.592 10 9 15.2s <tibb~ <df[,~ <bch~
# ... with 1 more variable: gc <list>
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled.
Anyway I will add more tests and I think that now we can move the discussion to geofabrik repo.
Great, all that is pointing towards the 'h3' solution for me, main thing, it works (touch wood)! Great thread, good idea to move it back to geofabric repo.