Sf: Run SQL query on OGR layers

Created on 30 Aug 2018  Â·  24Comments  Â·  Source: r-spatial/sf

An SO questioner: https://stackoverflow.com/questions/52086882/issue-with-query-in-st-read/52102296#52102296 wanted to filter shapefiles with an SQL query, as you can do in ogr2ogr for example, but the query parameter is only valid for DBIObject reads in st_read.

I answered by showing how to build a VRT file with the SQL query, which works fine but is a bit of a pain.

Could gdal_read.cpp use the ExecuteSQL method on an optional passed SQL query string? I experimented and had success with putting:

poDS->ExecuteSQL("select * from codes_postaux_region where POP2010 > 20000", NULL, NULL);

on line 225, which successfully reduced my read data from 6000 features to 707. I'm not sure I completely understand what the ExecuteSQL method does, or what the returned OGRLayer is (see https://www.gdal.org/classGDALDataset.html#a5b65948b1e15fa63e96c0640eb6c5d7c) or what the spatial filter or dialect parameters can do (I left them NULL) but its a proof-of-concept. Real code would have to pass the query string through, changing the parameters in the C++ and the R and so on, and my Rcpp is inadequate to get this all right, so I just hardcoded one SQL test string.

I can't see any discussion here about implementing this already, apologies if this has already been excluded as a possible enhancement.

B

All 24 comments

Nice idea! PR welcome.

This issue is very similar to https://github.com/hypertidy/vapour/issues/34. Maybe it would be of use here?

There's a more complete working example in a branch of my fork here: https://github.com/barryrowlingson/sf/tree/execute_sql - I've no idea of the quality so I'm not sure I should hit the PR button. I can do if it helps, but I don't really know what I'm doing with C++...

Looks OK; I wonder whether the select * from nc cannot be constructed (probably on the R side), since we know that the layer is nc. The real query is the where clause, right? @mdsumner

Also, would it be possible to do selections based on geometries? Maybe not, or at least not portably, since GDAL on MacOS X does not compile against GEOS (where dynamic linking not allowed by CRAN, static not made possible by GDAL devs); @rsbivand ? Nevertheless, it might be nice-to-have on platforms that do support it?

You might not want to select all columns from an OGR data source. I was following the ogr2ogr parameter -sql, but maybe st_read could also have a where= parameter?

I also think my patch needs to release the result set from the ExecuteSQL call, according to the documentation. https://www.gdal.org/classGDALDataset.html#ab2c2b105b8f76a279e6a53b9b4a182e0

In my experience it's better to scan the extents and attributes, and short circuit on a limited number of features or on attribute values - ExecuteSQL is less efficient than that (for OGRSQL). I don't want all this loaded in this one mega package, but I'm happy to share lessons from vapour for those of you that do.

Some layer names cause tricky escaping problems, hence efforts to read FID values directly - I just got stuck on 64-bit int issues for a complete treatment.

Here's my suggestion for sf, let the user take responsibility for what they get from:

  • input sql argument triggers ExecuteSQL in st_read and functions reading bbox, geometry or attributes
  • input limit_n argument short circuits feature iteration
  • add st_read_bbox for GDAL sources (API has a method on features)
  • add read_attributes for GDAL sources
  • consider custom filters tested during feature iteration

That covers a lot of desrable customization IMO, and might help trigger greater PR input. All those ideas are implemented in vapour at least to a proof of concept level. I want them as an independent API, so I'm figuring it out independently.

Can st_sfc create a vector of bbox? Probably not, is there anything wrong with that? I'll try

Nonstarter, raw vectors of bounds much better. NVM, I'll work on the other stuff.

demo(nc, ask = FALSE)
 st_as_sfc(st_bbox(nc))

Let's go @barryrowlingson ! Any blocker here? I finally got DBI/dbplyr support to work so I'm keen to have this in sf proper

I'm planning to release a new sf version soon. Could one of you come up with a PR to do this, preferably with one or two tests?

I'm close, didn't want to stomp on @barryrowlingson here.

There's an issue with the geometry column though, because (I think) CPL_read_ogr is getting the geom from the fields, rather than with poFeature->GetGeometryRef. I haven't tracked down if it can be made to work with GPKG, with for example

## gets no 'geom' so no geometry column
st_read("nc.gpkg", query = "SELECT NAME FROM 'nc.gpkg'")  

whereas

## gets 'geometry column
st_read("nc.shp", query = "SELECT NAME FROM nc")  

I guess I'll push the PR so it can be investigated

Go for it Michael, I rate your competency above mine for this.

On Wed, Sep 26, 2018 at 3:14 PM, Michael Sumner notifications@github.com
wrote:

I'm close, didn't want to stomp on @barryrowlingson
https://github.com/barryrowlingson here.

There's an issue with the geometry column though, because (I think)
CPL_read_ogr is getting the geom from the fields, rather than with
poFeature->GetGeometryRef. I haven't tracked down if it can be made to
work with GPKG, with for example

gets no 'geom' so no geometry column

st_read("nc.gpkg", query = "SELECT NAME FROM 'nc.gpkg'")

whereas

gets 'geometry column

st_read("nc.shp", query = "SELECT NAME FROM nc")

I guess I'll push the PR so it can be investigated

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/r-spatial/sf/issues/834#issuecomment-424731143, or mute
the thread
https://github.com/notifications/unsubscribe-auth/AA2QlLroF2IrdVyudrt4lKR09h7HIZ8aks5ue4uygaJpZM4WT6he
.

Thanks! @mdsumner It uses GetGeomFieldRef(i) which according to this should be equivalent when i is 0. It just loops over the geometry columns, of which usually there is only 1.

Thanks so much for the fast action!!

No worries, I'm definitely worried about this though - did you notice in the PR comment?

gpkg_layer <- "nc.gpkg"
## we specify  geom
nc_gpkg_sql = st_read(system.file("gpkg/nc.gpkg", package = "sf"),
        query = sprintf("SELECT NAME, SID74, FIPS, geom  FROM '%s' WHERE BIR74 > 20000", gpkg_layer))

## no geom, no sf  (shp doesn't need the select geom)
nc_gpkg_no_geom = st_read(system.file("gpkg/nc.gpkg", package = "sf"),
        query = sprintf("SELECT NAME, SID74, FIPS  FROM '%s' WHERE BIR74 > 20000", gpkg_layer))

Yes. Can we use layer for %s? By default, this is taken to be basename(dsn):

> basename("gpkg/nc.gpkg")
[1] "nc.gpkg"

I think that's peculiar to GPKG, usually it's basename without extension, or arbitrary, multiple, and unrelated to the filename. The layer name doesn't change when the actual geopackage file name is changed, so it must be baked in on creation.

Ah: sometimes (shape) we need no quotes around layer, other times (gpkg) we do. Is that it?

Yes, for gpkg layer name is baked into the gpkg file, not derived from its name.

I though single quotes might be a safe default, but no - double quotes seem to be:

## fails
st_read(system.file("shape/nc.shp", package="sf"),
        query = "SELECT NAME, SID74, FIPS FROM 'nc' WHERE BIR74 > 20000")

# ok
st_read(system.file("shape/nc.shp", package="sf"),
                            query = "SELECT NAME, SID74, FIPS FROM \"nc\" WHERE BIR74 > 20000")

# ok
st_read(system.file("gpkg/nc.gpkg", package="sf"),
        query = "SELECT NAME, SID74, FIPS, geom FROM 'nc.gpkg' WHERE BIR74 > 20000")

That's definitely driver dependent sometimes, i.e. Access, Manifold and SQL Server afaik will take square brackets [layer], but there's also a difference when OGRSQL can deal with the query versus passing it to the engine.

This works with the PostGIS driver:

st_read("PG:dbname=postgis", "sids", 
      query = "SELECT NAME, BIR74, FIPS, wkb_geometry FROM \"sids\" WHERE BIR74 > 20000")

So, we settle with double quotes? I guess this has to go into the docs, as users supply a full query?

Yes I don't think it's worth wrapping here, there is just too many possibilities, and DBI has facilities for escaping. The geom-as-field is the major problem, because while it's easy enough to discover interactively 'select * from layer limit 1' that won't work for DBI. But, I think it's not for right now because it will need pretty big changes.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

tiernanmartin picture tiernanmartin  Â·  3Comments

kendonB picture kendonB  Â·  3Comments

gregmacfarlane picture gregmacfarlane  Â·  4Comments

ekarsten picture ekarsten  Â·  4Comments

angela-li picture angela-li  Â·  4Comments