Turf: indexing

Created on 30 Dec 2014 · 8Comments · Source: Turfjs/turf

So far I have rejected anything related to indexing. I think of it as a problem for the data layer, rather than the processing layer. That said, I think it should be considered.

pros

we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).
it is useful functionality, and many users will be unable to figure out how to use index modules in conjunction with turf effectively on their own
several indexes (s2 cells, tile quadkeys, and geohashes) are data independent, so they scale across clusters (r-tree, for example, does not)

cons
I am not aware of an existing convention for indexes in GeoJSON, besides the id field, which is flimsy for this
multiple types of indexes on the same dataset might clash
some common indexing schemes do not scale well in clusters (r-tree)
the fact that users have a hard time understanding indexes might just mean this allows them to do dumb things faster :smile:
scalable indexes tend to rely on "universal axioms" about planet earth; turf should work work gracefully without these contextual assumptions (turf is not currently perfect on this point: units of measurement, WGS84 encoding, assumptions about precision with 32bit floats)

candidates for submodules
tile-cover: fast and scales across clusters
rbush: fast but each box needs the index in memory (this might not matter, depending on some architecture decisions we make)
geohash: fast and scalable, but does not work on lines or polygons (although maybe it would work fine with geohashed bbox sw & ne points)
tilebelt: minimum bbox quadkey would be fast and scalable (same issues as geohash)

I am 100% on the fence with this. Input greatly appreciated from anyone who understands how indexing works at a low level and people who just need to use them). This has huge architecture implications, so we really need both sides.

question wontfix

Source

morganherlocker

❤1 🎉1 😄1 👍1

Most helpful comment

Since there are many different ways to implement an index (r-tree, geohashes, etc...). It's best to leave these types of index operations/implementations outside of TurfJS.

For example, the r-tree index implementation rbush or geojson-rbush can be used.

Without going too much outside the TurfJS scope, we could include the bbox attribute to the output results for a few modules (ex: @turf/line-segment) without sacrificing too much performance loss.

The benefit of having this bbox attribute already existing to each feature would reduce significantly the number of @turf/bbox calculations done when building an geohash/ r-tree index (in the case that the geometry changes, then the bbox attribute would have to be re-calculated).
http://geojson.org/geojson-spec.html#bounding-boxes

GeoJSON BBox attribute example

  { "type": "Feature",
    "properties": {},
    "bbox": [-10.0, -10.0, 10.0, 10.0],
    "geometry": {
      "type": "Polygon",
      "coordinates": [[
        [-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]
        ]]
      }
    ...
    }

@tmcw @morganherlocker Fair to say we can close this issue?

DenisCarriere on 8 Apr 2017

👍3

All 8 comments

I've tinkered with this with the revisions to turf-count and making turf-extent wayfaster - basically for a bunch of algorithms where we repeatedly do something expensive like point-in-polygon, doing point-in-extent first gives a serious boost. But the algorithms that benefit from this, or any other indexing strategy, seem somewhat limited for now.

we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).

Only for repeated checking of the same polygon - the initial processing of the index would be more expensive than the current pip algorithm.

From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

tmcw on 30 Dec 2014

@tmcw automatic indexes on the fly are probably bad (without some finicky heuristics as to when they should happen). What if indexing required a manual call (something like var myPoly = index(myPoly, 'quadkey')), that left an index embedded in the data. This would be checked for in turf subs that could use it.

From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.

Agree with this 100% from an advanced user's perspective; people who can chop processing into reasonable chunks using indexes. I am thinking more about someone trying to find which of a million points are within a polygon. Right now, a beginner is going to be using turf like PostGIS minus indexes (unusably slow in common real world situations).

For example, what if turf-inside looked for a tile index? If its there, then it does a trivial in-out quadkey check. If not, then it goes ahead with the full algorithm. This might be close to best-of-both-worlds.

morganherlocker on 30 Dec 2014

I am thinking more about someone trying to find which of a million points are within a polygon.

I think with the small optimizations I've made for this will give a pretty good bump - I'll try this out later today.

tmcw on 30 Dec 2014

tmcw on 23 Jan 2015

If there are big unknowns or tradeoffs in the choice of index datastructure, would it be possible to define algorithms like turf-inside in terms of an index _protocol_ rather than a specific concrete implementation? Turf would then provide implementations of the protocol for rbush, tile-cover, tilebelt, etc, and users would supply their choice when instantiating a particular algorithm.

jfirebaugh on 20 Feb 2015

Hi all! +1 for protocols as @jfirebaugh mentioned (like rbush implementation with https://github.com/jvrousseau/turf-index).

stepankuzmin on 21 Sep 2015

An old issue, however very useful topic!

So far I've been using RBush for everything index related and it is VERY efficient on extremely large GeoJSON datasets.

I had to create my own "Ad Hoc" GeoJSON implementation in @turf/line-intersect, this example is a bit TOO complex for what we would need.

It would be useful to have a basic GeoJSON support implementation of RBush called @turf/index.

We would need to add some extra default properties while loading the RBush Tree to be able to retrieve the GeoJSON again (index & geojson) and yes this would destroy your memory if you are handling large Objects... (open for suggestions).

tree.insert({
    minX: minX,
    minY: minY,
    maxX: maxX,
    maxY: maxY,
    index: index,
    geojson: geojson
})

It would @returns {RBush Tree}, that way TurfJS only needs to implement loading GeometryCollection, FeatureCollection & Feature.

DenisCarriere on 7 Mar 2017

👍1