So far I have rejected anything related to indexing. I think of it as a problem for the data layer, rather than the processing layer. That said, I think it should be considered.
several indexes (s2 cells, tile quadkeys, and geohashes) are data independent, so they scale across clusters (r-tree, for example, does not)
I am not aware of an existing convention for indexes in GeoJSON, besides the id field, which is flimsy for this
scalable indexes tend to rely on "universal axioms" about planet earth; turf should work work gracefully without these contextual assumptions (turf is not currently perfect on this point: units of measurement, WGS84 encoding, assumptions about precision with 32bit floats)
tile-cover: fast and scales across clusters
I am 100% on the fence with this. Input greatly appreciated from anyone who understands how indexing works at a low level and people who just need to use them). This has huge architecture implications, so we really need both sides.
I've tinkered with this with the revisions to turf-count and making turf-extent wayfaster - basically for a bunch of algorithms where we repeatedly do something expensive like point-in-polygon, doing point-in-extent first gives a serious boost. But the algorithms that benefit from this, or any other indexing strategy, seem somewhat limited for now.
we could use the indexes internally to speed up many algorithms (turf-inside immediately comes to mind).
Only for repeated checking of the same polygon - the initial processing of the index would be more expensive than the current pip algorithm.
From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.
@tmcw automatic indexes on the fly are probably bad (without some finicky heuristics as to when they should happen). What if indexing required a manual call (something like var myPoly = index(myPoly, 'quadkey')), that left an index embedded in the data. This would be checked for in turf subs that could use it.
From what I've seen, the biggest perf issue in turf is the jsts'y algorithms, so they're top priority in my mind. I think indexing is useful but there's more low-hanging fruit to get in the short term.
Agree with this 100% from an advanced user's perspective; people who can chop processing into reasonable chunks using indexes. I am thinking more about someone trying to find which of a million points are within a polygon. Right now, a beginner is going to be using turf like PostGIS minus indexes (unusably slow in common real world situations).
For example, what if turf-inside looked for a tile index? If its there, then it does a trivial in-out quadkey check. If not, then it goes ahead with the full algorithm. This might be close to best-of-both-worlds.
I am thinking more about someone trying to find which of a million points are within a polygon.
I think with the small optimizations I've made for this will give a pretty good bump - I'll try this out later today.
If there are big unknowns or tradeoffs in the choice of index datastructure, would it be possible to define algorithms like turf-inside in terms of an index _protocol_ rather than a specific concrete implementation? Turf would then provide implementations of the protocol for rbush, tile-cover, tilebelt, etc, and users would supply their choice when instantiating a particular algorithm.
Hi all! +1 for protocols as @jfirebaugh mentioned (like rbush implementation with https://github.com/jvrousseau/turf-index).
An old issue, however very useful topic!
So far I've been using RBush for everything index related and it is VERY efficient on extremely large GeoJSON datasets.
I had to create my own "Ad Hoc" GeoJSON implementation in @turf/line-intersect, this example is a bit TOO complex for what we would need.
It would be useful to have a basic GeoJSON support implementation of RBush called @turf/index.
We would need to add some extra default properties while loading the RBush Tree to be able to retrieve the GeoJSON again (index & geojson) and yes this would destroy your memory if you are handling large Objects... (open for suggestions).
tree.insert({
minX: minX,
minY: minY,
maxX: maxX,
maxY: maxY,
index: index,
geojson: geojson
})
It would @returns {RBush Tree}, that way TurfJS only needs to implement loading GeometryCollection, FeatureCollection & Feature.
Since there are many different ways to implement an index (r-tree, geohashes, etc...). It's best to leave these types of index operations/implementations outside of TurfJS.
For example, the r-tree index implementation rbush or geojson-rbush can be used.
Without going too much outside the TurfJS scope, we could include the bbox attribute to the output results for a few modules (ex: @turf/line-segment) without sacrificing too much performance loss.
The benefit of having this bbox attribute already existing to each feature would reduce significantly the number of @turf/bbox calculations done when building an geohash/ r-tree index (in the case that the geometry changes, then the bbox attribute would have to be re-calculated).
http://geojson.org/geojson-spec.html#bounding-boxes
GeoJSON BBox attribute example
{ "type": "Feature",
"properties": {},
"bbox": [-10.0, -10.0, 10.0, 10.0],
"geometry": {
"type": "Polygon",
"coordinates": [[
[-10.0, -10.0], [10.0, -10.0], [10.0, 10.0], [-10.0, 10.0]
]]
}
...
}
@tmcw @morganherlocker Fair to say we can close this issue?
Most helpful comment
Since there are many different ways to implement an index (r-tree, geohashes, etc...). It's best to leave these types of index operations/implementations outside of TurfJS.
For example, the r-tree index implementation
rbushorgeojson-rbushcan be used.Without going too much outside the TurfJS scope, we could include the
bboxattribute to the output results for a few modules (ex: @turf/line-segment) without sacrificing too much performance loss.The benefit of having this
bboxattribute already existing to each feature would reduce significantly the number of@turf/bboxcalculations done when building angeohash/r-treeindex (in the case that the geometry changes, then thebboxattribute would have to be re-calculated).http://geojson.org/geojson-spec.html#bounding-boxes
GeoJSON BBox attribute example
@tmcw @morganherlocker Fair to say we can close this issue?