We have been discussing this for a while and it seems to be that quadtree(s) are much faster to be indexed. Also, most people don't realize that setting precision
will then default distance_error_pct
to zero, also causing the indexing times of geo shapes to be very slow.
My proposal here is to:
tree
to quadtree
.0.001
when precision
is set.CC @nknize
Thanks @gmoskovicz
So we are trying to get rid of quadtree, but yet it is still the fastest and should be the default?
I probably confused you a bit...cause English :) Quadtree is a general algorithm for rasterizing geoshapes. How those quad raster cells are encoded in lucene's terms dictionary depends on the "tree"
parameter which tells Lucene to encode either as a geohash string or packed binary Morton code. Packed morton encoding (selected by setting "tree" : "quadtree"
is faster (index and query) and consumes less disk space than using "geohash"
so that's why quadtree encoding is recommended.
Now, index and query speed is directly related to shape size and resolution because under the hood of Lucene each quad raster cell is converted into a bounding box and compared to the vectorized shape using a call to the JtsGeometry.relate
method you reference in #2157. The higher the resolution the more cells are created (like a high resolution image) so the more calls to relate
; and that method is slow. This is the general problem with using a raster based quadtree. For this reason we're working toward migrating to a B-kd
based geo_shape
indexing approach and (eventually) moving away from quadtrees altogether.
Hopefully that makes more sense.
@nknize is this the reason for my issue #25844 ?
for bigger polygon the index is slow.
@karesh That's right. For the issue you describe in #25844 I'd recommend taking a similar mapping approach to what's suggested above (e.g., set precision
to something like 10m
and default_error_pct
to something like 0.025
- but play around w/ these parameters) . If the precision
parameter is set, the distance_error_pct
parameter is automatically set to 0; rationale being that users expect precision accuracy when they define an explicit precision. However, this can lead to performance issues as these quad cells are created (at both index and query time). The distance_error_pct
will help alleviate these performance issues at the cost of false positives. At the moment there is no post-filtering of the results returned by lucene so you'd have to post filter the results in your application.
Just to +1 this, after advice from @gmoskovicz we've changed to the quadtree type and have seen a dramatic improvement in all metrics, especially indexing and search latency (in the case of search latency, from ~50ms to sub 1ms). Feels like as long as it's explained well, it should be the default.
+1 for changing the defaults as indicated. And just noting that for a project I'm assisting, with several simple polygons indexed as precision: 10m
and tree: quadtree
, indexing is very slow, consumed disk space is very large, and setting distance_error_pct: 0.001
doesn't really help with either.
@nknize any ETA for a bkd-tree based geo-shape indexing?
@synhershko have you seen any improvement testing the defaults vs quadtree
with some error percentage?
I wouldn't expect it to fix every single case (because https://github.com/elastic/elasticsearch/issues/25833#issuecomment-317116918) but i would expect it to have a significant change comparing to the defaults.
Using 5.6.0 and I'm assuming the default is distance_error_pct: 0
- so we compared it to that with no significant improvement (or none at all). Started with quadtree to begin with.
fwiw the most significant drop in performance (and increase in disk size) was when we went from precision: 50m
to precision: 10m
(both are quadtree and distance_error_pct: 0.001
). In our tests, 50m gives us bulk of 500 in ~1s and about 50kb/doc on disk while 10m is bulk of 500 in ~5s and about 150kb/doc on disk.
Since BKD-backed geo_shapes are soon to be the default (#35320), I'm going to close this issue.
I think the distance_error_pct
default still has merit though, and will be addressing that in a different ticket (which results in OOM).
++ #35320 also sets the default tree
parameter to quadtree
for PrefixTrees
Most helpful comment
Just to +1 this, after advice from @gmoskovicz we've changed to the quadtree type and have seen a dramatic improvement in all metrics, especially indexing and search latency (in the case of search latency, from ~50ms to sub 1ms). Feels like as long as it's explained well, it should be the default.