Elasticsearch: Default geo_shapes to quadtree

Created on 21 Jul 2017  路  10Comments  路  Source: elastic/elasticsearch

We have been discussing this for a while and it seems to be that quadtree(s) are much faster to be indexed. Also, most people don't realize that setting precision will then default distance_error_pct to zero, also causing the indexing times of geo shapes to be very slow.

My proposal here is to:

  1. Change the default tree to quadtree .
  2. Default distance_error_pct to 0.001 when precision is set.

CC @nknize

:AnalyticGeo >enhancement

Most helpful comment

Just to +1 this, after advice from @gmoskovicz we've changed to the quadtree type and have seen a dramatic improvement in all metrics, especially indexing and search latency (in the case of search latency, from ~50ms to sub 1ms). Feels like as long as it's explained well, it should be the default.

All 10 comments

Thanks @gmoskovicz

So we are trying to get rid of quadtree, but yet it is still the fastest and should be the default?

I probably confused you a bit...cause English :) Quadtree is a general algorithm for rasterizing geoshapes. How those quad raster cells are encoded in lucene's terms dictionary depends on the "tree" parameter which tells Lucene to encode either as a geohash string or packed binary Morton code. Packed morton encoding (selected by setting "tree" : "quadtree" is faster (index and query) and consumes less disk space than using "geohash" so that's why quadtree encoding is recommended.

Now, index and query speed is directly related to shape size and resolution because under the hood of Lucene each quad raster cell is converted into a bounding box and compared to the vectorized shape using a call to the JtsGeometry.relate method you reference in #2157. The higher the resolution the more cells are created (like a high resolution image) so the more calls to relate; and that method is slow. This is the general problem with using a raster based quadtree. For this reason we're working toward migrating to a B-kd based geo_shape indexing approach and (eventually) moving away from quadtrees altogether.

Hopefully that makes more sense.

@nknize is this the reason for my issue #25844 ?
for bigger polygon the index is slow.

@karesh That's right. For the issue you describe in #25844 I'd recommend taking a similar mapping approach to what's suggested above (e.g., set precision to something like 10m and default_error_pct to something like 0.025 - but play around w/ these parameters) . If the precision parameter is set, the distance_error_pct parameter is automatically set to 0; rationale being that users expect precision accuracy when they define an explicit precision. However, this can lead to performance issues as these quad cells are created (at both index and query time). The distance_error_pct will help alleviate these performance issues at the cost of false positives. At the moment there is no post-filtering of the results returned by lucene so you'd have to post filter the results in your application.

Just to +1 this, after advice from @gmoskovicz we've changed to the quadtree type and have seen a dramatic improvement in all metrics, especially indexing and search latency (in the case of search latency, from ~50ms to sub 1ms). Feels like as long as it's explained well, it should be the default.

+1 for changing the defaults as indicated. And just noting that for a project I'm assisting, with several simple polygons indexed as precision: 10m and tree: quadtree, indexing is very slow, consumed disk space is very large, and setting distance_error_pct: 0.001 doesn't really help with either.

@nknize any ETA for a bkd-tree based geo-shape indexing?

@synhershko have you seen any improvement testing the defaults vs quadtree with some error percentage?

I wouldn't expect it to fix every single case (because https://github.com/elastic/elasticsearch/issues/25833#issuecomment-317116918) but i would expect it to have a significant change comparing to the defaults.

Using 5.6.0 and I'm assuming the default is distance_error_pct: 0 - so we compared it to that with no significant improvement (or none at all). Started with quadtree to begin with.

fwiw the most significant drop in performance (and increase in disk size) was when we went from precision: 50m to precision: 10m (both are quadtree and distance_error_pct: 0.001). In our tests, 50m gives us bulk of 500 in ~1s and about 50kb/doc on disk while 10m is bulk of 500 in ~5s and about 150kb/doc on disk.

Since BKD-backed geo_shapes are soon to be the default (#35320), I'm going to close this issue.

I think the distance_error_pct default still has merit though, and will be addressing that in a different ticket (which results in OOM).

++ #35320 also sets the default tree parameter to quadtree for PrefixTrees

Was this page helpful?
0 / 5 - 0 ratings

Related issues

jpountz picture jpountz  路  3Comments

dadoonet picture dadoonet  路  3Comments

Praveen82 picture Praveen82  路  3Comments

rjernst picture rjernst  路  3Comments

abtpst picture abtpst  路  3Comments