I'm seeing very slow indexing with ElasticSearch when I set up a geo-shape type.
I'm using a stock 5.0.0 on Windows 7, no customisations, and I'm accessing it via the Python (3.5) wrapper. One shard.
I have 75 documents to index and if I don't set up a mapping, they index near instantly.
But the second I include a geo-shape mapping, the processing time rockets, as does CPU usage and inevitably it times out on the index creation, even with the timeout increased to 30s.
My mapping for the geo-shape looks like this:
"bbox": {
"type": "geo_shape",
"precision": "100m"
},
The rest of the mapping is just `type: text`. `_all` is disabled.
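For context, a minimal sketch of what the full index-creation body might look like around that fragment. The type name (`doc`) and the text field (`title`) are placeholders, not taken from the original report:

```python
# Hypothetical reconstruction of the reporter's mapping body.
# "doc" and "title" are placeholder names; only the "bbox" geo_shape
# field and the disabled _all are from the report itself.
def make_mapping_body(precision="100m"):
    return {
        "mappings": {
            "doc": {
                "_all": {"enabled": False},   # _all disabled, as stated above
                "properties": {
                    "title": {"type": "text"},
                    "bbox": {
                        "type": "geo_shape",
                        "precision": precision,
                    },
                },
            }
        }
    }

body = make_mapping_body()
```

This body would then be passed to `es.indices.create(index=..., body=body)`.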
Each document has up to two geo-shapes. The shapes are simple envelope types as they're only bounding boxes - spatially this should be the simplest indexing there is for a polygon.
The problem is exaggerated by the precision, but even with a precision of 1000m, when I send in larger quantities of documents, it still flakes out after only a few hundred have been indexed.
Changing the tree type to quadtree is several times faster than the geohash, but it still times out after using considerably more resources.
I've seen the section on "performance considerations" in the docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html) - but this is far slower than I would expect any spatial indexing to be.
In a conventional spatial RDBMS (PostGIS/Oracle) I'm used to being able to insert hundreds if not thousands of complex spatial features every second.
ElasticSearch is failing to handle even a few dozen basic bounding-box geometries per second.
I appreciate I haven't optimised my deployment (still in dev), but an unoptimised RDBMS is orders of magnitude faster at the same operation. I think there's considerable scope for optimisation here, especially compared to ES's non-spatial indexing.
@nknize could you provide some wisdom please
I'm experiencing something similar. I've created a brutish python script for showcasing this issue.
The following simple Python script creates a `geo_shape` mapping and adds data. It indexes roughly 5 linestrings per second (about 2 minutes per 1000), while triggering GC allocation failures for the entire duration of the load. On my current single-node system I would expect a minimum of 1000 geo_shapes per second.
```python
from random import random, randint
import math
import time

try:
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
except ImportError:
    quit()

es_index = 'assets'


def get(query, index=es_index, es=es):
    body = {
        "query": query,
        "size": 1000
    }
    ret = es.search(index=index, body=body)
    return ret


def put(uuid, json, doc, index=es_index, es=es):
    ret = es.create(index=index,
                    doc_type=doc,
                    id=uuid,
                    body=json)
    return ret


def delete_index(index=es_index, es=es):
    ret = es.indices.delete(index=index)
    return ret


def create_index(index=es_index, es=es, number_of_shards=1, number_of_replicas=0):
    settings = {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas
        },
        "mappings": {
            "player": {
                "properties": {
                    "PATH": {
                        "type": "geo_shape"
                        # "precision": "10m"
                    }
                }
            }
        }
    }
    ret = es.indices.create(index=index, body=settings)
    return ret


def random_elastic_linestring(nodes, mx, my, Mx, My):
    linestring = []
    for n in range(0, nodes):
        # randint is imported directly, so call it unqualified
        linestring.append([randint(mx, Mx), randint(my, My)])
    es_doc = {
        "type": "linestring",
        "coordinates": linestring
    }
    return es_doc


if __name__ == "__main__":
    try:
        delete_index()
    except Exception:
        print("It doesn't exist")
    create_index()
    for i in range(1, 1000):
        x = (random() - .5) * 1000
        y = (random() - .5) * 1000
        a = {'id': 'randomname' + str(i),
             'asset': 'randomname',
             'Zone': 'X' + str(math.floor(x)) + 'Y' + str(math.floor(y)),
             'Travel_Rate': randint(400, 2000),
             'Draw_Level': 1,
             'HOME': None,
             'PATH': random_elastic_linestring(randint(2, 10), -5, -5, 5, 5),
             'DEL': False,
             '_time': time.time()}
        put(a['id'], a, 'player')
```
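Independently of the mapping issue, the loop above issues one `es.create` call per document; batching through the bulk API removes the per-request overhead. A minimal sketch of converting the documents into actions for `elasticsearch.helpers.bulk` (the index/type names follow the script above; the live `helpers.bulk` call is shown only as a comment since it needs a running cluster):

```python
def to_bulk_actions(docs, index='assets', doc_type='player'):
    """Turn plain document dicts into actions for elasticsearch.helpers.bulk."""
    for doc in docs:
        yield {
            "_op_type": "create",
            "_index": index,
            "_type": doc_type,
            "_id": doc['id'],
            "_source": doc,
        }

# Against a live cluster this would replace the put() loop, e.g.:
#   from elasticsearch import helpers
#   helpers.bulk(es, to_bulk_actions(docs))
actions = list(to_bulk_actions([{'id': 'randomname1', 'DEL': False}]))
```

Bulk batching will not fix the per-shape tree-building cost discussed below, but it does remove one confounding variable when measuring it.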
Elastic 5.1.1
JAVA_HOME: JDK 1.8.0_112
JVM settings:

```
-Xms4g
-Xmx4g
-XX:+PrintGCDetails
```

data location: 25 GB RAM disk
It is apparent that for linestrings limited to a very small geographic area, indexing speed is acceptable even for a large number of points per linestring.
Building some random line strings with points limited to range:
```java
private double minLat = 10.00d;
private double maxLat = 10.001d;
private double minLon = 10.00d;
private double maxLon = 10.0001d;
```
scales fairly well to hundreds of points, with no apparent exponential growth. Allowing the points to come from anywhere on the globe breaks down quickly: growth becomes non-linear at around 10 points, even a few dozen points can take 30 seconds to index a single document, and attempting 1000 would take over an hour.
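To reproduce the constrained case in Python rather than Java, a sketch like this generates linestrings confined to the same tiny box (ranges copied from the snippet above; `uniform` gives fractional coordinates, and GeoJSON order is longitude first):

```python
from random import uniform

# Bounds copied from the Java snippet above (a box roughly 100m x 10m).
MIN_LAT, MAX_LAT = 10.00, 10.001
MIN_LON, MAX_LON = 10.00, 10.0001

def constrained_linestring(points):
    """Random linestring whose points all fall inside the tiny box above."""
    return {
        "type": "linestring",
        "coordinates": [
            [uniform(MIN_LON, MAX_LON), uniform(MIN_LAT, MAX_LAT)]
            for _ in range(points)
        ],
    }

shape = constrained_linestring(200)
```

Swapping the bounds for the whole globe (lon -180..180, lat -90..90) in the same generator is enough to trigger the slow path described above.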
(tested with 2.2.0 / 2.4.4)
@nknize - would you please follow up with what should be expected from line-string / multi-line-string indexing performance in regards to point count, point distance, and overall "complexity" of line shape.
I can also confirm that `geo_shape` indexing is very slow. I have just migrated code from ES 1.7.2 to 2.4.4 and, interestingly, performance seemed to be at least one order of magnitude faster in the old version. I was using the geohash tree type (precision 50m). Changing to quadtree in v2.4.4 helped speed things up a bit, but it's still significantly slower than in v1.7.2.
As soon as I remove the shape field from my mapping (with `dynamic` set to `false`), insert speed is back to its normal high level, so it's definitely the geo_shape. Any thoughts yet on possible optimization strategies?
Another observation - not sure if this is of any help to track this down: the majority of my "shapes" are in fact Points. (I still need to index them as geo_shapes, since my data contains a mix of points and polygons.) So in my case, we're definitely not talking about complex shapes at all.
~~Has anyone found a solution to this slow indexing of `geo_shape`?~~

~~I am using the ES 5.4.2 Docker container with the Java client and the bulk processor, indexing mostly text/number documents with some simple polygons. It is taking around 15+ minutes to index around 5000 docs.~~

~~Before adding the `geo_shape` mapping I was indexing 500,000 docs in ~2-3 minutes.~~
The slowdown appears to be caused by submitting obsolete/bad GeoJSON. The JTS library I am using was outputting GeoJSON with an incompatible CRS (Coordinate Reference System), namely EPSG:3857 (Pseudo-Mercator), and was including this block:
"crs": {
"type": "name",
"properties": {
"name": "EPSG:3857"
}
}
By making sure the GeoJSON was output using the CRS of EPSG:4326 (WGS 84), as per the GeoJSON spec (see the "note" in section 4), indexing speed has returned to normal.
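A small guard along these lines can catch such input before it reaches the index. Note this is only a check, not a fix: if the coordinates really are in EPSG:3857 (meters), they must be reprojected (e.g. with a library such as pyproj), not just have the `crs` block removed. The function name here is a placeholder:

```python
def assert_wgs84(geojson):
    """Reject GeoJSON that declares a CRS other than WGS 84.

    Elasticsearch assumes WGS 84 coordinates. A 'crs' member naming
    EPSG:3857 means the coordinates need reprojection before indexing,
    not merely deletion of the 'crs' block.
    """
    crs = geojson.get("crs")
    if crs is None:
        return geojson  # the GeoJSON default CRS is WGS 84
    name = crs.get("properties", {}).get("name", "")
    if name not in ("EPSG:4326", "urn:ogc:def:crs:OGC:1.3:CRS84"):
        raise ValueError("non-WGS84 CRS: %s" % name)
    return geojson

ok = assert_wgs84({"type": "Point", "coordinates": [10.0, 10.0]})
```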
Hopefully this helps somebody and I haven't just wasted everyone's time :)
EDIT: Indexing `geo_shape` is still much slower than indexing normal text documents, but not as slow as when submitting bad GeoJSON.
I can also confirm that indexing geo shapes is very slow. When we try to index a `geo_shape` linestring with 9k points, it takes around a minute to respond. Using Elasticsearch 2.4. Schema and example linestring attached.
This is likely related to an issue in spatial4j/JTS when it is attempting to determine if a line covers a rectangle, see possible PR to address: https://github.com/locationtech/spatial4j/pull/144
tl;dr: try setting `tree` to `quadtree` and `distance_error_pct` to `0.001`.
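Applied to a mapping, the suggested workaround would look something like this (the surrounding field/type names are up to the reader; only the three parameter values come from the tl;dr above):

```python
def tuned_geo_shape_field(tree="quadtree", distance_error_pct=0.001):
    # Field definition using the workaround settings suggested above;
    # drop this dict into the "properties" section of your mapping.
    return {
        "type": "geo_shape",
        "tree": tree,
        "distance_error_pct": distance_error_pct,
    }

field = tuned_geo_shape_field()
```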
In short, @schlosna is correct. There are a few things going on here that cause slow shape indexing. One is related to the number of terms (quad cells) and the number of vertices: the more terms, the more calls to `jts.relate` (which is slow), and the more vertices, the slower `jts.relate` becomes. Another is related to heap consumption in `createCellIteratorToIndex`, where intersecting cells are collected into an in-memory list. See my related comments in #25833 describing what's going on in Lucene in a little more detail and what you can do (for now) to help with these issues.
There are a few patches coming to correct this issue in the near term (fixing memory consumption and changing tree defaults). In the long term we are working on a new geo field type based on the BKD tree that avoids the rasterization approach altogether (with the added bonus of eliminating the JTS and s4j dependencies).
> In the long term we are working on a new geo field type based on the BKD tree that avoids the rasterization approach altogether (which has the added bonus of eliminating the jts and s4j dependencies).
+1. Would love to have an ETA on this, we are currently experiencing the same in a large scale project.
closing in favor of #25833 and #16749