I'm seeing very slow indexing with ElasticSearch when I set up a geo-shape type.
I'm using a stock 5.0.0 on Windows 7, no customisations, and I'm accessing it via the Python (3.5) wrapper. One shard.
I have 75 documents to index and if I don't set up a mapping, they index near instantly.
But the second I include a geo-shape mapping, the processing time rockets, as does CPU usage and inevitably it times out on the index creation, even with the timeout increased to 30s.
My mapping for the geo-shape looks like this:
"bbox": {
"type": "geo_shape",
"precision": "100m"
},
The rest of the mapping is just `type: text`. `_all` is disabled.
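For context, a minimal sketch of what the full index-creation body might look like around that fragment. The type name (`doc`) and the text field (`title`) are placeholders, not taken from the original report:

```python
# Hypothetical reconstruction of the reporter's mapping body.
# "doc" and "title" are placeholder names; only the "bbox" geo_shape
# field and the disabled _all are from the report itself.
def make_mapping_body(precision="100m"):
    return {
        "mappings": {
            "doc": {
                "_all": {"enabled": False},   # _all disabled, as stated above
                "properties": {
                    "title": {"type": "text"},
                    "bbox": {
                        "type": "geo_shape",
                        "precision": precision,
                    },
                },
            }
        }
    }

body = make_mapping_body()
```

This body would then be passed to `es.indices.create(index=..., body=body)`.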
Each document has up to two geo-shapes. The shapes are simple envelope types as they're only bounding boxes - spatially this should be the simplest indexing there is for a polygon.
The problem is exaggerated by the precision, but even with a precision of 1000m, when I send in larger quantities of documents, it still flakes out after only a few hundred have been indexed.
Changing the tree type to quadtree is several times faster than the geohash, but it still times out after using considerably more resources.
I've seen the section on "performance considerations" in the docs (https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-shape.html) - but this is far slower than I would expect any spatial indexing to be.
In a conventional spatial RDBMS (PostGIS/Oracle) I'm used to being able to insert hundreds if not thousands of complex spatial features every second.
ElasticSearch is failing to handle even a few dozen basic bounding-box geometries per second.
I appreciate I haven't optimised my deployment (still in dev), but an unoptimised RDBMS is orders of magnitude faster at the same operation. I think there's considerable scope for optimisation here, especially compared to ES's non-spatial indexing.
@nknize could you provide some wisdom please
I'm experiencing something similar. I've created a brutish python script for showcasing this issue.
The following simple Python script creates a `geo_shape` mapping and adds data. It indexes roughly 5 linestrings per second (about 2 minutes per 1000), while triggering GC allocation failures for the entire duration of the load. On my current single-node system I would expect a minimum of 1000 geo_shapes per second.
```python
from random import random, randint
import math
import time

try:
    from elasticsearch import Elasticsearch
    es = Elasticsearch()
except ImportError:
    quit()

es_index = 'assets'


def get(query, index=es_index, es=es):
    body = {
        "query": query,
        "size": 1000
    }
    ret = es.search(index=index, body=body)
    return ret


def put(uuid, json, doc, index=es_index, es=es):
    ret = es.create(index=index,
                    doc_type=doc,
                    id=uuid,
                    body=json)
    return ret


def delete_index(index=es_index, es=es):
    ret = es.indices.delete(index=index)
    return ret


def create_index(index=es_index, es=es, number_of_shards=1, number_of_replicas=0):
    settings = {
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas
        },
        "mappings": {
            "player": {
                "properties": {
                    "PATH": {
                        "type": "geo_shape"
                        # "precision": "10m"
                    }
                }
            }
        }
    }
    ret = es.indices.create(index=index, body=settings)
    return ret


def random_elastic_linestring(nodes, mx, my, Mx, My):
    linestring = []
    for n in range(0, nodes):
        # randint is imported directly, so call it unqualified
        linestring.append([randint(mx, Mx), randint(my, My)])
    es_doc = {
        "type": "linestring",
        "coordinates": linestring
    }
    return es_doc


if __name__ == "__main__":
    try:
        delete_index()
    except Exception:
        print("It doesn't exist")
    create_index()
    for i in range(1, 1000):
        x = (random() - .5) * 1000
        y = (random() - .5) * 1000
        a = {'id': 'randomname' + str(i),
             'asset': 'randomname',
             'Zone': 'X' + str(math.floor(x)) + 'Y' + str(math.floor(y)),
             'Travel_Rate': randint(400, 2000),
             'Draw_Level': 1,
             'HOME': None,
             'PATH': random_elastic_linestring(randint(2, 10), -5, -5, 5, 5),
             'DEL': False,
             '_time': time.time()}
        put(a['id'], a, 'player')
```
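Independently of the mapping issue, the loop above issues one `es.create` call per document; batching through the bulk API removes the per-request overhead. A minimal sketch of converting the documents into actions for `elasticsearch.helpers.bulk` (the index/type names follow the script above; the live `helpers.bulk` call is shown only as a comment since it needs a running cluster):

```python
def to_bulk_actions(docs, index='assets', doc_type='player'):
    """Turn plain document dicts into actions for elasticsearch.helpers.bulk."""
    for doc in docs:
        yield {
            "_op_type": "create",
            "_index": index,
            "_type": doc_type,
            "_id": doc['id'],
            "_source": doc,
        }

# Against a live cluster this would replace the put() loop, e.g.:
#   from elasticsearch import helpers
#   helpers.bulk(es, to_bulk_actions(docs))
actions = list(to_bulk_actions([{'id': 'randomname1', 'DEL': False}]))
```

Bulk batching will not fix the per-shape tree-building cost discussed below, but it does remove one confounding variable when measuring it.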
Elastic 5.1.1
JAVA_HOME: JDK 1.8.0_112
JVM settings:

```
-Xms4g
-Xmx4g
-XX:+PrintGCDetails
```

data location: 25 GB RAM disk
It is apparent that for linestrings limited to a very small geographic area, indexing speed is acceptable even for a large number of points per linestring.
Building some random line strings with points limited to range:
```java
private double minLat = 10.00d;
private double maxLat = 10.001d;
private double minLon = 10.00d;
private double maxLon = 10.0001d;
```
scales fairly well to hundreds of points, with no apparent exponential growth. Allowing the points to come from anywhere on the globe breaks down quickly: growth becomes non-linear at around 10 points, even a few dozen points can take 30 seconds to index a single document, and attempting 1000 would take over an hour.
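To reproduce the constrained case in Python rather than Java, a sketch like this generates linestrings confined to the same tiny box (ranges copied from the snippet above; `uniform` gives fractional coordinates, and GeoJSON order is longitude first):

```python
from random import uniform

# Bounds copied from the Java snippet above (a box roughly 100m x 10m).
MIN_LAT, MAX_LAT = 10.00, 10.001
MIN_LON, MAX_LON = 10.00, 10.0001

def constrained_linestring(points):
    """Random linestring whose points all fall inside the tiny box above."""
    return {
        "type": "linestring",
        "coordinates": [
            [uniform(MIN_LON, MAX_LON), uniform(MIN_LAT, MAX_LAT)]
            for _ in range(points)
        ],
    }

shape = constrained_linestring(200)
```

Swapping the bounds for the whole globe (lon -180..180, lat -90..90) in the same generator is enough to trigger the slow path described above.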
(tested with 2.2.0 / 2.4.4)
@nknize - would you please follow up with what should be expected from line-string / multi-line-string indexing performance in regards to point count, point distance, and overall "complexity" of line shape.
I can also confirm that `geo_shape` indexing is very slow. I have just migrated code from ES 1.7.2 to 2.4.4 and, interestingly, performance seemed to be at least one order of magnitude faster in the old version. I was using the geohash tree type (precision 50m). Changing to quadtree in v2.4.4 helped speed things up a bit, but it's still significantly slower than in v1.7.2.
As soon as I remove the shape field from my mapping (with `dynamic` set to `false`), insert speed is back to its normal high level, so it's definitely the geo_shape. Any thoughts yet on possible optimization strategies?
Another observation - not sure if this is of any help to track this down: the majority of my "shapes" are in fact Points. (I still need to index them as geo_shapes, since my data contains a mix of points and polygons.) So in my case, we're definitely not talking about complex shapes at all.
~~Has anyone found a solution to this slow indexing of `geo_shape`?~~

~~I am using the ES 5.4.2 Docker container with the Java client and the bulk processor, indexing mostly text/number documents with some simple polygons. It is taking around 15+ minutes to index around 5000 docs.~~

~~Before adding the `geo_shape` mapping I was indexing 500,000 docs in ~2-3 minutes.~~
The slowdown appears to be caused by submitting obsolete/bad GeoJSON. The JTS library I am using was outputting GeoJSON with an incompatible CRS (Coordinate Reference System), namely EPSG:3857 (Pseudo-Mercator), and was including this block:
"crs": {
"type": "name",
"properties": {
"name": "EPSG:3857"
}
}
By making sure the GeoJSON was output using the CRS of EPSG:4326 (WGS 84), as per the GeoJSON spec (see the "note" in section 4), indexing speed has returned to normal.
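A small guard along these lines can catch such input before it reaches the index. Note this is only a check, not a fix: if the coordinates really are in EPSG:3857 (meters), they must be reprojected (e.g. with a library such as pyproj), not just have the `crs` block removed. The function name here is a placeholder:

```python
def assert_wgs84(geojson):
    """Reject GeoJSON that declares a CRS other than WGS 84.

    Elasticsearch assumes WGS 84 coordinates. A 'crs' member naming
    EPSG:3857 means the coordinates need reprojection before indexing,
    not merely deletion of the 'crs' block.
    """
    crs = geojson.get("crs")
    if crs is None:
        return geojson  # the GeoJSON default CRS is WGS 84
    name = crs.get("properties", {}).get("name", "")
    if name not in ("EPSG:4326", "urn:ogc:def:crs:OGC:1.3:CRS84"):
        raise ValueError("non-WGS84 CRS: %s" % name)
    return geojson

ok = assert_wgs84({"type": "Point", "coordinates": [10.0, 10.0]})
```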
Hopefully this helps somebody and I haven't just wasted everyone's time :)
EDIT: Indexing `geo_shape` is still much slower than indexing normal text documents, but not as slow as when submitting bad GeoJSON.
I can also confirm that indexing geo shapes is very slow. When we try to index a `geo_shape` linestring with 9k points, it takes around a minute to respond. Using Elasticsearch 2.4. Schema and example linestring attached.
This is likely related to an issue in spatial4j/JTS when it is attempting to determine if a line covers a rectangle, see possible PR to address: https://github.com/locationtech/spatial4j/pull/144
tl;dr: try setting `tree` to `quadtree` and `distance_error_pct` to `0.001`.
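Applied to a mapping, the suggested workaround would look something like this (the surrounding field/type names are up to the reader; only the three parameter values come from the tl;dr above):

```python
def tuned_geo_shape_field(tree="quadtree", distance_error_pct=0.001):
    # Field definition using the workaround settings suggested above;
    # drop this dict into the "properties" section of your mapping.
    return {
        "type": "geo_shape",
        "tree": tree,
        "distance_error_pct": distance_error_pct,
    }

field = tuned_geo_shape_field()
```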
In short, @schlosna is correct. There are a few things going on here that cause slow shape indexing. One is related to the number of terms (quad cells) and the number of vertices: the more terms, the more calls to `jts.relate` (which is slow), and the more vertices, the slower `jts.relate` becomes. Another is related to heap consumption in `createCellIteratorToIndex`, where intersecting cells are collected into an in-memory list. See my related comments in #25833 describing what's going on in Lucene in a little more detail and what you can do (for now) to help with these issues.
There are a few patches coming to correct this issue in the near term (fixing memory consumption and changing tree defaults). In the long term we are working on a new geo field type based on the BKD tree that avoids the rasterization approach altogether (with the added bonus of eliminating the JTS and s4j dependencies).
> In the long term we are working on a new geo field type based on the BKD tree that avoids the rasterization approach altogether (which has the added bonus of eliminating the jts and s4j dependencies).
+1. Would love to have an ETA on this, we are currently experiencing the same in a large scale project.
closing in favor of #25833 and #16749