Elasticsearch: Elasticsearch client node crashing with java.lang.StackOverflowError when running percentile calculations for long data sets

Created on 6 Feb 2017 · 10 comments · Source: elastic/elasticsearch

Elasticsearch version:
{
"number" : "5.2.0",
"build_hash" : "24e05b9",
"build_date" : "2017-01-24T19:52:35.800Z",
"build_snapshot" : false,
"lucene_version" : "6.4.0"
}

Plugins installed: [reindexing, x-pack]

JVM version:
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)

OS version:
Debian 8.5

Description of the problem including expected versus actual behavior:
The Elasticsearch client node used by Kibana to send requests to the cluster crashes with:

[2017-02-06T18:07:25,659][INFO ][o.e.m.j.JvmGcMonitorService] [dashboard] [gc][7] overhead, spent [346ms] collecting in the last [1s]
[2017-02-06T18:18:24,961][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [dashboard] fatal error in thread [elasticsearch[dashboard][search][T#28]], exiting
java.lang.StackOverflowError: null
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:77) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.compress(AVLTreeDigest.java:165) ~[t-digest-3.0.jar:?]
        at com.tdunning.math.stats.AVLTreeDigest.add(AVLTreeDigest.java:135) ~[t-digest-3.0.jar:?]

Steps to reproduce:

  1. Running the following query from Kibana causes the Elasticsearch client node to crash:

```
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.24","rum_custom-2017.01.25","rum_custom-2017.01.26","rum_custom-2017.01.27","rum_custom-2017.01.22","rum_custom-2017.01.23","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486404140629}
{"query":{"bool":{"must":[{"query_string":{"query":"v_entryType: \"mark\"","analyze_wildcard":true}},{"range":{"@timestamp":{"gte":1485108407472,"lte":1486404407472,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"12h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}}
```

  2. The same query for a shorter period of time runs just fine:

```
{"index":["rum_custom-2017.02.03","rum_custom-2017.02.04","rum_custom-2017.02.05","rum_custom-2017.02.06","rum_custom-2017.01.31","rum_custom-2017.02.01","rum_custom-2017.02.02","rum_custom-2017.01.30","rum_custom-2017.01.27","rum_custom-2017.01.28","rum_custom-2017.01.29"],"ignore_unavailable":true,"preference":1486407007120}
{"query":{"bool":{"must":[{"query_string":{"analyze_wildcard":true,"query":"v_entryType: \"mark\""}},{"range":{"@timestamp":{"gte":1485543020838,"lte":1486407020838,"format":"epoch_millis"}}}],"must_not":[]}},"size":0,"_source":{"excludes":[]},"aggs":{"2":{"date_histogram":{"field":"@timestamp","interval":"3h","time_zone":"Europe/Berlin","min_doc_count":1},"aggs":{"3":{"terms":{"field":"v_name","size":10,"order":{"_term":"desc"}},"aggs":{"1":{"percentiles":{"field":"v_startTime","percents":[50],"keyed":false}}}}}}}}
```

Provide logs (if relevant):

Labels: :AnalyticAggregations, >bug, help wanted

All 10 comments

Hi @xo4n, when you say "the same query for a shorter period of time", do you mean that you set a smaller interval, which produces fewer results per aggregation, or that you query fewer indices?

@colings86 any ideas?

@xo4n It looks like it's getting stuck in a compression event and not making any progress.

To explain this, let me describe a little about the T-Digest algorithm that's used for the percentiles aggregation. As the aggregation collects values from the matching documents, it creates a histogram of centroids that combine similar values (the method of creating these centroids is not important here, but if you are interested I would advise reading the t-digest paper). These centroids continue to be created as more documents are collected, and at some point a compression event is triggered to reduce the number of centroids back to a manageable number. The compression strategy is to re-insert all of the centroids in a random order, which, because of the way values are combined, should result in fewer centroids being created.

In your case it seems that a compression event has been triggered and, while re-inserting the centroids, it is recursively triggering another compression event, which in turn is triggering another compression event, and so on until it overflows the stack. This is something I have not seen before.

Unfortunately, this is probably going to be a little tricky to reproduce, since getting the same failure will require having the same data inserted into the algorithm, so please bear with me; I will try to get to the bottom of this for you.

The good news is that because you have time-based indices and your top-level aggregation is a date_histogram, we should be able to narrow this down to a single index and a single 12-hour period within that index. Could you please run the same query against the following indices, one index at a time (it's best to query Elasticsearch through something like Console for this rather than modifying your visualisation in Kibana)?

"rum_custom-2017.01.22"
"rum_custom-2017.01.23"
"rum_custom-2017.01.24"
"rum_custom-2017.01.25"
"rum_custom-2017.01.26"

Doing this should hopefully tell you which of these indices is causing the issue (my assumption is it is not the indices from 2017.01.27 to 2017.02.06 since the query across these indices worked).

Once you know which index is causing the issue, you will have reduced the problem to two 12-hour buckets. The next step will be to modify the range query in your search request to search from 00:00:00 to 12:00:00 on that day and run the request to see if the issue reproduces; if not, change the range to 12:00:00 on that day until 00:00:00 the following day.

At this point you should have narrowed the problem down to a 12-hour period in a single index. If you are happy to share the index with me privately (I can give you details if you let me know you want to go this route), I could then try to debug the problem. Is this something you would be willing to do? If so, could you let me know how big the index is once you have identified it using the steps above?
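
For reference, the single-index check described above could look roughly like this in Console. The index name, date range, and offset below are illustrative (taken from the index list and time zone above); everything else is copied from the failing request:

```
GET rum_custom-2017.01.22/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "query_string": { "query": "v_entryType: \"mark\"", "analyze_wildcard": true } },
        { "range": { "@timestamp": { "gte": "2017-01-22T00:00:00+01:00", "lte": "2017-01-22T12:00:00+01:00" } } }
      ]
    }
  },
  "aggs": {
    "2": {
      "date_histogram": { "field": "@timestamp", "interval": "12h", "time_zone": "Europe/Berlin", "min_doc_count": 1 },
      "aggs": {
        "3": {
          "terms": { "field": "v_name", "size": 10, "order": { "_term": "desc" } },
          "aggs": {
            "1": { "percentiles": { "field": "v_startTime", "percents": [50], "keyed": false } }
          }
        }
      }
    }
  }
}
```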

We figured out which data could be causing this

We were mapping some fields to half_float that actually contain bigger values than a half_float can handle. We are not sure what value Elasticsearch indexes in such cases, but when running percentile calculations on a field with several of these, the issue is reproducible.

Below is a data example. We are sending key/value fields as strings and mapping the numeric ones to half_float. Fields like "v_startTime" hold very high values in some cases, and those are causing the issue. When sorting on those fields, the documents causing trouble already show "sort": ["Infinity"] instead of their numeric value.

Table view (excerpt)

| Type | Field | Value |
| --- | --- | --- |
| # | v_duration | 5,862.045 |
| t | v_entryType | measure |
| t | v_name | gtm_time_to_first_tag_item |
| # | v_startTime | 75,810.745 |

JSON view

```
{
    "_index": "rum_custom-2017.05.23",
    "_type": "logs",
    "_id": "AVwy0ptHX7d39WTa_HhR",
    "_score": null,
    "_source": {
        "geoip": {
            "country_code2": "IN",
            "continent_code": "AS"
        },
        "ua_build": "83",
        "v_startTime": "75810.745",
        "ua_os_major": "4",
        "ua_os": "Android 4.4.4",
        "locale": "IN",
        "ua_patch": "3029",
        "ua_minor": "0",
        "@timestamp": "2017-05-23T01:01:21.000Z",
        "ua_name": "Chrome Mobile",
        "ua_os_minor": "4",
        "ua_major": "58",
        "ua_device": "HM NOTE 1S",
        "v_entryType": "measure",
        "user_agent": "Mozilla/5.0 (Linux; Android 4.4.4; HM NOTE 1S Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.83 Mobile Safari/537.36",
        "v_name": "gtm_time_to_first_tag_item",
        "v_duration": "5862.045",
        "ua_os_name": "Android"
    },
    "fields": {
        "@timestamp": [
            1495501281000
        ]
    },
    "sort": [
        "Infinity"
    ]
}
```

We could attach some more data if needed
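
For context, a minimal sketch of the kind of mapping change this points at: fields such as v_startTime were presumably mapped as half_float, whose largest finite value is 65504, so values like 75810.745 overflow to Infinity, while float keeps them finite. The index name is taken from the example above; the request itself is only an illustration, not the reporter's actual template:

```
PUT rum_custom-2017.05.24
{
  "mappings": {
    "logs": {
      "properties": {
        "v_startTime": { "type": "float" },
        "v_duration":  { "type": "float" }
      }
    }
  }
}
```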

@xo4n thanks for the update. This is interesting. My current working theory is that the problem is caused when you insert the same value into the t-digest algorithm a huge number of times (in your case I think it's inserting either HalfFloat.MAX_VALUE or HalfFloat.POSITIVE_INFINITY for each of the docs that have a value outside the half_float range). I'm going to try to reproduce this in a test case. I'll report back what I find.

@xo4n also, do you have the full stack trace for this? Could you paste it in a gist (not here, as it will be very long) and link it here please? Could you also let me know how many documents match the search request that fails (you can find this by running the query without the percentiles agg and looking at the hits.total field)? And do you know how many documents contain values for "v_startTime" which are outside of the half_float range?
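
For the last question, a rough way to count such documents is a range query at the top of the finite half_float range (65504). The index and field names below are taken from the example above, and the threshold is only approximate since the out-of-range values are already stored as Infinity; hits.total in the response then gives the document count:

```
GET rum_custom-2017.05.23/_search
{
  "size": 0,
  "query": {
    "range": {
      "v_startTime": { "gte": 65504 }
    }
  }
}
```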

@xo4n unfortunately I haven't yet had any luck in reproducing this. I'm wondering if you would be willing/able to privately share your index with me to assist me in reproducing this? If you are not able to share the full index maybe you could use the reindex API to create an index where the documents only contain the field v_startTime and share that index with me?
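
A sketch of that reindex step, assuming the problem index has been identified (the source index is taken from the example above and the destination index name is made up):

```
POST _reindex
{
  "source": {
    "index": "rum_custom-2017.05.23",
    "_source": ["v_startTime"]
  },
  "dest": {
    "index": "rum_custom-v_starttime-only"
  }
}
```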

@xo4n thanks for providing the index to me. I managed to reproduce the issue and it seems to be down to the out of range values which are represented as Infinity in the doc_values. We think that the right fix here is to stop out of range values at index time by rejecting the request so the infinite values don't end up in the index. I opened https://github.com/elastic/elasticsearch/issues/25534 for this

The root cause of this (out of range numerics held in the index) looks to have been addressed by https://github.com/elastic/elasticsearch/pull/25826
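
For illustration, with that change in place an indexing request like the one below (values taken from the earlier example document, against a half_float-mapped field) should now be rejected at index time rather than silently stored as Infinity; the exact error message is version-dependent and omitted here:

```
PUT rum_custom-2017.05.23/logs/AVwy0ptHX7d39WTa_HhR
{
  "v_startTime": 75810.745,
  "v_entryType": "measure",
  "v_name": "gtm_time_to_first_tag_item"
}
```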

Closing, but feel free to reopen if the issue remains.

