Elasticsearch: Rare ArrayIndexOutOfBoundsException on all shards with large queries

Created on 17 Mar 2017 · 9 comments · Source: elastic/elasticsearch

Elasticsearch version: 5.2.2

Plugins installed:

  • EC2 Discovery
  • SearchGuard 5.2.2-11

JVM version: openjdk 64bit 1.8.0_91

OS version: Ubuntu 16.04

Description of the problem including expected versus actual behavior:
In roughly 1% of large queries that span multiple types, including nested ones, we get a 500 response caused by an org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed, which in turn is caused by a java.lang.ArrayIndexOutOfBoundsException.

We are running a small ES cluster of 5 nodes (no dedicated master nodes) on AWS, spread over two availability zones (2 instances in one, 3 in the other). Each instance runs ES in a Docker container with external volumes.

Nodes are configured to use EC2 AZ-aware shard allocation (cluster.routing.allocation.awareness.attributes: aws_availability_zone).
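
For reference, a sketch of the corresponding elasticsearch.yml settings; the node.attr line is only an illustration of the attribute each node ends up with, in our cluster it is populated by the EC2 setup rather than hard-coded per node:

cluster.routing.allocation.awareness.attributes: aws_availability_zone
node.attr.aws_availability_zone: eu-west-1a   # example value; differs per node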

First we thought it might be a corrupted index. However, building a new index (not using the reindex API but plain bulk indexing) did not help. At the same time we switched to custom routing with the new index, and the error has occurred more often since then.

So far the error frequency clearly seems to be traffic-dependent: it occurs more often during peak hours. It also tends to happen in batches; sometimes there are no errors for hours and then suddenly a couple appear within a short timespan:
[screenshot: error occurrences over time, captured 2017-03-17 10:49, showing the errors arriving in bursts]

Any idea what's going on within these minutes?

Example trace:

[2017-03-17T08:21:10,125][DEBUG][o.e.a.s.TransportSearchAction] [elasticsearch5-petrel-0ff7463123456e6e7] [<%INDEXNAME%>][4], node[Tci4YB12345dEhZebFINRQ], [R], s[STARTED], a[id=30tD94F3S_mq7uwk12341w]: Failed to execute [SearchRequest{searchType=QUERY_AND_FETCH, indices=[<%ALIAS INDEXNAME%>], indicesOptions=IndicesOptions[id=38, ignore_unavailable=false, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_alisases_to_multiple_indices=true, forbid_closed_indices=true], types=[<%6 DIFFERENT TYPES%>], routing='<OUR CUSTOM ROUTING ID>', preference='null', requestCache=null, scroll=null, source={
  "from" : 0,
  "size" : 50,
  "query" : {
    <%VERY LARGE QUERY INCLUDING CONDITIONS ON NESTED TYPES%>
  },
  "stored_fields" : [
    "_id",
    "_type"
  ]
}}]
org.elasticsearch.transport.RemoteTransportException: [elasticsearch5-penguin-046876f8756512345][123.12.0.2:9300][indices:data/read/search[phase/query+fetch]]
Caused by: java.lang.ArrayIndexOutOfBoundsException
[2017-03-17T08:21:10,127][DEBUG][o.e.a.s.TransportSearchAction] [elasticsearch5-petrel-0ff7463123456e6e7] All shards failed for phase: [query_fetch]
org.elasticsearch.transport.RemoteTransportException: [elasticsearch5-penguin-046876f8756512345][123.12.0.2:9300][indices:data/read/search[phase/query+fetch]]
Caused by: java.lang.ArrayIndexOutOfBoundsException
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.onFirstPhaseResult(AbstractSearchAsyncAction.java:208) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:52) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:143) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:51) ~[elasticsearch-5.2.2.jar:5.2.2]
    at com.floragunn.searchguard.transport.SearchGuardInterceptor$RestoringTransportResponseHandler.handleException(SearchGuardInterceptor.java:153) ~[?:?]
    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1024) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport.lambda$handleException$17(TcpTransport.java:1411) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:109) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1409) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1401) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1345) [elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74) [transport-netty4-client-5.2.2.jar:5.2.2]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1139) [netty-handler-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:950) [netty-handler-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:411) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248) [netty-codec-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:341) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:349) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:129) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:642) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:527) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:481) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:441) [netty-transport-4.1.7.Final.jar:4.1.7.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.7.Final.jar:4.1.7.Final]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch5-penguin-046876f8756512345][123.12.0.2:9300][indices:data/read/search[phase/query+fetch]]
Caused by: java.lang.ArrayIndexOutOfBoundsException
Labels: :Search/Search, >bug

All 9 comments

We need the stack trace for this ArrayIndexOutOfBoundsException. It is missing here because the JVM optimizes away stack traces for exceptions that are thrown repeatedly:

"The compiler in the server VM now provides correct stack backtraces for all "cold" built-in exceptions. For performance purposes, when such an exception is thrown a few times, the method may be recompiled. After recompilation, the compiler may choose a faster tactic using preallocated exceptions that do not provide a stack trace. To disable completely the use of preallocated exceptions, use this new flag: -XX:-OmitStackTraceInFastThrow."

Can you grep in your logs for the first occurrence of this exception?
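
If the earlier occurrences have already rotated out of your logs, you can also disable the fast-throw optimization so that future failures keep their full stack traces. A sketch, assuming the standard 5.x config/jvm.options file (for a Docker deployment the flag can also be passed via ES_JAVA_OPTS):

# config/jvm.options -- keep full stack traces even for frequently thrown exceptions
-XX:-OmitStackTraceInFastThrow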

[2017-03-18T15:23:24,661][DEBUG][o.e.a.s.TransportSearchAction] [elasticsearch5-flamingo-0640e123456614532] All shards failed for phase: [query_fetch]
org.elasticsearch.transport.RemoteTransportException: [elasticsearch5-gull-065b812345909563f][123.12.0.2:9300][indices:data/read/search[phase/query+fetch]]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 33439509
    at org.apache.lucene.util.FixedBitSet.get(FixedBitSet.java:186) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
    at org.elasticsearch.search.fetch.FetchPhase.findRootDocumentIfNested(FetchPhase.java:177) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:370) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:322) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:319) ~[elasticsearch-5.2.2.jar:5.2.2]
    at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceivedDecorate(SearchGuardSSLRequestHandler.java:184) ~[?:?]
    at com.floragunn.searchguard.transport.SearchGuardRequestHandler.messageReceivedDecorate(SearchGuardRequestHandler.java:171) ~[?:?]
    at com.floragunn.searchguard.ssl.transport.SearchGuardSSLRequestHandler.messageReceived(SearchGuardSSLRequestHandler.java:139) ~[?:?]
    at com.floragunn.searchguard.SearchGuardPlugin$2$1.messageReceived(SearchGuardPlugin.java:284) ~[?:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.2.jar:5.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_91]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_91]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

@jpountz @jimczi thoughts?

The docID that is checked against the parent bitset should always be smaller than maxDoc, but in your case there seems to be a discrepancy between the reader used for the search and the one used to fetch the documents. Since this request runs the query and the fetch phase in a single pass, that should never happen unless a third-party plugin messes with the reader.
I had a quick look at what SearchGuard does, and it seems that the searcher/reader is wrapped by this plugin in order to add a security layer on top of ES search. This is an expert feature, and I suspect that the wrapper breaks some assumptions in Lucene/ES.
Since the SearchGuard plugin adds a search layer on top of ES, I'd advise first checking whether the problem is related to SearchGuard (which I suspect) and opening an issue there, or trying to reproduce the problem without SearchGuard.
Either way, I'll close this issue for now since there is no proof that ES is responsible for this. Please reopen if you find evidence that SearchGuard is not the cause.

Sorry guys, I just restarted our cluster without the SearchGuard plugin and the problem is still occurring:

[2017-03-27T03:52:56,546][DEBUG][o.e.a.s.TransportSearchAction] [elasticsearch5-petrel-0ff7463c013123456] All shards failed for phase: [query_fetch]
org.elasticsearch.transport.RemoteTransportException: [elasticsearch5-flamingo-0640eca9d37123456][123.12.0.2:9300][indices:data/read/search[phase/query+fetch]]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 33423460
    at org.apache.lucene.util.FixedBitSet.get(FixedBitSet.java:186) ~[lucene-core-6.4.1.jar:6.4.1 72f75b2503fa0aa4f0aff76d439874feb923bb0e - jpountz - 2017-02-01 14:43:32]
    at org.elasticsearch.search.fetch.FetchPhase.findRootDocumentIfNested(FetchPhase.java:177) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.search.fetch.FetchPhase.execute(FetchPhase.java:150) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:370) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:322) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.action.search.SearchTransportService$9.messageReceived(SearchTransportService.java:319) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1488) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.2.jar:5.2.2]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.2.jar:5.2.2]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_91]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_91]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

Fair enough @ojundt, reopen it is.

Are you able to access the problematic queries? It would help if you could share a reproducible case with the complete query that throws this exception.
Otherwise, can you share the mapping of your index and a complete example of your query:

"query" : {
<%VERY LARGE QUERY INCLUDING CONDITIONS ON NESTED TYPES%>
},

@jimczi, thanks for reopening. I'm able to access the failing queries but I'm afraid I'm not allowed to share them here. Mappings maybe. Is there something specific you are looking for or that I can test for you?

As far as I can see, the same query succeeds a few minutes later, so it's hard to reproduce. I will have a closer look at the end of the week.

Is there something specific you are looking for or that I can test for you?

It can be anything, so we'll need to narrow the scope. Can you share the query tree of the failing query, i.e. just the query types without their content:

"query": { "bool": { "must": { "term": { // private infos } } }

I suspect that one of your inner queries returns a docID greater than maxDoc. This should never happen, so seeing the query tree would help narrow down the list of candidates to look at.

Alright, after some confusion and further drilling down I was able to consistently reproduce the error with the following setup:

Query (in query.json):

{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": {
              "lang": "painless",
              "inline": "-1 / doc['value'].value"
            }
          }
        }
      ]
    }
  }
}

Setup:

curl -XPUT 'localhost:9205/twitter/tweet/1?pretty' -H 'Content-Type: application/json' -d'{ "value" : 0.0 }'
curl -XPOST 'localhost:9205/twitter/_search?pretty' -H 'Content-Type: application/json' -d @query.json

Output:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 4,
    "failed" : 1,
    "failures" : [
      {
        "shard" : 3,
        "index" : "twitter",
        "node" : "9D1hkKznR867YomO_6WtgA",
        "reason" : {
          "type" : "index_out_of_bounds_exception",
          "reason" : "docID must be >= 0 and < maxDoc=1 (got docID=2147483647)"
        }
      }
    ]
  },
  "hits" : {
    "total" : 1,
    "max_score" : null,
    "hits" : [ ]
  }
}

Now you may argue that dividing by zero is a bad idea anyway. Agreed, but I'd expect a very different error message. Also, if you replace -1 / doc['value'].value with 1 / doc['value'].value it suddenly works:

{
  "took" : 10,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 3.4028235E38,
    "hits" : [
      {
        "_index" : "twitter",
        "_type" : "invoice",
        "_id" : "1",
        "_score" : 3.4028235E38,
        "_source" : {
          "value" : 0.0
        }
      }
    ]
  }
}
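
For context, a minimal Java sketch of the arithmetic the two scripts end up performing (Painless arithmetic on doubles behaves like Java's; doc['value'].value is the indexed 0.0). Note that the 3.4028235E38 in the working response is Float.MAX_VALUE and the 2147483647 in the failure reason above is Integer.MAX_VALUE:

// Illustration of the sign flip between the failing and the working script.
public class ScoreSketch {
    public static void main(String[] args) {
        double value = 0.0;                     // the indexed field value from the repro
        System.out.println(-1 / value);         // -Infinity -> the script that triggers the failure
        System.out.println(1 / value);          // Infinity  -> the script that returns the hit
        System.out.println(Float.MAX_VALUE);    // 3.4028235E38 -- max_score in the working response
        System.out.println(Integer.MAX_VALUE);  // 2147483647   -- the docID in the failure reason above
    }
}

So the only difference between the failing and the working query is whether the script score ends up as -Infinity or +Infinity.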

Hope this gives you guys a clear test case :)
