Graylog2-server: Search Fails with org.elasticsearch.action.search.SearchPhaseExecutionException

Created on 1 Oct 2016  路  17Comments  路  Source: Graylog2/graylog2-server

Search will fail with Error: org.elasticsearch.action.search.SearchPhaseExecutionException if I try to view ANYTHING past page 66. (Search time range can be anything that generates 66 pages or more) No search terms were used.

graylog-server/server.log shows these types of errors:

[SearchResource] Unable to execute search: all shards failed.
[transport] [blah blah blah] Transport response handler not found of id [blah]

No errors found in elasticsearch logs.

Web Interface of Graylog shows Zero problems with Index, or node.

Found a suggestion from issue regarding a much older version of graylog, cycle and delete old index. Once I have enough log messages built up, i will test to see if the problem is reproducible.

  • Graylog Version: 2.1.1
  • Elasticsearch Version: 2.4.0
  • MongoDB Version: 2.4.9-1ubuntu2
    Above software is FRESHLY Installed, and only in use for less than 2 days.
  • Operating System: Ubuntu 14.04.5 LTS
  • Browser version: Chrome/OSX
documentation known issue

Most helpful comment

I ran into the same problem, and it is due to the ElasticSearch index.max_result_window setting.

I could get to the real cause of the exception by setting a breakpoint in our search resource:

Result window is too large, from + size must be less than or equal to: [10000] but was [n]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter

Changing the max_result_window setting in the index (as described in this StackOverflow answer) fixed the issue for me, so I guess we could add some way of customizing this setting in new indices. We should also ensure the actual cause of the exception is logged in Graylog, as it doesn't show up in the Graylog or ES logs.

We could also try to set a proper default setting, but I guess it will depend on use cases. We also need to check how much an increase in the setting affects to the cluster performance.

By the way, searching in multiple indices where only one has an increased max_result_window may return some inconsistent results. In that case ElasticSearch may return a successful response with an empty list of results, even if one of the indices was affected by the max_result_window limit.

All 17 comments

After waiting for enough log entries, I tried again, and the error still occurs on page 67 and higher.

Downgraded Elasticsearch to 2.3.5... error still persists.

@mr6volt Could you please past the entire exception found in the graylog log file?

There could be a few sources, but most likely Elasticsearch does not have enough memory available. To retrieve later pages it needs to go through the entire prior set of lines, which can be resource intensive, depending on their nature.

That was actually one of my thoughts as well, so i increased the heap size for BOTH elasticsearch and graylog.

Here is the relevant config info:
/etc/default/elasticsearch : ES_HEAP_SIZE=2g (also tried 4g)
/etc/default/graylog-server: changed GRAYLOG_SERVER_JAVA_OPTS to -Xms2g -Xmx2g....

And the entries for the failed search:

graylog-server/server.log:
WARN [SearchResource] Unable to execute search: all shards failed

Nothing in the elasticsearch log.

We'll continue to try reproduce this, but so far we don't have enough information to debug this.

Additional info:

I can export to CSV with no problem.

Is it possible that the problem is with parsing the URL args?

Reduced Elasticsearch shard count from 4 to 1, cycled indexes, and waited for index to fill up...

Verified that I was looking at results from ONLY the new index...
Same error after page 66.

NEW ERROR IN THE LOG!

org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onFirstPhaseResult(AbstractSearchAsyncAction.java:206) ~[graylog.jar:?]
at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:152) ~[graylog.jar:?]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:46) ~[graylog.jar:?]
at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:212) ~[graylog.jar:?]
at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:202) ~[graylog.jar:?]
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:136) ~[graylog.jar:?]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) ~[graylog.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[graylog.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791) ~[graylog.jar:?]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296) ~[graylog.jar:?]
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462) ~[graylog.jar:?]
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443) ~[graylog.jar:?]
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303) ~[graylog.jar:?]
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70) ~[graylog.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564) ~[graylog.jar:?]
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559) ~[graylog.jar:?]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268) ~[graylog.jar:?]
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255) ~[graylog.jar:?]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88) ~[graylog.jar:?]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) ~[graylog.jar:?]
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337) ~[graylog.jar:?]
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) ~[graylog.jar:?]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) ~[graylog.jar:?]
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) ~[graylog.jar:?]
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) ~[graylog.jar:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_91]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_91]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_91]

I've been continuing my investigation, and I noticed some console errors that ONLY appear when trying to access page 67 or higher.

client.js error: http://myurl.com:9000/api/search/universal/relative?query=%2A&range=172800&limit=150&offset=9900&sort=timestamp%3Adesc - (400). bad request

FetchProvider.js error: http://myurl.com:9000/api/search/universal/relative?query=%2A&range=172800&limit=150&offset=9900&sort=timestamp%3Adesc - (400). Additional information: Unable to execute search

Actual response: {"type":"ApiError","message":"Missing required parameter fields"}

It looks like the URL's are getting truncated. Thoughts?

I've ran into this issue as well using the API to retrieve yesterdays messages (about 900.000 in total). The offset get raised by 500 each run which works fine until my offset is 10.000 after which the API returns:

url: /search/universal/absolute?from=2016-10-13+00%3A00%3A00&query=_exists_%3Asite+AND+_exists_%3Abytes&to=2016-10-14+00%3A00%3A00&offset=10000&fields=site%2Cbytes&limit=500

returned value:

{"query":"_exists_:site AND _exists_:bytes","begin_column":null,"begin_line":null,"end_column":null,"end_line":null,"message":"Unable to execute search","exception_name":"org.elasticsearch.action.search.SearchPhaseExecutionException"}

Latest error message with Graylog 2.1.2

Error Message:
Unable to fulfill search request: cannot GET http://myurl.com:9000/api/search/universal/relative?query=%2A&range=28800&limit=150&offset=9900&sort=timestamp%3Adesc (500)
Search status code:
500

I ran into the same problem, and it is due to the ElasticSearch index.max_result_window setting.

I could get to the real cause of the exception by setting a breakpoint in our search resource:

Result window is too large, from + size must be less than or equal to: [10000] but was [n]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter

Changing the max_result_window setting in the index (as described in this StackOverflow answer) fixed the issue for me, so I guess we could add some way of customizing this setting in new indices. We should also ensure the actual cause of the exception is logged in Graylog, as it doesn't show up in the Graylog or ES logs.

We could also try to set a proper default setting, but I guess it will depend on use cases. We also need to check how much an increase in the setting affects to the cluster performance.

By the way, searching in multiple indices where only one has an increased max_result_window may return some inconsistent results. In that case ElasticSearch may return a successful response with an empty list of results, even if one of the indices was affected by the max_result_window limit.

The search functionality of EL (which Graylog obviously is utilizing) isn't intended for reading for lots of data if I recall correctly. To process more then 10.000 rows (or however your EL instance does allow) you're supposed to use the scroll functionality.

I've resorted to accessing the EL instance directly consuming the data with this feature and it work's as expected.

@hkraal reading this notice I don't think using scroll is what we want to do for searches:

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

@edmundoa it depends on what you're doing with searches of course. I used it for processing all my data during which acumulated over a period of a month. I suppose it should be two different endpoints in the API, each for their own purpose. Using a transparent method (switch from real time to scrolling when the offset > 10k) would probably give pains down the road.

Scrolling does not work in this case because part of what makes it different (and suitable for iterating over large result sets) is that you cannot sort the result. This is what happens when you export the search as csv, the data is in the insert sort order, by shard, which is definitely not what users would expect in the search result UI (many already complain about this behavior in the csv export).

I'm afraid elasticsearch is simply not geared towards deep pagination and the only workaround is to either export it (uses scroll) and post process the data, or use the setting mentioned above.

Only remains open until the docs are updated to 2.2 (or the change is merged into 2.1).

Was this page helpful?
0 / 5 - 0 ratings