Elasticsearch: SearchContextMissingException: No search context found for id [xx]

Created on 17 Feb 2015  ·  43 Comments  ·  Source: elastic/elasticsearch

I'm using Elasticsearch 1.3.4. Occasionally I see the following DEBUG stack trace (a few of them are logged sequentially) in the log file. No search requests are failing around the same time, though, and there was nothing wrong with the cluster when this exception was logged. Is this exception harmless? If not, what are the repercussions? I did some digging around this exception on the internet and in the Elasticsearch issues list, but most of those reports were related to scan and scroll; we don't use scan and scroll search requests. Let me know if you need more data.

DEBUG!!default!![2015-02-16 19:41:54,432][DEBUG][action.search.type       ] [<Query Node Name>] [27419] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [<Data Node Name>][inet[/xx.x.x.xxx:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [27419]
    at org.elasticsearch.search.SearchService.findContext(SearchService.java:481)
    at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:451)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:793)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:782)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

Most helpful comment

I am using version 2.4.1, and this issue also appears there.

All 43 comments

+1, started getting this on version 1.4.4. There are no requests in my app logs, so I can't even figure out what's got ES spinning and causing these errors, but I generated about 10,000 lines of logs with this same error.

+1, same here: getting this error on 1.4.4.

When you are using scan/scroll, are those requests taking a long time to finish, i.e. are you keeping the scroll request open for a long time before going back to get the next batch? Can you tell me a bit more about the usage pattern?

In my case the strange thing is that I did not use scan or scroll at all.
ES is running in Docker containers on two AWS t2.small machines; it is a dev/test environment.

This is causing problems for us. The scroll_id is being recognised by the master search node, but, intermittently, all the data nodes seems to lose the context for the scroll. Here is an example response, with status 500: {"_scroll_id":"c2NhbjswOzE7dG90YWxfaGl0czoxODs=","took":15,"timed_out":false,"_shards":{"total":20,"successful":0,"failed":20,"failures":[{"status":500,"reason":"RemoteTransportException[[es-trk-data-52.otherlevels.com][inet[/10.107.13.237:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [614221475]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-69.otherlevels.com][inet[/10.107.13.212:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [735000705]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-72.otherlevels.com][inet[/10.107.13.208:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [682782622]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-72.otherlevels.com][inet[/10.107.13.208:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [682782623]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-70.otherlevels.com][inet[/10.107.13.214:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [654543614]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-53.otherlevels.com][inet[/10.107.13.241:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [683625239]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-51.otherlevels.com][inet[/10.107.13.239:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [645331945]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-66.otherlevels.com][inet[/10.107.13.206:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [729378224]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-67.otherlevels.com][inet[/10.107.13.218:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [529068847]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-52.otherlevels.com][inet[/10.107.13.237:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [614221476]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-53.otherlevels.com][inet[/10.107.13.241:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [683625238]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-65.otherlevels.com][inet[/10.107.13.204:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [716845006]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-54.otherlevels.com][inet[/10.107.13.245:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [590371336]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-69.otherlevels.com][inet[/10.107.13.212:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [735000706]]; 
"},{"status":500,"reason":"RemoteTransportException[[es-trk-data-60.otherlevels.com][inet[/10.107.13.194:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [691684634]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-60.otherlevels.com][inet[/10.107.13.194:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [691684635]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-69.otherlevels.com][inet[/10.107.13.212:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [735000707]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-73.otherlevels.com][inet[/10.107.13.222:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [501628671]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-54.otherlevels.com][inet[/10.107.13.245:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [590371338]]; "},{"status":500,"reason":"RemoteTransportException[[es-trk-data-54.otherlevels.com][inet[/10.107.13.245:9300]][search/phase/scan/scroll]]; nested: SearchContextMissingException[No search context found for id [590371337]]; "}]},"hits":{"total":18,"max_score":0,"hits":[]}}

@renewooller this sounds like you are not using the new scroll ID returned with the previous scroll request.

@clintongormley Going back to the original issue, we don't use scan and scroll but still see this exception resulting in search request failures. Do you know what can cause this issue? Can additional logging help debug this issue? If yes, do let me know what logging needs to be enabled.

@bittusarkar are you seeing long GC (garbage collection) warnings in the logs?

Seeing this as well on found.no with parallel scan queries

NotFoundError: TransportError(404, u'{"_scroll_id":"c2NhbjswOzE7dG90YWxfaGl0czoxMzM7","took":9,"timed_out":false,"_shards":{"total":1,"successful":0,"failed":1,"failures":[{"status":404,"reason":"SearchContextMissingException[No search context found for id [15839]]"}]},"hits":{"total":133,"max_score":0.0,"hits":[]}}')

@sylvinus you can't use scan/scroll in parallel - the scroll ID from the previous scroll request needs to be passed to the next scroll request. Instead, you can use multiple scan/scroll requests each with a filter to segment the original index into (eg) documents from different months.
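For illustration, here is a minimal sketch of that pattern with the official Python client's scan helper; the endpoint, index name, date field, and ranges are placeholders, not from this thread. Each call owns its own independent scroll context, so the segments can safely run in parallel:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch('<url>')  # hypothetical endpoint

def scan_range(gte, lt):
    # Each segment is an ordinary scan/scroll over a disjoint slice of the data,
    # here filtered on a hypothetical "timestamp" field.
    query = {"query": {"range": {"timestamp": {"gte": gte, "lt": lt}}}}
    return scan(client=es, index='<index_name>', query=query, scroll='5m')

january_docs = scan_range('2015-01-01', '2015-02-01')
february_docs = scan_range('2015-02-01', '2015-03-01')

Each generator can then be consumed by its own worker, instead of several workers sharing one scroll ID.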

+1 We get the same scroll errors, and we do not use scroll queries.
But I do not see such errors on the development node. The difference is that dev is a single node without Marvel, while production is a 4-node cluster (2x2) with Marvel installed.

I've done some digging. First, a search request comes into a coordinating node. It forwards the request to each involved shard to execute the query phase. Once all shards have responded, it reduces the results then fetches results from each shard in the fetch phase.

If a single shard takes a long time to return from the query phase, then the search context might have expired on the other shards, resulting in a search context missing exception during the fetch phase. The search context is kept alive for 5 minutes by default before being released, so this points to one node being completely unresponsive (eg due to long GC) for an extended period.

The keep alive duration is affected by the search timeout setting.
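As a point of reference, here is a minimal sketch (not from this thread) of where those two settings live when using the Python client; the endpoint, index name, and query are placeholders:

from elasticsearch import Elasticsearch

es = Elasticsearch('<url>')  # hypothetical endpoint

# Plain search: the optional "timeout" in the request body is the search timeout
# parameter referred to above; it bounds how long shards spend on this request.
es.search(index='<index_name>', body={"timeout": "30s", "query": {"match_all": {}}})

# Scroll search: the "scroll" value is the keep-alive of the search context and
# must be re-sent with every scroll call to keep the context open.
page = es.search(index='<index_name>', scroll='5m', body={"query": {"match_all": {}}})
while page['hits']['hits']:
    page = es.scroll(scroll_id=page['_scroll_id'], scroll='5m')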

Questions:

  • Are you seeing log messages about slow garbage collections?
  • Are you using the search timeout parameter?

The original author and I can repeat it again: WE ARE NOT USING SCROLL QUERIES when these errors occur, so we are not using a timeout parameter either.

@comdiv The search API also supports the timeout parameter. My previous comment doesn't refer to scroll requests at all. If you're not using timeouts, then please look in your logs for slow garbage collections.

@clintongormley Sorry for the delayed response. Thanks a lot for explaining the scenario in which this error can occur. I did check our logs but did not find any significant GC activity around the time this error occurred. We do not use a timeout with the search API, but if the default is 5 minutes, I think we can live with that.

@bittusarkar I'm trying to figure out why these exceptions are being thrown. Essentially, a search context needs to be inactive for 5 minutes (or the timeout) for it to be purged, so if you're not using timeouts and you don't have really slow GCs, then I'm not sure where this could come from...

Removing the discuss label for now since we're waiting for feedback.

I seem to be affected by this issue. Apologies if this is something different.

I'm copying data from one index to another using the scan and scroll API through PHP.

The code loops through the documents, getting the scroll ID and then pulling the documents out. It works fine until the last c. 10% of the documents, where the client returns false and this error is logged:

[2015-08-05 16:18:38,653][DEBUG][action.search.type       ] [client-node] [455550] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [dev-es-05-02][inet[/192.168.119.232:9301]][indices:data/read/search[phase/scan/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [455550]
        at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
        at org.elasticsearch.search.SearchService.executeScan(SearchService.java:265)
        at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:939)
        at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:930)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2015-08-05 16:18:38,653][DEBUG][action.search.type       ] [client-node] [455882] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [dev-es-05-01][inet[/192.168.119.232:9300]][indices:data/read/search[phase/scan/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [455882]
        at org.elasticsearch.search.SearchService.findContext(SearchService.java:537)
        at org.elasticsearch.search.SearchService.executeScan(SearchService.java:265)
        at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:939)
        at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:930)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

There are two instances on the server dev-es-05, named dev-es-05-01 and -02 respectively. Nothing is logged on that server at the time, and two separate JVM processes erroring at the same time would indicate it's the local client node causing the error rather than the target node mentioned in the errors.

I've tried upping the scroll param from '30s' to '2m', which doesn't fix it. I'm using size 10000 and will try various other sizes.

The versions are 1.7.0 and Oracle JVM 1.8.0_31.

I can give sample code to show how I'm doing the scan and scroll, but essentially it's just like the sample code:

$docs = $pull_client->search($search_params);   // Execute the search
$scroll_id = $docs['_scroll_id'];               // The response will contain no results, just a _scroll_id

while (true) {
    $response = $pull_client->scroll(
        array(
            "scroll_id" => $scroll_id,  // ...using our previously obtained _scroll_id
            "scroll"    => "5m"         // and the same timeout window
        )
    );

    foreach ($response['hits']['hits'] as $doc) {
        // ...do stuff...
    }

    $scroll_id = $response['_scroll_id'];
}

@dirk1978 you should stop pulling when you get back zero hits.

@clintongormley good shout. I was following this: https://github.com/elastic/elasticsearch-php/issues/138#issuecomment-58014781 which is obviously not the best way to do it.

edit: In fact, sorry, I do stop pulling when I get back zero hits:

$hits = count($response['hits']['hits']);
if ($hits > 0) {
    // If yes, Do Work Here
    $dataload->debug('received '.$hits.' hits');
    foreach ( $response['hits']['hits'] as $doc ) {

I left it out of the snippet above for brevity.

@dirk1978 if you look at https://github.com/elastic/elasticsearch-php/issues/138#issuecomment-58014781 again, you will see this line:

    if (count($response['hits']['hits']) > 0) {

Closing due to lack of feedback

I have the same issue when I search on several indices and then scroll the result. Also happens with an alias that points to several indices.

Workaround: run the search for each index separately.
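A sketch of that workaround with the Python client (endpoint and index names are placeholders): each index gets its own scroll context instead of one scroll across an alias.

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch('<url>')  # hypothetical endpoint

# Instead of scanning one alias that spans several indices, scan each index in turn.
for index in ['<index-2016.01>', '<index-2016.02>']:
    for doc in scan(client=es, index=index, query={"query": {"match_all": {}}}, scroll='5m'):
        pass  # ... process doc ...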

@pmaury If you're getting this error, then please create a new issue with the way to reproduce it. Specifically, please detail how you are getting and setting the scroll ID for each request.

ok... I found the source of my problem. It turns out the bug is in my code... sorry.

As mentioned earlier in this thread, and in the documentation:
"The initial search request and each subsequent scroll request returns a new _scroll_id — only the most recent _scroll_id should be used"

I failed to notice the problem because from the client perspective, everything is working fine: no exception, no error, and all the results are returned (it seems at least). The only trace of the problem is a log on the server side.
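For anyone running into the same thing, a minimal sketch of the documented pattern in Python (endpoint and index name are placeholders); the point is that every response's _scroll_id replaces the previous one:

from elasticsearch import Elasticsearch

es = Elasticsearch('<url>')  # hypothetical endpoint

resp = es.search(index='<index_name>', scroll='5m', body={"query": {"match_all": {}}})
scroll_id = resp['_scroll_id']

while resp['hits']['hits']:
    # ... process resp['hits']['hits'] ...
    resp = es.scroll(scroll_id=scroll_id, scroll='5m')
    scroll_id = resp['_scroll_id']  # always carry the newest _scroll_id forward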

I'm getting this error using Logstash to pull from one Elasticsearch index to another index. It's causing problems because when this error occurs, Logstash creates duplicate entries in the target index.

Does using the scroll_id reset its ttl? My timeout is set to 10m, and processing all the results takes longer than 10m, _but the time in between_ each use of the scroll_id is definitely under 10m.

@cjbottaro no - you have to pass a ?scroll=10m with every scroll request to extend the scroll for another 10 minutes

We are getting this issue too.

We are running ES v 1.5.2. Similar to bittusarkar, we aren't using scan and scroll, or timeouts in our search requests, and we don't see any reference to the GC being slow in our logs.

This message got spammed to our logs for a whole day this weekend, with about 5 different search context ids. It happened just over a month ago, for a shorter period of time. It only seems to affect one cluster.

We suspect it might be the Marvel plugin (v.1.3.1), as it has only been intermittently used in the past, but this weekend it was left in an open window overnight.

To try and find out more, we will attempt the following:

  • Set an alert on that particular message so we are aware of it as soon as it happens next
  • When it does, enable slowlog for a bit to see what queries we are receiving, and try closing Marvel to see if it fixes it
  • If it is Marvel, I guess we will try updating it?

Does that sound about right @clintongormley?

[2016-02-27 03:50:59,999][DEBUG][action.search.type ] [<Master node name>] [324418689] Failed to execute query phase
org.elasticsearch.transport.RemoteTransportException: [<Non-master node name>][inet[/xx.x.xx.xx:9300]][indices:data/read/search[phase/scan/scroll]]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [324418689]
    at org.elasticsearch.search.SearchService.findContext(SearchService.java:522)
    at org.elasticsearch.search.SearchService.executeScan(SearchService.java:250)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:939)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchScanScrollTransportHandler.messageReceived(SearchServiceTransportAction.java:930)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:277)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Hi @isabel12

This issue appears to have been resolved in later versions, as we haven't seen any reports of it for a while. 1.5.2 is pretty old, so I suggest upgrading.

Thanks @clintongormley. Yeah, that makes sense. I guess it seems harmless enough, and we will upgrade sometime soon. Good to know upgrading will probably resolve it.

Getting the same in version:

"version" : {
  "number" : "2.2.0",
  "build_hash" : "8ff36d139e16f8720f2947ef62c8167a888992fe",
  "build_timestamp" : "2016-01-27T13:32:39Z",
  "build_snapshot" : false,
  "lucene_version" : "5.4.1"
},

@fcallejon there are all sorts of things discussed above. If you're still seeing this in 2.2.0, please open a new issue describing your cluster, settings, what you are doing, and exactly what exceptions you're seeing.

I'm bulk transferring data from an ES 1.4 index to ES 2.1 using the Elasticsearch Python client. As I scan and scroll through the source index, I bulk-index into the new cluster, and I get a SearchContextMissingException[No search context found for id [...]]. Here's the Python code:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan, bulk
import json

es_source = Elasticsearch('<url1>', timeout=120)
es_target = Elasticsearch('<url2>', timeout=120)

source_index = '<source_index_name>'
target_index = '<target_index_name>'

with open('target_mapping.json') as mappingfile:
    mapping = json.load(mappingfile)
    es_target.indices.create(target_index, body=mapping)

def generate_actions(docs):
    for doc in docs:

        # Some transformation applied to the doc to conform to target mapping

        doc['_index'] = target_index

        yield doc

docs = scan(client=es_source, index=source_index, doc_type='<source_doc_type>')
actions = generate_actions(docs)
bulk(es_target, actions, raise_on_error=False, raise_on_exception=False)

Source cluster is running ES 1.4 on 3 nodes, each being a Linux VM with 32g of RAM and 2 cores. Target cluster runs ES 2.1 on 3 nodes, each being a Linux VM with 64g of RAM and 4 cores. Cluster settings are the default ones and I allocated 50% of RAM to the ES process on each node.

@cwarny it would be helpful to trace what is actually happening in curl (requests and responses, along with timestamps). Also, it may be that there was a bug in 1.4 and above which was causing this; I have yet to see a confirmed report of this in 2.x.

@cwarny plus the elasticsearch logs

I am using version 2.4.1, and this issue also appears there.

I am trying to migrate an Elasticsearch index from 1.4 to 5.1 and am getting this error:

shard [_na], reason [RemoteTransportException[[es-data-nqa-normal-elastic-search-4-1d][inet[/x.x.x.x:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [70501051]]; ]
shard [_na], reason [RemoteTransportException[[es-data-nqa-normal-elastic-search-4-1d][inet[/x.x.x.x:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [70501053]]; ]
shard [_na], reason [RemoteTransportException[[es-data-nqa-normal-elastic-search-6-1d][inet[/x.x.x.x:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [4620964]]; ]
shard [_na], reason [RemoteTransportException[[es-data-nqa-normal-elastic-search-2-1d][inet[/x.x.x.x:9300]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [150346352]]; ]

    at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.throwOnShardFailures(ESScan.java:167)
    at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.doNext(ESScan.java:132)
    at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.next(ESScan.java:123)
    at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.next(ESScan.java:73)
    at 

Most likely you were running a scroll search and waited too long between two consecutive calls, so the context was deleted on the Elasticsearch side in order to reclaim resources. It is unlikely to be related to the upgrade.

I am migrating an ES cluster from 1.4 to 5.4.

Still getting the same error:
[indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [4003744]
I successfully migrated the dev and testing clusters, but I am getting the error while migrating the prod cluster.

I have increased the scroll time to 30 minutes and decreased the bulk size to 100. I have run the migration multiple times. Every time, after running for a few hours, it fails with the error below:

error:org.elasticsearch.ElasticsearchException: 4 shards failed while scrolling -
shard [_na], reason [RemoteTransportException[[es-data-elastic-paid1-30-1e][inet[/x.x.x.x:9200]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [4003744]]; ]
shard [_na], reason [RemoteTransportException[[es-data-elastic-paid1-38-1e][inet[/y.y.y.y:9200]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [4126954]]; ]
shard [_na], reason [RemoteTransportException[[es-data-elastic-paid1-26-1e][inet[/u.u.u.u:9200]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [3512259]]; ]
shard [_na], reason [RemoteTransportException[[es-data-elastic-paid1-28-1e][inet[/z.z.z.z:9200]][indices:data/read/search[phase/scan/scroll]]]; nested: SearchContextMissingException[No search context found for id [3821269]]; ]
at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.throwOnShardFailures(ESScan.java:167)
at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.doNext(ESScan.java:132)
at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.next(ESScan.java:123)
at com.spr.search.repository.elasticsearch.ESScan$ESScanIterator.next(ESScan.java:73)
at

Scan<SearchHit> scroll = new ESScan<>(new Lookup<ESClientDetails>() {
    @Override
    public ESClientDetails lookup() {
        return finalFromEsClient;
    }
}, new Adapter<SearchHit, SearchHit>() {
    @Override
    public SearchHit adapt(SearchHit input) {
        return input;
    }
}, searchRequestBuilder, bulkIndexSize);

for (final List<SearchHit> searchHits : scroll) {
    // Do processing
}

This snippet does not help since I have no idea what the ESScan class does. I'd suggest that you add logging to your application in order to report on intervals between calls to the scroll API, and also review your code in order to make sure that you use the scroll ID returned by the last scroll and not always the initial scroll ID.
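To make that concrete, here is a sketch of the kind of interval logging meant here, written in Python for brevity (the Java code above would do the equivalent); the endpoint, index name, and logger name are placeholders:

import logging
import time

from elasticsearch import Elasticsearch

log = logging.getLogger('scroll-timing')
es = Elasticsearch('<url>')  # hypothetical endpoint

resp = es.search(index='<index_name>', scroll='30m', body={"query": {"match_all": {}}})
scroll_id = resp['_scroll_id']
last_call = time.monotonic()

while resp['hits']['hits']:
    # ... process the batch here ...
    resp = es.scroll(scroll_id=scroll_id, scroll='30m')
    scroll_id = resp['_scroll_id']  # use the scroll ID returned by the last call
    now = time.monotonic()
    log.info('interval since previous scroll call: %.1fs', now - last_call)
    last_call = now

If the logged interval ever approaches the keep-alive (30 minutes in the comment above), the context will have expired before the next call.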

I'm experiencing the same issue on 5.6.4 during a bulk search composed of several match queries, over an index that is not being written to. Most of the queries in the batch end up timing out (after 1h).

[2019-01-08T22:39:00,784][INFO ][o.e.n.Node ] [esnode1] started [2019-01-08T22:39:00,999][INFO ][o.e.g.GatewayService ] [esnode1] recovered [1] indices into cluster_state [2019-01-08T22:39:10,035][INFO ][o.e.c.r.a.AllocationService] [esnode1] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[myIndex][3]] ...]). [2019-01-08T22:40:32,824][INFO ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][95] overhead, spent [253ms] collecting in the last [1s] [2019-01-08T23:04:20,700][DEBUG][o.e.a.s.TransportSearchAction] [esnode1] [919] Failed to execute fetch phase org.elasticsearch.transport.RemoteTransportException: [esnode1][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]] Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [919] at org.elasticsearch.search.SearchService.findContext(SearchService.java:447) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:414) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:407) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:404) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.4.jar:5.6.4] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] [2019-01-08T23:06:06,766][DEBUG][o.e.a.s.TransportSearchAction] [esnode1] [775] Failed to execute fetch phase org.elasticsearch.transport.RemoteTransportException: [esnode1][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]] Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [775] at org.elasticsearch.search.SearchService.findContext(SearchService.java:447) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:414) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:407) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:404) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.4.jar:5.6.4] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] [2019-01-08T23:06:10,257][DEBUG][o.e.a.s.TransportSearchAction] [esnode1] [675] Failed to execute fetch phase org.elasticsearch.transport.RemoteTransportException: [esnode1][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]] Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [675] at org.elasticsearch.search.SearchService.findContext(SearchService.java:447) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:414) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:407) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:404) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.4.jar:5.6.4] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] [2019-01-08T23:07:32,316][DEBUG][o.e.a.s.TransportSearchAction] [esnode1] [927] Failed to execute fetch phase org.elasticsearch.transport.RemoteTransportException: [esnode1][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]] Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [927] at org.elasticsearch.search.SearchService.findContext(SearchService.java:447) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:414) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:407) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:404) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.4.jar:5.6.4] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] 
[2019-01-08T23:09:21,053][DEBUG][o.e.a.s.TransportSearchAction] [esnode1] [1047] Failed to execute fetch phase org.elasticsearch.transport.RemoteTransportException: [esnode1][127.0.0.1:9300][indices:data/read/search[phase/fetch/id]] Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [1047] at org.elasticsearch.search.SearchService.findContext(SearchService.java:447) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:414) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:407) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.action.search.SearchTransportService$12.messageReceived(SearchTransportService.java:404) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:654) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.4.jar:5.6.4] at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.4.jar:5.6.4] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] [2019-01-08T23:11:26,716][WARN ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][young][1940][74] duration [3.8s], collections [1]/[4.9s], total [3.8s]/[9s], memory [1.3gb]->[1gb]/[7.9gb], all_pools {[young] [266.2mb]->[17mb]/[266.2mb]}{[survivor] [1.8mb]->[1.6mb]/[33.2mb]}{[old] [1gb]->[1gb]/[7.6gb]} [2019-01-08T23:11:26,718][WARN ][o.e.m.j.JvmGcMonitorService] [esnode1] [gc][1940] overhead, spent [3.8s] collecting in the last [4.9s]

Cluster stats seem ok:
{ "_nodes" : { "total" : 1, "successful" : 1, "failed" : 0 }, "cluster_name" : "escluster1", "timestamp" : 1546987424259, "status" : "green", "indices" : { "count" : 1, "shards" : { "total" : 10, "primaries" : 10, "replication" : 0.0, "index" : { "shards" : { "min" : 10, "max" : 10, "avg" : 10.0 }, "primaries" : { "min" : 10, "max" : 10, "avg" : 10.0 }, "replication" : { "min" : 0.0, "max" : 0.0, "avg" : 0.0 } } }, "docs" : { "count" : 5746677, "deleted" : 0 }, "store" : { "size" : "181.5gb", "size_in_bytes" : 194959969930, "throttle_time" : "0s", "throttle_time_in_millis" : 0 }, "fielddata" : { "memory_size" : "0b", "memory_size_in_bytes" : 0, "evictions" : 0 }, "query_cache" : { "memory_size" : "0b", "memory_size_in_bytes" : 0, "total_count" : 0, "hit_count" : 0, "miss_count" : 0, "cache_size" : 0, "cache_count" : 0, "evictions" : 0 }, "completion" : { "size" : "0b", "size_in_bytes" : 0 }, "segments" : { "count" : 293, "memory" : "897.2mb", "memory_in_bytes" : 940811637, "terms_memory" : "893.8mb", "terms_memory_in_bytes" : 937313273, "stored_fields_memory" : "3.2mb", "stored_fields_memory_in_bytes" : 3415152, "term_vectors_memory" : "0b", "term_vectors_memory_in_bytes" : 0, "norms_memory" : "54.9kb", "norms_memory_in_bytes" : 56256, "points_memory" : "0b", "points_memory_in_bytes" : 0, "doc_values_memory" : "26.3kb", "doc_values_memory_in_bytes" : 26956, "index_writer_memory" : "0b", "index_writer_memory_in_bytes" : 0, "version_map_memory" : "0b", "version_map_memory_in_bytes" : 0, "fixed_bit_set" : "0b", "fixed_bit_set_memory_in_bytes" : 0, "max_unsafe_auto_id_timestamp" : -1, "file_sizes" : { } } }, "nodes" : { "count" : { "total" : 1, "data" : 1, "coordinating_only" : 0, "master" : 1, "ingest" : 1 }, "versions" : [ "5.6.4" ], "os" : { "available_processors" : 4, "allocated_processors" : 4, "names" : [ { "name" : "Linux", "count" : 1 } ], "mem" : { "total" : "15.3gb", "total_in_bytes" : 16430301184, "free" : "171.7mb", "free_in_bytes" : 180113408, "used" : "15.1gb", "used_in_bytes" : 16250187776, "free_percent" : 1, "used_percent" : 99 } }, "process" : { "cpu" : { "percent" : 7 }, "open_file_descriptors" : { "min" : 214, "max" : 214, "avg" : 214 } }, "jvm" : { "max_uptime" : "1h", "max_uptime_in_millis" : 3900907, "versions" : [ { "version" : "1.8.0_191", "vm_name" : "OpenJDK 64-Bit Server VM", "vm_version" : "25.191-b12", "vm_vendor" : "Oracle Corporation", "count" : 1 } ], "mem" : { "heap_used" : "1gb", "heap_used_in_bytes" : 1154171376, "heap_max" : "7.9gb", "heap_max_in_bytes" : 8555069440 }, "threads" : 45 }, "fs" : { "total" : "294.9gb", "total_in_bytes" : 316734971904, "free" : "113.3gb", "free_in_bytes" : 121704624128, "available" : "98.3gb", "available_in_bytes" : 105591787520, "spins" : "true" }, "plugins" : [ ], "network_types" : { "transport_types" : { "netty4" : 1 }, "http_types" : { "netty4" : 1 } } } }
