Elasticsearch version (bin/elasticsearch --version): 7.2.0, but it can also be reproduced on previous versions
Description of the problem including expected versus actual behavior:
Elasticsearch does not wait for already open connections (in-flight search requests) to complete, so when SIGTERM is sent those connections are dropped.
Expected behavior: complete the requests that are already in flight.
Actual behavior: all active connections are dropped. If Elasticsearch is stopped while the following request is executing:
curl -sS -XGET -o /dev/null -H 'Content-Type: application/json' http://search-client:9200/_search -d'....'
I get:
curl: (52) Empty reply from server
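As a client-side mitigation (not a fix), the request can be retried when a node closes the connection mid-shutdown. A minimal sketch, assuming curl >= 7.71.0 for --retry-all-errors and that search-client fronts more than one node so a retry can land on a live one:

#!/usr/bin/env bash
# Retry the search if the node drops the connection during shutdown.
# Assumes curl >= 7.71.0 (--retry-all-errors) and that search-client
# routes retries to a surviving node; the query body is illustrative.
curl -sS -XGET -o /dev/null \
  --retry 3 --retry-delay 1 --retry-all-errors \
  -H 'Content-Type: application/json' \
  http://search-client:9200/_search -d '{"query": {"match_all": {}}}'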
Steps to reproduce:
The initial issue is https://github.com/elastic/helm-charts/issues/190;
reproduction steps can be found in this comment: https://github.com/elastic/helm-charts/issues/190#issuecomment-516458091:
1. Run the following script:
#!/usr/bin/env bash
CPUS="0.1"
SLEEP_TIME="5"
set -euo pipefail
function sleep_then_stop () {
  sleep $SLEEP_TIME
  docker stop -t 60 elasticsearch
  docker logs --since 30s elasticsearch
}
function query_es () {
  time curl -XGET -o /dev/null -H 'Content-Type: application/json' http://localhost:9200/_search -d'
  {
    "query": {
      "bool": {
        "must": {
          "match_all": {}
        },
        "should" : [
          { "regexp" : { "ip" : ".<1-92>*<1-5><0-7>*" } },
          { "regexp" : { "agent" : ".oz*5*" } },
          { "regexp" : { "referer" : "..tp?://.*scom|org" } }
        ]
      }
    },
    "sort": [
      { "utc_time": { "order": "desc" }}
    ],
    "size": 10000
  }'
}
if docker inspect elasticsearch; then
docker start elasticsearch
else
docker run --name elasticsearch -d -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.2.0
fi
docker update --cpus 2 elasticsearch
until curl localhost:9200; do sleep 1; done
if ! [ -f "logs.jsonl" ] ; then
curl -O https://download.elastic.co/demos/kibana/gettingstarted/8.x/logs.jsonl.gz
gunzip logs.jsonl.gz
fi
if ! curl localhost:9200/_cat/indices | grep logstash; then
curl -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk?pretty' --data-binary @logs.jsonl
fi
echo "*********************"
echo "Before CPU limit"
query_es
echo "*********************"
echo
docker update --cpus $CPUS elasticsearch
echo "*********************"
echo "After CPU limit"
query_es
echo "*********************"
echo "*********************"
echo "After CPU limit with graceful stop after $SLEEP_TIME seconds"
sleep_then_stop &
query_es
echo "*********************"
2. sleep_then_stop causes curl: (52) Empty reply from server
Provide logs (if relevant):
{"type": "server", "timestamp": "2019-07-30T14:52:13,915+0000", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "[gc][20] overhead, spent [501ms] collecting in the last [1.5s]" }
{"type": "server", "timestamp": "2019-07-30T14:52:25,717+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopping ..." }
{"type": "server", "timestamp": "2019-07-30T14:52:26,418+0000", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "flood stage disk watermark [95%] exceeded on [WmLyycgDR32y1ULuaD1npw][40a5dff89f7f][/usr/share/elasticsearch/data/nodes/0] free: 1.6gb[2.8%], all indices on this node will be marked read-only" }
{"type": "server", "timestamp": "2019-07-30T14:52:27,216+0000", "level": "INFO", "component": "o.e.x.w.WatcherService", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopping watch service, reason [shutdown initiated]" }
{"type": "server", "timestamp": "2019-07-30T14:52:27,614+0000", "level": "INFO", "component": "o.e.x.m.p.l.CppLogMessageHandler", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "[controller/101] [Main.cc@150] Ml controller exiting" }
{"type": "server", "timestamp": "2019-07-30T14:52:27,813+0000", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "Native controller process has stopped - no new native processes can be started" }
{"type": "server", "timestamp": "2019-07-30T14:52:30,021+0000", "level": "ERROR", "component": "i.n.u.c.D.rejectedExecution", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "Failed to submit a listener notification task. Event loop shut down?" ,
"stacktrace": ["java.util.concurrent.RejectedExecutionException: event executor terminated",
"at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:846) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:329) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:322) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:757) ~[netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.DefaultPromise.safeExecute(DefaultPromise.java:766) [netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:423) [netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540) [netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533) [netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.util.concurrent.DefaultPromise.setFailure(DefaultPromise.java:106) [netty-common-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.DefaultChannelPromise.setFailure(DefaultChannelPromise.java:89) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.safeExecute(AbstractChannelHandlerContext.java:1020) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:805) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:762) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1031) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:310) [netty-transport-4.1.35.Final.jar:4.1.35.Final]",
"at org.elasticsearch.http.netty4.Netty4HttpChannel.sendResponse(Netty4HttpChannel.java:43) [transport-netty4-client-7.2.0.jar:7.2.0]",
"at org.elasticsearch.http.DefaultRestChannel.sendResponse(DefaultRestChannel.java:123) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:498) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.rest.action.RestActionListener.onResponse(RestActionListener.java:47) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:316) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.onResponse(AbstractSearchAsyncAction.java:51) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:166) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:120) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:166) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.CountedCollector.countDown(CountedCollector.java:53) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.CountedCollector.onResult(CountedCollector.java:64) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:166) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:163) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:45) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:29) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:432) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.transport.TransportService$DirectResponseChannel.processResponse(TransportService.java:1182) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1162) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:47) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:30) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.search.SearchService$1.doRun(SearchService.java:347) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:758) [elasticsearch-7.2.0.jar:7.2.0]",
"at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.2.0.jar:7.2.0]",
"at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]",
"at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]",
"at java.lang.Thread.run(Thread.java:835) [?:?]"] }
{"type": "server", "timestamp": "2019-07-30T14:52:31,517+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "stopped" }
{"type": "server", "timestamp": "2019-07-30T14:52:31,518+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "closing ..." }
{"type": "server", "timestamp": "2019-07-30T14:52:32,115+0000", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "docker-cluster", "node.name": "40a5dff89f7f", "cluster.uuid": "ErXDUGkORl--NnJwiYFINw", "node.id": "WmLyycgDR32y1ULuaD1npw", "message": "closed" }
Pinging @elastic/es-core-infra
Pinging @elastic/es-docs
This was discussed in FixitFriday. We don't think that Elasticsearch should keep processing any requests after a SIGTERM has been issued, be they search, index or snapshot requests. We have some internal logic that keeps some tasks running after a SIGTERM has been issued but the goal is only to clean things up properly, not to complete ongoing requests.
We should better document guarantees regarding shutdown of the Elasticsearch process:
[doc issue triage]
It still seems like it would be good to add or improve documentation on node shutdown. It looks to me like adding a paragraph on the "Stopping Elasticsearch" page would be the way to go here, since that page mentions using SIGTERM to stop the Elasticsearch process.
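Until the documentation spells this out, the practical workaround is to drain traffic before delivering SIGTERM rather than expecting Elasticsearch to finish in-flight searches. A minimal sketch for the Docker reproduction above; the drain step and the 30-second grace period are illustrative assumptions, not an official procedure:

#!/usr/bin/env bash
# Illustrative only: stop routing new requests to the node first, give
# in-flight requests time to finish, then send SIGTERM via docker stop.
GRACE_SECONDS=30

# 1. Remove the node from the load balancer / Service endpoints so no new
#    requests reach it (how depends on the environment).
# 2. Wait for in-flight requests to drain.
sleep "$GRACE_SECONDS"
# 3. Only now deliver SIGTERM; docker stop still waits 60s before SIGKILL.
docker stop -t 60 elasticsearch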