Elasticsearch version: 5.5.0 (issue first appeared while on 5.4.1)
Plugins installed: [x-pack, repository-s3]
JVM version (java -version): openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system): Linux ip-10-127-1-159 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Trying to get the list of available snapshots on an S3-backed repository fails with a NullPointerException.
curl elasticsearch:9200/_snapshot/long_term/_all
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]"}],"type":"null_pointer_exception","reason":null},"status":500}
Elasticsearch logs:
[2017-07-25T12:01:47,038][WARN ][r.suppressed ] path: /_snapshot/long_term/_all, params: {repository=long_term, snapshot=_all}
org.elasticsearch.transport.RemoteTransportException: [SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]
Caused by: java.lang.NullPointerException
I use curator to take the backups. After taking backups successfully, it fails when it tries to delete old snapshots, because that step also requires a listing:
2017-07-25 11:53:02,191 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Unable to get snapshot information from repository: long_term. Error: TransportError(500, 'null_pointer_exception', '[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]')
I have a feeling this is due to some kind of timeout. I turned on debug logging, and while I couldn't find a more specific reason for the failure, I noticed it made ~2K requests to S3 before failing, and it stopped at 90 seconds. Is this a configurable timeout?
In the past, getting a list of snapshots took increasingly long, but it eventually responded. Now it breaks earlier than that.
Also posted on the forums: https://discuss.elastic.co/t/nullpointerexception-when-getting-list-of-snapshots-on-s3/94458
Could you paste the full stack trace from the Elasticsearch server logs?
There are no more logs for the null pointer entry. There are a ton of logs for the headers of each of the 2K requests; do you want me to post those? All of those responded with 200 OK though.
These should be the logs from the last request before the null pointer. I tried to censor out any possibly sensitive info. Maybe the returned payload was what triggered the issue?
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAddCookies] CookieSpec selected: default
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAuthCache] Auth cache not set in the context
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestProxyAuthentication] Proxy auth state: UNCHALLENGED
[2017-07-25T12:27:45,437][DEBUG][c.a.h.i.c.SdkHttpClient ] Attempt 1 to execute request
[2017-07-25T12:27:45,437][DEBUG][o.a.h.i.c.DefaultClientConnection] Sending request: GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "x-amz-content-sha256: *********************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-d
ate;x-amz-security-token, Signature=***************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "X-Amz-Date: 20170725T092745Z[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "X-Amz-Security-Token: **********************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "amz-sdk-retry: 0/0/[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Connection: Keep-Alive[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-dat
e;x-amz-security-token, Signature=***************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> X-Amz-Date: 20170725T092745Z
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> X-Amz-Security-Token: **********************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> amz-sdk-retry: 0/0/
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Content-Type: application/octet-stream
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Connection: Keep-Alive
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "HTTP/1.1 200 OK[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "x-amz-id-2: ************************[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "x-amz-request-id: 3E117E943CA08991[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Date: Tue, 25 Jul 2017 09:27:46 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "ETag: "8e87c087b7474433ba26057f74233e5a"[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Accept-Ranges: bytes[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Content-Length: 302[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Server: AmazonS3[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << x-amz-id-2: *************************************
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << x-amz-request-id: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Date: Tue, 25 Jul 2017 09:27:46 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << ETag: "8e87c087b7474433ba26057f74233e5a"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Accept-Ranges: bytes
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Content-Type: application/octet-stream
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Content-Length: 302
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Server: AmazonS3
[2017-07-25T12:27:45,509][DEBUG][c.a.h.i.c.SdkHttpClient ] Connection can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,509][DEBUG][c.a.request ] Received successful response: 200, AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "?[0xd7]l[0x17][0x8]snapshot[0x0][0x0][0x0][0x1]DFL[0x0]l[0x92][0xcd]J[0x3]1[0x14][0x85][0xe9][0xc2][0x85][0xe2]SH]t[0xd1]![0xc9]Lg[0xda][0xee]tP[0x17]B[0x17][0xa6][0xed]B[0x90]!4[0x19][0x9a][0xd2]fln[0xd2][0x95]++[0xe]E[0x90]y[0xdc][0xfe]l[0x1c]DQ[0xe8][0x85]lr[0xf8][0xce][0xb9][0xf7][0x90][0xf4][g'[0xfb][0x12][0x8c]x[0x86]i[0xe1][0xf6]k#[0x16][0xea]IHY[0x18]h[0xff][0xaa]mFhB[0x12][0xda]#[0x94][0xc4][0x9d]h[0xed][0xbd][0x96][0xa3][0xbb][0x7];[0xec][0xa6][0xf7]3[0x9e],[0xe5]2b[0x83][0xc7]<[0x1c][0xb2][0xab][0xcd]JY[0xd0][0x85][0xc9][0xb4]l[0x9e][0xe][0xae]?[0xdf][0xb5][0x91]z[0xa2]`[0xcb]^?2W[0xf4];- [0xf5][0x89][0x10][0x91]v02[0xc1]H[0x8a][0x91]=[0x8c]D[0xed]1f?[0x9e][0x1e][0x7][0xec]83[0xe]B[0x82][0xd9][0x19]&v[0xb1][0x95][0xd0][0xee][0x18]I[0xb0][0x9a][0x14][0x9b]NCl:Z[0x13]#)[0xdb][0xbd][0x81][0x13]N[0xdd][0xf2]Q[0x9a][0xde]p[0xbe][0xa9]o[0xd6]eN/[0xd4]e#[0x18][0x9f]_[0xbc][0xbc][0x96][0xca][0xc8]?[0xa1]e[0xaa][0xf]W81[0xcf]`*[0xac][0x84]f[0xa3][0xaa][0xc0]O[0xea][0xe7][0x86][0xdc][0xff][0x13][0xcb]\[0xe8][0xb9][0xb7][0xf5]/[0xd8][0x1d][0xe]_[0x0][0x0][0x0][0xff][0xff][0x3][0x0][0xc0]([0x93][0xe8][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0xf4][0x1f]J[0xbe]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-*************.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-******************.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
FYI we use a coordinating node and 3 data nodes. I send the snapshot requests to the coordinating node, and all the S3 requests seem to originate from the data node that's currently the master (10.127.1.203).
Some more logs:
I see ~1k of these logs 15 seconds after the start of the request and ~500 at the end:
[2017-07-25T12:27:46,968][DEBUG][o.e.s.SearchService ] [SVVyQPF] freeing search context [1977515], time [225057509], lastAccessTime [224978476], keepAlive [30000]
These pop up between requests:
[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
These are the things logged on the master node around the time the coordinating node logged the exception (excluding the freeing search context logs mentioned above):
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-***********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,541][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221514] overhead, spent [106ms] collecting in the last [1s]
[2017-07-25T12:27:47,497][DEBUG][o.e.x.m.a.GetDatafeedsStatsAction$TransportAction] [SVVyQPF] Get stats for datafeed '_all'
[2017-07-25T12:27:47,652][DEBUG][o.e.x.m.e.l.LocalExporter] monitoring index templates and pipelines are installed on master node, service can start
[2017-07-25T12:27:48,542][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221517] overhead, spent [111ms] collecting in the last [1s]
Hmm, I don't see any smoking gun here. I am not really sure how to move forward with this without knowing where this NPE occurs or being able to reproduce this issue locally.
Ok, as I understand it there should have been a stack trace after the "caused by" line, right? Maybe we can look into why that's not present and then we'll have more info for the specific issue? Also there's that r.suppressed thing. That would at least point to the class in which the NPE occurred, but that's not available either. Can I configure something to make that visible?
@eirc you said that
These should be the logs from the last request before the null pointer
but the timestamps of these logs are 12:27, whereas the NPE has a timestamp of 12:01.
Can you provide the full logs from both the master node and the coordinating node? (You can share them in private with us if you don't want to post them publicly)
@eirc, @ywelsch and I discussed this more and we have a couple of other things we would like you to try:
1) could you execute curl elasticsearch:9200/_snapshot/long_term/_all?error_trace=true and see if the stack trace shows up there
2) could you execute curl localhost:9200/_snapshot/long_term/_all on the current master node. And if it works, but still fails when you execute it against a coordinating node we would really appreciate this output as well.
Regarding the time discrepancies, the NPE happens every time I request a listing. At 12:27 I had debug logging on, so that's why most of the logs are from that time. The 12:01 entry was probably from one of the first tests. The same NPE was logged at 12:27 and every time I made a listing request.
Ok now there's some light at the end of the tunnel!
First, if I get the listing from the master node it actually works! Requesting it on the coordinating (or any other) node fails with that same behaviour. Adding error_trace=true to the request finally yields some useful info:
{
"error": {
"root_cause": [{
"type": "remote_transport_exception",
"reason": "[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]",
"stack_trace": "[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\n\tat org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)\n\tat org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)\n\tat org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)\n\tat org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)\n\tat org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)\n\tat org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)\n\tat org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)\n\tat org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)\n\tat org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)\n\tat 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\nCaused by: java.lang.NullPointerException\n"
}],
"type": "null_pointer_exception",
"reason": null,
"stack_trace": "java.lang.NullPointerException\n"
},
"status": 500
}
Here's the formatted stack trace for your convenience:
[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)
at org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)
at org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)
at org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)
at org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)
at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)
at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)
at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)
Caused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException
@eirc any chance you can email me the output that you get from master? My mail is igor at elastic.co. If not, could you try getting one snapshot at a time on the coordinating node and checking what's different between the snapshots that can be retrieved and the snapshots that cause this NPE?
By the way, does the coordinating node have a different es version?
Just confirmed all Elasticsearch nodes are on 5.5.0. Can I check the plugin versions some way? When I upgraded the stack I remember I had to remove and reinstall the plugins to get the proper versions.
I'll make a script to pull each snapshot individually and see which one(s) are breaking now.
In 5.5.0 all plugins should be 5.5.0; otherwise, Elasticsearch wouldn't work. In any case, based on what we know so far, I don't think it's a plugin-related issue. Our current theory is that the snapshot info serialization code breaks on one or more snapshots that you have in your repository. However, we just reviewed this code and couldn't find any obvious issues. That's why we would like to figure out which snapshot information the master is trying to send to the coordinating node, in order to reproduce and fix the problem.
I emailed you the full snapshot list. My script ~managed to successfully grab each snapshot individually from the coordinating node~ (where grabbing them all failed). I noticed some of the snapshots have some shard failures, but that shouldn't be an issue, right? Maybe it's the size of the response that's the issue here? I have ~2k snapshots and the response is 1.2 MB.
No, scratch that: there is a single snapshot which produces the NPE when I get it on its own.
Here is the JSON I can get from the master but not from other nodes:
{
"snapshots": [
{
"snapshot": "wsj-snapshot-20170720085856",
"uuid": "yIbELYjgQN-_BgjRd4Vb0A",
"version_id": 5040199,
"version": "5.4.1",
"indices": [
"wsj-2017.07.19",
"wsj-iis-2017.07.11",
"wsj-2017.07.08",
"wsj-2017.07.15",
"wsj-2017.07.11",
"wsj-2017.07.12",
"wsj-2017.07.02",
"wsj-2017.07.10",
"wsj-2017.07.06",
"wsj-2017.06.30",
"wsj-2017.07.05",
"wsj-2017.07.14",
"wsj-2017.07.03",
"wsj-2017.07.16",
"wsj-2017.07.17",
"wsj-2017.07.07",
"wsj-2017.07.01",
"wsj-2017.07.09",
"wsj-2017.07.04",
"wsj-2017.07.18",
"wsj-2017.07.13"
],
"state": "PARTIAL",
"start_time": "2017-07-20T08:58:57.243Z",
"start_time_in_millis": 1500541137243,
"end_time": "2017-07-20T11:52:37.938Z",
"end_time_in_millis": 1500551557938,
"duration_in_millis": 10420695,
"failures": [
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 1,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.06",
"index_uuid": "wsj-2017.07.06",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.01",
"index_uuid": "wsj-2017.07.01",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.05",
"index_uuid": "wsj-2017.07.05",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.05",
"index_uuid": "wsj-2017.07.05",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.01",
"index_uuid": "wsj-2017.07.01",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 3,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 1,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 3,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
}
],
"shards": {
"total": 27,
"failed": 27,
"successful": 0
}
}
]
}
Interestingly, this snapshot includes that wsj-iis-2017.07.11 index, which we then deleted (since, due to the naming, it would get mixed up a lot with the wsj-* indices) and recreated with another name. Those IndexNotFoundException[no such index] messages look weird though, because the mentioned indices do exist, are still on the cluster, and I can query them.
🏆 Deleted the offending snapshot and the listing now works!
If you need any more info on the "bug" itself I'll be happy to provide it. Also, my issue is solved, but I'll leave this open for you to close in case you want to follow the thread deeper.
Thanks @eirc. We have found the line that is causing this NPE. We are just doing some root cause analysis at the moment to see if there is more to it. It's definitely a bug. Thanks a lot for very detailed information and your willingness to work with us on it!
@eirc I spent some time trying to reproduce the issue, but no matter what I try I cannot get my snapshot into the state where it produces nulls in shard failures. It looks like the snapshot in question took place a week ago. Do you remember, by any chance, what was going on with the cluster during this time? Do you still have log files from that day?
My current best guess is that the index I mentioned we deleted (wsj-iis) was deleted during the backup process and maybe that mucked things up somehow. I can check the logs from that time for more concrete info, but that has to wait until tomorrow when I get back to work :)
Yes, deletion of indices during a snapshot is the first thing I tried. It produces slightly different snapshot info that doesn't contain any nulls. It seems that I am missing some key ingredient here. I am done for today as well, but it would be awesome if you could check the logs tomorrow.
The issue I see is that the code incorrectly assumes that reason is non-null in the case where there is a SnapshotShardFailure. The failure is constructed from a ShardSnapshotStatus object that is in a "failed" state (one of FAILED, ABORTED, MISSING). I see two places where we can possibly have a ShardSnapshotStatus object with a "failed" state and where the "reason" can be null:
1) In SnapshotsInProgress, where a ShardSnapshotStatus can be created without a reason. This can be triggered by adding an assertion reason != null to the SnapshotShardFailure constructor and running the (currently disabled) test testMasterShutdownDuringFailedSnapshot a few times.
2) shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED)) when aborting a snapshot. Here it's more difficult to come up with a scenario, but unless we can rule that one out, I would still consider it an issue.
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?
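To illustrate the kind of change that last suggestion implies: Elasticsearch's StreamOutput#writeString cannot write a null value, while StreamOutput#writeOptionalString and StreamInput#readOptionalString prefix the value with a presence flag so a null survives the round trip. The snippet below is only a minimal, self-contained sketch of that nullable-string pattern in plain java.io (class and method names are mine for illustration), not the actual SnapshotShardFailure serialization code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class OptionalStringDemo {

    // Analogous to StreamOutput#writeOptionalString: write a presence flag, then the value only if present.
    static void writeOptionalString(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    // Analogous to StreamInput#readOptionalString: read the flag, then the value or null.
    static String readOptionalString(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // A null reason is handled here; a non-optional writeUTF/writeString would throw a NullPointerException.
        writeOptionalString(out, null);
        writeOptionalString(out, "IndexNotFoundException[no such index]");

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readOptionalString(in)); // prints: null
        System.out.println(readOptionalString(in)); // prints: IndexNotFoundException[no such index]
    }
}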
Seems like that index was actually deleted a few days later after all so that was probably a red herring.
Ok, there's a huge spike of logs during that snapshot's creation time. I'll try to aggregate what I see as most important:
29 occurrences of
[2017-07-20T14:44:49,461][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-iis-2017.07.11][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:397) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1446) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.shard.IndexShard.acquireIndexCommit(IndexShard.java:836) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-5.4.1.jar:5.4.1]
... 7 more
and 2 of
[2017-07-20T14:44:49,459][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-2017.07.19][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1501) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1494) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_131]
at org.elasticsearch.common.io.Streams.copy(Streams.java:76) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.io.Streams.copy(Streams.java:57) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:100) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1428) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1370) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:967) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
1702 occurrences of the following from one data node:
[2017-07-20T14:51:22,103][WARN ][o.e.c.u.IndexFolderUpgrader] [/mnt/elasticsearch-data-02/nodes/0/indices/8oH-hwzeQAmJR7TZkUxf1w] no index state found - ignoring
and one similar from another host
a spike of ~ 2.5k of those at the start of the snapshot:
[2017-07-20T14:44:48,526][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: NodeClosedException[node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.node.NodeClosedException: node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}
... 15 more
and a similar number of those at the end of the snapshot:
[2017-07-20T14:51:05,408][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: TransportException[transport stopped, action: indices:data/write/bulk[s][p]]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:783) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:247) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/write/bulk[s][p]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
... 5 more
I got 9 of those, with at least one for each node:
[2017-07-20T14:44:47,437][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [zYawxs4] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [Ht8LDxXGQAGEna893aC57w]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:493) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:204) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesInfo(AbstractClient.java:811) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.action.admin.cluster.RestNodesInfoAction.lambda$prepareRequest$0(RestNodesInfoAction.java:109) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:80) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:260) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:199) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.http.netty4.Netty4HttpServerTransport.dispatchRequest(Netty4HttpServerTransport.java:505) ~[transport-netty4-5.4.1.jar:5.4.1]
at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:72) ~[transport-netty4-5.4.1.jar:5.4.1]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:63) ~[transport-netty4-5.4.1.jar:5.4.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.11.Final.jar:4.1.11.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [Ht8LDxX][10.127.1.207:9300] Node not connected
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:630) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:116) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:489) ~[elasticsearch-5.4.1.jar:5.4.1]
... 58 more
3 of those
[2017-07-20T14:44:48,536][WARN ][o.e.x.m.MonitoringService] [Ht8LDxX] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:138) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:167) ~[?:?]
... 27 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:127) ~[?:?]
... 25 more
I'm sure there's more stuff in there, but I don't know if this actually helps you; I can't make sense of it because the log volume is so high, so I may be missing the important entries. Is there something more specific I could search for that would help? Maybe I should just export all Elasticsearch logs for that day and mail them to you?
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?
@ywelsch Yes, fixing it like this would be easy; I just didn't want to assume anything. I wanted a test that creates this problem so we can fix it for sure, which is why I spent some time trying to reproduce it. You are right about it being null in SnapshotsInProgress, and I tried to reproduce it this way, but it looks like it's a completely different path that doesn't get resolved into a shard failure object, so this seems to be a dead end. So I think the ABORTED path is more promising, and after thinking about it for a while, I think the scenario is: a snapshot gets stuck on a master, gets aborted, then another master takes over and somehow generates these nulls. The problem with this scenario is that if a snapshot is aborted, it should be deleted afterwards. So, based on the information that @eirc provided, it looks like it might be a combination of a stuck snapshot and some sort of node failure that prevented the aborted snapshot from being cleaned up, which might be quite difficult to reproduce.
Maybe I should just export all elasticsearch logs for that day and mail them to you?
@eirc that would be very helpful. Thanks!
Just a quick update: @ywelsch and I discussed the issue and came up with a plan for how to modify testMasterShutdownDuringFailedSnapshot to potentially reproduce the issue. I will try implementing it.