Elasticsearch version: 5.5.0 (issue first appeared while on 5.4.1)
Plugins installed: [x-pack, repository-s3]
JVM version (java -version): openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system): Linux ip-10-127-1-159 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Trying to get the list of available snapshots on an S3-backed repository fails with a NullPointerException.
curl elasticsearch:9200/_snapshot/long_term/_all
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]"}],"type":"null_pointer_exception","reason":null},"status":500}
Elasticsearch logs:
[2017-07-25T12:01:47,038][WARN ][r.suppressed ] path: /_snapshot/long_term/_all, params: {repository=long_term, snapshot=_all}
org.elasticsearch.transport.RemoteTransportException: [SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]
Caused by: java.lang.NullPointerException
I use curator to take the backups. After taking backups successfully, it fails when it tries to delete old snapshots, because that step also requires a listing:
2017-07-25 11:53:02,191 ERROR Failed to complete action: delete_snapshots. <class 'curator.exceptions.FailedExecution'>: Unable to get snapshot information from repository: long_term. Error: TransportError(500, 'null_pointer_exception', '[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]')
I have a feeling this is due to some kind of timeout. I turned on debug logging, and while I couldn't find a more specific reason for the failure, I noticed it made ~2K requests to S3 before failing, and it stopped at 90 seconds. Is this a configurable timeout?
In the past, getting a list of snapshots took increasingly long, but it eventually responded. Now it breaks earlier than that.
Also posted on the forums: https://discuss.elastic.co/t/nullpointerexception-when-getting-list-of-snapshots-on-s3/94458
Could you paste the full stack trace from the Elasticsearch server logs?
There are no more logs for the null pointer entry. There are a ton of logs for the headers of each of the 2K requests; do you want me to post those? All of those responded with 200 OK though.
These should be the logs from the last request before the null pointer. I tried to censor out any possibly sensitive info. Maybe the returned payload was what triggered the issue?
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAddCookies] CookieSpec selected: default
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAuthCache] Auth cache not set in the context
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestProxyAuthentication] Proxy auth state: UNCHALLENGED
[2017-07-25T12:27:45,437][DEBUG][c.a.h.i.c.SdkHttpClient ] Attempt 1 to execute request
[2017-07-25T12:27:45,437][DEBUG][o.a.h.i.c.DefaultClientConnection] Sending request: GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "x-amz-content-sha256: *********************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-d
ate;x-amz-security-token, Signature=***************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "X-Amz-Date: 20170725T092745Z[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "X-Amz-Security-Token: **********************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "amz-sdk-retry: 0/0/[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "Connection: Keep-Alive[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire ] >> "[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-dat
e;x-amz-security-token, Signature=***************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> X-Amz-Date: 20170725T092745Z
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> X-Amz-Security-Token: **********************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> amz-sdk-retry: 0/0/
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Content-Type: application/octet-stream
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers ] >> Connection: Keep-Alive
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "HTTP/1.1 200 OK[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "x-amz-id-2: ************************[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "x-amz-request-id: 3E117E943CA08991[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Date: Tue, 25 Jul 2017 09:27:46 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "ETag: "8e87c087b7474433ba26057f74233e5a"[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Accept-Ranges: bytes[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Content-Length: 302[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "Server: AmazonS3[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << x-amz-id-2: *************************************
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << x-amz-request-id: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Date: Tue, 25 Jul 2017 09:27:46 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << ETag: "8e87c087b7474433ba26057f74233e5a"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Accept-Ranges: bytes
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Content-Type: application/octet-stream
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Content-Length: 302
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers ] << Server: AmazonS3
[2017-07-25T12:27:45,509][DEBUG][c.a.h.i.c.SdkHttpClient ] Connection can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,509][DEBUG][c.a.request ] Received successful response: 200, AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire ] << "?[0xd7]l[0x17][0x8]snapshot[0x0][0x0][0x0][0x1]DFL[0x0]l[0x92][0xcd]J[0x3]1[0x14][0x85][0xe9][0xc2][0x85][0xe2]SH]t[0xd1]![0xc9]Lg[0xda][0xee]tP[0x17]B[0x17][0xa6][0xed]B[0x90]!4[0x19][0x9a][0xd2]fln[0xd2][0x95]++[0xe]E[0x90]y[0xdc][0xfe]l[0x1c]DQ[0xe8][0x85]lr[0xf8][0xce][0xb9][0xf7][0x90][0xf4][g'[0xfb][0x12][0x8c]x[0x86]i[0xe1][0xf6]k#[0x16][0xea]IHY[0x18]h[0xff][0xaa]mFhB[0x12][0xda]#[0x94][0xc4][0x9d]h[0xed][0xbd][0x96][0xa3][0xbb][0x7];[0xec][0xa6][0xf7]3[0x9e],[0xe5]2b[0x83][0xc7]<[0x1c][0xb2][0xab][0xcd]JY[0xd0][0x85][0xc9][0xb4]l[0x9e][0xe][0xae]?[0xdf][0xb5][0x91]z[0xa2]`[0xcb]^?2W[0xf4];- [0xf5][0x89][0x10][0x91]v02[0xc1]H[0x8a][0x91]=[0x8c]D[0xed]1f?[0x9e][0x1e][0x7][0xec]83[0xe]B[0x82][0xd9][0x19]&v[0xb1][0x95][0xd0][0xee][0x18]I[0xb0][0x9a][0x14][0x9b]NCl:Z[0x13]#)[0xdb][0xbd][0x81][0x13]N[0xdd][0xf2]Q[0x9a][0xde]p[0xbe][0xa9]o[0xd6]eN/[0xd4]e#[0x18][0x9f]_[0xbc][0xbc][0x96][0xca][0xc8]?[0xa1]e[0xaa][0xf]W81[0xcf]`*[0xac][0x84]f[0xa3][0xaa][0xc0]O[0xea][0xe7][0x86][0xdc][0xff][0x13][0xcb]\[0xe8][0xb9][0xb7][0xf5]/[0xd8][0x1d][0xe]_[0x0][0x0][0x0][0xff][0xff][0x3][0x0][0xc0]([0x93][0xe8][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0xf4][0x1f]J[0xbe]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-*************.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-******************.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
FYI we use a coordinating node and 3 data nodes. I send the snapshot requests to the coordinating node, and all the S3 requests seem to originate from the data node that's currently the master (10.127.1.203).
Some more logs:
I see ~1k of these logs 15 seconds after the start of the request and ~500 at the end:
[2017-07-25T12:27:46,968][DEBUG][o.e.s.SearchService ] [SVVyQPF] freeing search context [1977515], time [225057509], lastAccessTime [224978476], keepAlive [30000]
These pop up between requests:
[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
These are the things logged on the master node around the time the coordinating node logged the exception (excluding the freeing search context logs mentioned above):
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-***********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,541][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221514] overhead, spent [106ms] collecting in the last [1s]
[2017-07-25T12:27:47,497][DEBUG][o.e.x.m.a.GetDatafeedsStatsAction$TransportAction] [SVVyQPF] Get stats for datafeed '_all'
[2017-07-25T12:27:47,652][DEBUG][o.e.x.m.e.l.LocalExporter] monitoring index templates and pipelines are installed on master node, service can start
[2017-07-25T12:27:48,542][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221517] overhead, spent [111ms] collecting in the last [1s]
Hmm, I don't see any smoking gun here. I am not really sure how to move forward with this without knowing where this NPE occurs or being able to reproduce this issue locally.
Ok, as I understand it there should have been a stack trace after the "caused by" line, right? Maybe we can look into why that's not present and then we'll have more info for the specific issue? Also there's that r.suppressed thing. That would at least point to the class in which the NPE occurred, but that's not available either. Can I configure something to make that visible?
@eirc you said that
These should be the logs from the last request before the null pointer
but the timestamps of these logs are 12:27, whereas the NPE has a timestamp of 12:01.
Can you provide the full logs from both the master node and the coordinating node? (You can share them in private with us if you don't want to post them publicly)
@eirc, @ywelsch and I discussed this more and we have a couple of other things we would like you to try:
1) could you execute curl elasticsearch:9200/_snapshot/long_term/_all?error_trace=true and see if the stack trace shows up there
2) could you execute curl localhost:9200/_snapshot/long_term/_all on the current master node. And if it works, but still fails when you execute it against a coordinating node we would really appreciate this output as well.
Regarding the time discrepancies, the NPE happens every time I request a listing. At 12:27 I had debug logging on, so that's why most of the logs are from that time. The 12:01 entry was probably from one of the first tests. The same NPE was logged at 12:27 and every time I made a listing request.
Ok now there's some light at the end of the tunnel!
First, if I get the listing from the master node it actually works! Requesting it on the coordinating (or any other) node fails with that same behaviour. Adding error_trace=true to the request finally yields some useful info:
{
"error": {
"root_cause": [{
"type": "remote_transport_exception",
"reason": "[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]",
"stack_trace": "[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\n\tat org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)\n\tat org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)\n\tat org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)\n\tat org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)\n\tat org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)\n\tat org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)\n\tat org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)\n\tat org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)\n\tat org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)\n\tat 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\nCaused by: java.lang.NullPointerException\n"
}],
"type": "null_pointer_exception",
"reason": null,
"stack_trace": "java.lang.NullPointerException\n"
},
"status": 500
}
Here's the formatted stack trace for your convenience:
[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)
at org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)
at org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)
at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)
at org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)
at org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)
at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)
at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)
at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)
Caused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException
@eirc any chance you can email me the output that you get from master? My mail is igor at elastic.co. If not, could you try getting one snapshot at a time on the coordinating node and checking what's different between the snapshots that can be retrieved and the snapshots that cause this NPE?
By the way, does the coordinating node have a different es version?
Just confirmed all Elasticsearch nodes are on 5.5.0. Can I check the plugin versions some way? When I upgraded the stack I remember I had to remove and reinstall the plugins to get the proper versions.
I'll make a script to pull each snapshot individually and see which one(s) are breaking now.
In 5.5.0 all plugins should be 5.5.0; otherwise, Elasticsearch wouldn't work. In any case, based on what we know so far, I don't think it's a plugin-related issue. Our current theory is that the snapshot info serialization code breaks on one or more snapshots that you have in your repository. However, we just reviewed this code and couldn't find any obvious issues. That's why we would like to figure out which snapshot information the master is trying to send to the coordinating node, in order to reproduce and fix the problem.
I emailed you the full snapshot list. My script ~managed to successfully grab each snapshot individually from the coordinating node~ (where grabbing them all failed). I noticed some of the snapshots have some shard failures, but that shouldn't be an issue, right? Maybe it's the size of the response that's the issue here? I have ~2k snapshots and the response is 1.2 MB.
No, scratch that: there is a single snapshot which produces the NPE when I get it on its own.
Here is the JSON I can get from the master but not from other nodes:
{
"snapshots": [
{
"snapshot": "wsj-snapshot-20170720085856",
"uuid": "yIbELYjgQN-_BgjRd4Vb0A",
"version_id": 5040199,
"version": "5.4.1",
"indices": [
"wsj-2017.07.19",
"wsj-iis-2017.07.11",
"wsj-2017.07.08",
"wsj-2017.07.15",
"wsj-2017.07.11",
"wsj-2017.07.12",
"wsj-2017.07.02",
"wsj-2017.07.10",
"wsj-2017.07.06",
"wsj-2017.06.30",
"wsj-2017.07.05",
"wsj-2017.07.14",
"wsj-2017.07.03",
"wsj-2017.07.16",
"wsj-2017.07.17",
"wsj-2017.07.07",
"wsj-2017.07.01",
"wsj-2017.07.09",
"wsj-2017.07.04",
"wsj-2017.07.18",
"wsj-2017.07.13"
],
"state": "PARTIAL",
"start_time": "2017-07-20T08:58:57.243Z",
"start_time_in_millis": 1500541137243,
"end_time": "2017-07-20T11:52:37.938Z",
"end_time_in_millis": 1500551557938,
"duration_in_millis": 10420695,
"failures": [
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 1,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.06",
"index_uuid": "wsj-2017.07.06",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 0,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.01",
"index_uuid": "wsj-2017.07.01",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.05",
"index_uuid": "wsj-2017.07.05",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.05",
"index_uuid": "wsj-2017.07.05",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.06.30",
"index_uuid": "wsj-2017.06.30",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.18",
"index_uuid": "wsj-2017.07.18",
"shard_id": 3,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.01",
"index_uuid": "wsj-2017.07.01",
"shard_id": 4,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 3,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-iis-2017.07.11",
"index_uuid": "wsj-iis-2017.07.11",
"shard_id": 1,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 0,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 3,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.19",
"index_uuid": "wsj-2017.07.19",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.16",
"index_uuid": "wsj-2017.07.16",
"shard_id": 4,
"reason": "IndexNotFoundException[no such index]",
"node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "wsj-2017.07.03",
"index_uuid": "wsj-2017.07.03",
"shard_id": 1,
"reason": null,
"node_id": "eIcWA_QQTByXWDrUlUOFAA",
"status": "INTERNAL_SERVER_ERROR"
}
],
"shards": {
"total": 27,
"failed": 27,
"successful": 0
}
}
]
}
Interestingly, this snapshot includes that wsj-iis-2017.07.11 index, which we then deleted (since, due to the naming, it would get mixed up a lot with the wsj-* indices) and recreated with another name. Those IndexNotFoundException[no such index] messages look weird though, because the mentioned indices do exist, are still on the cluster, and I can query them.
🏆 Deleted the offending snapshot and the listing now works!
If you need any more info on the "bug" itself I'll be happy to provide it. Also, my issue is solved, but I'll leave this open for you to close in case you want to follow the thread deeper.
Thanks @eirc. We have found the line that is causing this NPE. We are just doing some root cause analysis at the moment to see if there is more to it. It's definitely a bug. Thanks a lot for very detailed information and your willingness to work with us on it!
@eirc I spent some time trying to reproduce the issue, but no matter what I try I cannot get my snapshot into the state where it produces nulls in shard failures. It looks like the snapshot in question took place a week ago. Do you remember, by any chance, what was going on with the cluster during this time? Do you still have log files from that day?
My current best guess is that the index I mentioned we deleted (wsj-iis) was deleted during the backup process and maybe that mucked things up somehow. I can check the logs from that time for more concrete info, but that has to wait until tomorrow when I get back to work :)
Yes, deletion of indices during a snapshot is the first thing I tried. It produces slightly different snapshot info that doesn't contain any nulls. It seems that I am missing some key ingredient here. I am done for today as well, but it would be awesome if you could check the logs tomorrow.
The issue I see is that the code incorrectly assumes that reason is non-null in the case where there is a SnapshotShardFailure. The failure is constructed from a ShardSnapshotStatus object that is in a "failed" state (one of FAILED, ABORTED, MISSING). I see two places where we can possibly have a ShardSnapshotStatus object with a "failed" state and where the "reason" can be null:
1) In SnapshotsInProgress, where a ShardSnapshotStatus can be created without a reason. This can be triggered by adding an assertion reason != null to the SnapshotShardFailure constructor and running the (currently disabled) test testMasterShutdownDuringFailedSnapshot a few times.
2) shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED)) when aborting a snapshot. Here it's more difficult to come up with a scenario, but unless we can rule that one out, I would still consider it an issue.
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?
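To illustrate the kind of change that last suggestion implies: Elasticsearch's StreamOutput#writeString cannot write a null value, while StreamOutput#writeOptionalString and StreamInput#readOptionalString prefix the value with a presence flag so a null survives the round trip. The snippet below is only a minimal, self-contained sketch of that nullable-string pattern in plain java.io (class and method names are mine for illustration), not the actual SnapshotShardFailure serialization code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class OptionalStringDemo {

    // Analogous to StreamOutput#writeOptionalString: write a presence flag, then the value only if present.
    static void writeOptionalString(DataOutputStream out, String value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeUTF(value);
        }
    }

    // Analogous to StreamInput#readOptionalString: read the flag, then the value or null.
    static String readOptionalString(DataInputStream in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // A null reason is handled here; a non-optional writeUTF/writeString would throw a NullPointerException.
        writeOptionalString(out, null);
        writeOptionalString(out, "IndexNotFoundException[no such index]");

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println(readOptionalString(in)); // prints: null
        System.out.println(readOptionalString(in)); // prints: IndexNotFoundException[no such index]
    }
}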
Seems like that index was actually deleted a few days later after all so that was probably a red herring.
Ok, there's a huge spike of logs during that snapshot's creation time. I'll try to aggregate what I see as most important:
29 occurrences of
[2017-07-20T14:44:49,461][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-iis-2017.07.11][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:397) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1446) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.index.shard.IndexShard.acquireIndexCommit(IndexShard.java:836) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-5.4.1.jar:5.4.1]
... 7 more
and 2 of
[2017-07-20T14:44:49,459][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-2017.07.19][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1501) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1494) ~[elasticsearch-5.4.1.jar:5.4.1]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_131]
at org.elasticsearch.common.io.Streams.copy(Streams.java:76) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.io.Streams.copy(Streams.java:57) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:100) ~[?:?]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1428) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1370) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:967) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
1702 occurrences of the following from one data node:
[2017-07-20T14:51:22,103][WARN ][o.e.c.u.IndexFolderUpgrader] [/mnt/elasticsearch-data-02/nodes/0/indices/8oH-hwzeQAmJR7TZkUxf1w] no index state found - ignoring
and one similar from another host
a spike of ~ 2.5k of those at the start of the snapshot:
[2017-07-20T14:44:48,526][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: NodeClosedException[node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.node.NodeClosedException: node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}
... 15 more
and a similar number of those at the end of the snapshot:
[2017-07-20T14:51:05,408][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: TransportException[transport stopped, action: indices:data/write/bulk[s][p]]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:783) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:247) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/write/bulk[s][p]
at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
... 5 more
I got 9 of those, with at least one for each node:
[2017-07-20T14:44:47,437][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [zYawxs4] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [Ht8LDxXGQAGEna893aC57w]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:493) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:204) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesInfo(AbstractClient.java:811) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.action.admin.cluster.RestNodesInfoAction.lambda$prepareRequest$0(RestNodesInfoAction.java:109) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:80) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:260) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:199) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.http.netty4.Netty4HttpServerTransport.dispatchRequest(Netty4HttpServerTransport.java:505) ~[transport-netty4-5.4.1.jar:5.4.1]
at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:72) ~[transport-netty4-5.4.1.jar:5.4.1]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:63) ~[transport-netty4-5.4.1.jar:5.4.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.11.Final.jar:4.1.11.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [Ht8LDxX][10.127.1.207:9300] Node not connected
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:630) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:116) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:489) ~[elasticsearch-5.4.1.jar:5.4.1]
... 58 more
3 of those
[2017-07-20T14:44:48,536][WARN ][o.e.x.m.MonitoringService] [Ht8LDxX] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:138) ~[?:?]
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:167) ~[?:?]
... 27 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:127) ~[?:?]
... 25 more
I'm sure there's more stuff in there, but I don't know if this actually helps you; I can't make sense of it because the log volume is so high, so I may be missing the important entries. Is there something more specific I could search for that would help? Maybe I should just export all Elasticsearch logs for that day and mail them to you?
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?
@ywelsch Yes, fixing it like this would be easy; I just didn't want to assume anything. I wanted a test that creates this problem so we can fix it for sure, which is why I spent some time trying to reproduce it. You are right about it being null in SnapshotsInProgress, and I tried to reproduce it this way, but it looks like it's a completely different path that doesn't get resolved into a shard failure object, so this seems to be a dead end. So I think the ABORTED path is more promising, and after thinking about it for a while, I think the scenario is: a snapshot gets stuck on a master, gets aborted, then another master takes over and somehow generates these nulls. The problem with this scenario is that if a snapshot is aborted, it should be deleted afterwards. So, based on the information that @eirc provided, it looks like it might be a combination of a stuck snapshot and some sort of node failure that prevented the aborted snapshot from being cleaned up, which might be quite difficult to reproduce.
Maybe I should just export all elasticsearch logs for that day and mail them to you?
@eirc that would be very helpful. Thanks!
Just a quick update: @ywelsch and I discussed the issue and came up with a plan for how to modify testMasterShutdownDuringFailedSnapshot to potentially reproduce the issue. I will try implementing it.