We have a pretty detailed report about ingest stats not serializing properly in https://discuss.elastic.co/t/netty4tcpchannel-negative-longs-unsupported-repeated-in-logs/219235/6
What it comes down to is that the number of currently executing processors somehow becomes negative and then fails to serialize because of it (and it obviously shouldn't be negative in the first place):
[2020-02-13T15:56:52,878][WARN ][o.e.t.OutboundHandler ] [ela3] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.95:9300, remoteAddress=/A.B.C.93:56542}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-84034]
at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:299) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:103) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:897) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:255) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:166) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:152) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:199) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:129) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:124) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:104) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
This seems to be caused by a pipeline throwing:
[2020-02-13T17:53:31,229][DEBUG][o.e.a.b.T.BulkRequestModifier] [ela1] failed to execute pipeline [_none] for document [filebeat-7.6.0/_doc/null]
I didn't investigate the deeper cause here, but I'm assuming that on error some path makes too many decrement calls to org.elasticsearch.ingest.IngestMetric#ingestCurrent.
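To illustrate the suspected failure mode, here is a minimal, hypothetical sketch (not the actual IngestMetric code): a "currently executing" gauge that is incremented once per document but decremented twice on an error path ends up negative, which is exactly what later trips the writeVLong call in the stack trace above.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a "current" gauge; names are illustrative only.
class CurrentGauge {
    private final AtomicLong current = new AtomicLong();

    void preIngest()  { current.incrementAndGet(); }   // document starts executing
    void postIngest() { current.decrementAndGet(); }   // document finishes executing

    long current() { return current.get(); }
}

public class DoubleDecrementDemo {
    public static void main(String[] args) {
        CurrentGauge gauge = new CurrentGauge();
        gauge.preIngest();   // one document starts: current == 1
        gauge.postIngest();  // success notification: current == 0
        gauge.postIngest();  // failure notification fires again for the same doc: current == -1
        // A negative value here is what would later fail variable-length
        // (non-negative) serialization of the stat.
        System.out.println(gauge.current());
    }
}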
Pinging @elastic/es-core-features (:Core/Features/Ingest)
The bug here is that in Processor#execute, accept on the handler parameter can be called multiple times if there is a failure, and that accept method decrements the counter:
This explains the over-decrementing, and thus the negative values.
To be clear about what I mean here: line 51 succeeds and handler.accept(result, null) on line 52 is invoked, decrementing the current metric. A subsequent exception is then thrown later in that accept invocation (there's a lot of potential code that executes after that), so we end up in the catch block on line 54 invoking handler.accept(null, e), which calls accept again and leads to a double decrement. It's effectively a double notification problem.
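A self-contained sketch of that double-notification pattern (hypothetical code, not the actual Processor#execute implementation): the handler does the metric bookkeeping, the try block notifies it on success, and if that notification itself throws, the catch block notifies the same handler again, producing the double decrement.

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

public class DoubleNotification {
    static final AtomicLong current = new AtomicLong();

    // Stand-in for a processor execute method with a (result, exception) handler.
    static void execute(String doc, BiConsumer<String, Exception> handler) {
        current.incrementAndGet();
        try {
            String result = doc.toUpperCase();  // the processor's own work succeeds
            handler.accept(result, null);       // first notification: handler decrements, then throws
        } catch (Exception e) {
            handler.accept(null, e);            // second notification: handler decrements again
        }
    }

    public static void main(String[] args) {
        execute("doc-1", (result, e) -> {
            current.decrementAndGet();          // metric bookkeeping inside the handler
            if (e == null) {
                // Simulates downstream code that runs after the decrement and fails.
                throw new RuntimeException("failure after successful accept");
            }
        });
        System.out.println(current.get());      // prints -1 instead of 0
    }
}

A common way to guard against this shape of bug is to wrap the handler so only its first invocation is delivered (e.g. an atomic "already notified" flag); whether that matches the actual fix is not something this sketch claims.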
+1 - I am also seeing this after an upgrade from 7.5.2 to 7.6.0. Although all my data nodes do show up on Stack Monitoring, there are zero metrics for any of them, and the node count does not include the data nodes. I was directed to this issue at the AMA booth of Elastic{ON} Anaheim. Please let me know if there is any information I can provide from my cluster to help.
Thank you, @MakoWish. We're working on a fix for it and we'll ping you if additional information would be helpful.
Reopening this issue, as we still see the stats failures on a negative processors number in 7.6.1, where PR #52543 was merged.
See also the additional information that @srikwit provided on another instance of this bug here: https://github.com/elastic/elasticsearch/issues/62087#issuecomment-689453026