We have a pretty detailed report about ingest stats not serializing properly in https://discuss.elastic.co/t/netty4tcpchannel-negative-longs-unsupported-repeated-in-logs/219235/6
What it comes down to is that the number of currently executing processors somehow becomes negative and then fails to serialize because of it (and it obviously shouldn't be negative in the first place):
[2020-02-13T15:56:52,878][WARN ][o.e.t.OutboundHandler ] [ela3] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.95:9300, remoteAddress=/A.B.C.93:56542}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-84034]
at org.elasticsearch.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:299) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.ingest.IngestStats$Stats.writeTo(IngestStats.java:197) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.ingest.IngestStats.writeTo(IngestStats.java:103) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:897) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:255) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:87) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:64) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:166) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:152) ~[elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:199) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:129) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:124) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:104) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:244) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:240) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) [x-pack-security-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) [x-pack-security-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.6.0.jar:7.6.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.6.0.jar:7.6.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
This seems to be caused by a pipeline throwing:
[2020-02-13T17:53:31,229][DEBUG][o.e.a.b.T.BulkRequestModifier] [ela1] failed to execute pipeline [_none] for document [filebeat-7.6.0/_doc/null]
I didn't investigate the deeper cause here, but I'm assuming that on error some path makes too many decrement calls to org.elasticsearch.ingest.IngestMetric#ingestCurrent.
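To illustrate the suspected failure mode, here is a minimal, hypothetical sketch (not the actual IngestMetric code): a "currently executing" gauge that is incremented once per document but decremented twice on an error path ends up negative, which is exactly what later trips the writeVLong call in the stack trace above.

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a "current" gauge; names are illustrative only.
class CurrentGauge {
    private final AtomicLong current = new AtomicLong();

    void preIngest()  { current.incrementAndGet(); }   // document starts executing
    void postIngest() { current.decrementAndGet(); }   // document finishes executing

    long current() { return current.get(); }
}

public class DoubleDecrementDemo {
    public static void main(String[] args) {
        CurrentGauge gauge = new CurrentGauge();
        gauge.preIngest();   // one document starts: current == 1
        gauge.postIngest();  // success notification: current == 0
        gauge.postIngest();  // failure notification fires again for the same doc: current == -1
        // A negative value here is what would later fail variable-length
        // (non-negative) serialization of the stat.
        System.out.println(gauge.current());
    }
}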
Pinging @elastic/es-core-features (:Core/Features/Ingest)
The bug here is that in Processor#execute, accept on the handler parameter can be called multiple times if there is a failure, and that accept method decrements the counter:
This explains the over-decrementing, and thus the negative values.
To be clear about what I mean here: line 51 succeeds and handler.accept(result, null) on line 52 is invoked, decrementing the current metric. A subsequent exception is then thrown later in that accept invocation (there's a lot of potential code that executes after that), so we end up in the catch block on line 54 invoking handler.accept(null, e), which calls accept again and leads to a double decrement. It's effectively a double notification problem.
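A self-contained sketch of that double-notification pattern (hypothetical code, not the actual Processor#execute implementation): the handler does the metric bookkeeping, the try block notifies it on success, and if that notification itself throws, the catch block notifies the same handler again, producing the double decrement.

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BiConsumer;

public class DoubleNotification {
    static final AtomicLong current = new AtomicLong();

    // Stand-in for a processor execute method with a (result, exception) handler.
    static void execute(String doc, BiConsumer<String, Exception> handler) {
        current.incrementAndGet();
        try {
            String result = doc.toUpperCase();  // the processor's own work succeeds
            handler.accept(result, null);       // first notification: handler decrements, then throws
        } catch (Exception e) {
            handler.accept(null, e);            // second notification: handler decrements again
        }
    }

    public static void main(String[] args) {
        execute("doc-1", (result, e) -> {
            current.decrementAndGet();          // metric bookkeeping inside the handler
            if (e == null) {
                // Simulates downstream code that runs after the decrement and fails.
                throw new RuntimeException("failure after successful accept");
            }
        });
        System.out.println(current.get());      // prints -1 instead of 0
    }
}

A common way to guard against this shape of bug is to wrap the handler so only its first invocation is delivered (e.g. an atomic "already notified" flag); whether that matches the actual fix is not something this sketch claims.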
+1 - I am also seeing this after an upgrade from 7.5.2 to 7.6.0. Although all my data nodes do show up on Stack Monitoring, there are zero metrics for any of them, and the node count does not include the data nodes. I was directed to this issue at the AMA booth of Elastic{ON} Anaheim. Please let me know if there is any information I can provide from my cluster to help.
Thank you, @MakoWish. We're working on a fix for it and we'll ping you if additional information would be helpful.
Reopening this issue, as we still see the stats failures on a negative processors number in 7.6.1, where PR #52543 was merged.
See also the additional information that @srikwit provided on another instance of this bug here: https://github.com/elastic/elasticsearch/issues/62087#issuecomment-689453026