Describe the enhancement:
What we eventually determined was that all Filebeat processing was being blocked by one of the Logstash processes, which was still working through earlier events sent by Filebeat. Logstash was in fact sending periodic status updates back to Filebeat, and these messages were reported in the Filebeat log, although in a manner that was difficult for the user to understand.
More significantly, we also found that a processing block in one Logstash process caused all of the Filebeat worker threads to block. This appears to have been true even though the second Logstash process was available for processing.
As a result of this experience, the following enhancements to the logstash "ACK" messaging to beats have been requested:
(1) Where Beats is sending to multiple Logstash processes, modify the ACK protocol so that only the Beats worker threads connected to the blocked Logstash process are blocked, while the worker threads connected to non-blocked Logstash processes can continue working.
(2) Issue explicit messages in the Beats log file explaining that specific threads are blocked waiting for Logstash to complete processing of previously sent events.
Developer Comments From Support Ticket
This [problem] was somewhat relaxed in Beats 6.x by making the publisher asynchronous and removing the spool as it was. Still, Filebeat needs processing ACKs in order to keep the registry file in a sane state; that is, even Filebeat 6.x might suffer from similar issues here.
Filebeat MUST keep order of events when writing the registry file. As filebeat 6.x supports out-of-order batch publishing, all state updates need to be kept in memory in filebeat.
If a batch is never ACKed, Filebeat would have to accumulate all state updates in memory, eventually going OOM. A limited amount of buffering is already supported, but at some point the queue of state updates fills up. This is where even Filebeat 6.x would start to block.
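To make that concrete, here is a minimal sketch of the ordering constraint, written from the description above rather than taken from the Beats code base (the `ackTracker` type, its `limit`, and the log line are invented for illustration): state updates can only be flushed to the registry over the contiguous prefix of ACKed batches, so a single batch that is never ACKed pins the flush point until the bounded in-memory buffer fills and publishing has to block.

```go
// Minimal illustration (not the real Beats code): state updates are buffered
// per batch and may only be flushed to the registry in publish order. A batch
// that is never ACKed pins the flush point; once the bounded buffer of pending
// updates is full, Push reports that the producer would have to block.
package main

import "fmt"

type ackTracker struct {
	pending map[int]bool // batch ID -> ACKed yet?
	next    int          // oldest batch not yet flushed to the registry
	limit   int          // max batches we are willing to hold in memory
}

// Push registers a newly published batch. It returns false when the in-memory
// buffer is full, i.e. the point at which a real pipeline would block.
func (t *ackTracker) Push(batchID int) bool {
	if len(t.pending) >= t.limit {
		return false // too many unflushed state updates: block the producer
	}
	t.pending[batchID] = false
	return true
}

// Ack marks a batch as ACKed and flushes the contiguous ACKed prefix, which is
// what keeps the registry file in a sane, ordered state.
func (t *ackTracker) Ack(batchID int) {
	t.pending[batchID] = true
	for done, ok := t.pending[t.next]; ok && done; done, ok = t.pending[t.next] {
		fmt.Println("registry: persist offsets up to batch", t.next)
		delete(t.pending, t.next)
		t.next++
	}
}

func main() {
	t := &ackTracker{pending: map[int]bool{}, limit: 3}
	t.Push(0) // sent to the stalled Logstash, never ACKed
	t.Push(1)
	t.Push(2)
	t.Ack(1) // later batches ACK just fine...
	t.Ack(2)
	fmt.Println("can publish batch 3:", t.Push(3)) // ...but batch 0 pins everything: false
}
```

In this toy run, batches 1 and 2 are ACKed but cannot be persisted because batch 0 is still outstanding, and the next publish attempt is refused once the buffer limit is reached.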
One way to resolve this is:
Add a timeout to output workers, duplicating batches to other idle workers once the timeout kicks in. Once a batch is ACKed by one output worker, the other outputs are cancelled/reset, stopping processing of the current batch (a rough sketch follows the list of log events below).
The change in the LB algorithm creates a few events that must be logged:
- resend timeout kicks in
- one worker finally ACKing a duplicate batch
- Batch being cancelled
- follow-up event: cancellation success or ACK received (duplicate events).
- The logger/workers are already aware of the endpoint, so the actual worker+endpoint will be included in these log messages.
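Here is the rough sketch referenced above, purely illustrative and not taken from the Beats code base (the endpoints, the timeout value, and the log wording are all invented): a batch is handed to one output worker, and if no ACK arrives before the resend timeout it is duplicated to another worker; the first ACK wins, the losing attempt is cancelled, and each of the events listed above is logged together with the worker/endpoint.

```go
// Rough sketch of the proposed resend-timeout load balancing (illustrative
// only). A batch is sent to one worker; if the resend timeout expires it is
// duplicated to another worker, the first ACK wins, and the losing attempt is
// cancelled. Every decision is logged with the worker/endpoint involved.
package main

import (
	"context"
	"log"
	"math/rand"
	"time"
)

const resendTimeout = 2 * time.Second

// sendBatch simulates one output worker publishing a batch to its endpoint:
// it "ACKs" after a random latency, or gives up if the context is cancelled.
func sendBatch(ctx context.Context, endpoint string, batch int, acks chan<- string) {
	latency := time.Duration(rand.Intn(5)) * time.Second // pretend endpoint latency
	select {
	case <-time.After(latency):
		acks <- endpoint
	case <-ctx.Done():
		log.Printf("worker %s: batch %d cancelled", endpoint, batch)
	}
}

// publish sends one batch using the resend-timeout strategy described above.
func publish(batch int, endpoints []string) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // first ACK wins; cancel any duplicate still in flight

	acks := make(chan string, len(endpoints))
	go sendBatch(ctx, endpoints[0], batch, acks)

	timer := time.NewTimer(resendTimeout)
	defer timer.Stop()

	next := 1
	for {
		select {
		case ep := <-acks:
			log.Printf("worker %s: batch %d ACKed; duplicates (if any) are being cancelled", ep, batch)
			return
		case <-timer.C:
			if next < len(endpoints) {
				// Resend timeout kicked in: duplicate the batch to an idle worker.
				log.Printf("worker %s: resend timeout for batch %d, duplicating to %s",
					endpoints[0], batch, endpoints[next])
				go sendBatch(ctx, endpoints[next], batch, acks)
				next++
				timer.Reset(resendTimeout)
			}
		}
	}
}

func main() {
	publish(1, []string{"logstash-a:5044", "logstash-b:5044"})
	time.Sleep(100 * time.Millisecond) // give a cancelled duplicate a moment to log
}
```

Depending on the simulated latencies, a run either ACKs on the first worker before the timeout, or logs the resend timeout, the winning ACK from the duplicate, and the cancellation of the slower attempt.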
This strategy guarantees progress and is often used to maximize throughput in dynamically load-balanced systems where some service sees a slowdown.
The disadvantage is the potential for duplicates, but Filebeat already has at-least-once semantics, so this is no change in semantics at all. If an output misbehaves or an IO error occurs, we have to send again anyway.
Related Issues/Enhancement Requests
https://github.com/elastic/apm-server/issues/1298
https://github.com/elastic/beats/issues/8080
https://github.com/elastic/beats/pull/7925
👍 for updating this. We experience this issue in a slightly different way.
For us, the slowdown of a single Logstash node can reduce our overall throughput by far more than what that single node could normally handle on its own. We see this problem whenever backpressure builds to the point of filling up one Logstash node's queues.
When this happens, Filebeat has to wait for the ACK from that Logstash instance on all workers. So, despite the fact that we are running 12 Logstash nodes, we don't get the throughput of all 12. This causes a domino effect in which that one Logstash node's queue remains full, perpetuating the problem until the total volume of input from Filebeat falls significantly below the output rate of Logstash. The sequence of events typically happens like this:
We essentially lose the throughput of all 12 Logstash nodes. When this happens, the total input rate from Filebeat to Logstash also drops, pushing the backpressure from Logstash into Filebeat.
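As a back-of-the-envelope illustration of the comment above (all numbers are hypothetical, not measurements from this ticket): losing one of 12 nodes should only cost roughly 1/12 of the capacity, but if every Filebeat worker ends up gated on the slow node's ACKs, throughput during a stall collapses toward whatever that one backed-up node can still ACK.

```go
// Back-of-the-envelope numbers for the scenario above (all hypothetical):
// 12 Logstash nodes at ~10k events/s each, one of them backed up and only
// ACKing ~500 events/s.
package main

import "fmt"

func main() {
	const nodes, perNode = 12, 10_000.0 // healthy per-node rate (made up)
	const slowNode = 500.0              // rate the backed-up node still ACKs (made up)

	fmt.Printf("nominal cluster throughput:          %.0f events/s\n", nodes*perNode)
	fmt.Printf("expected with one degraded node:     %.0f events/s\n", (nodes-1)*perNode+slowNode)
	fmt.Printf("when every worker waits on its ACKs: ~%.0f events/s\n", slowNode)
}
```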