Describe the enhancement:
What we eventually determined was that all Filebeat processing was being blocked by one of the Logstash processes, which was still working through earlier events sent by Filebeat. Logstash was in fact sending periodic status updates back to Filebeat, and these messages were reported in the Filebeat log, although in a manner that was difficult for the user to understand.
More significantly, we also found that a processing block in one Logstash process caused all of the Filebeat worker threads to block. This appears to have been true even though the second Logstash process was available for processing.
As a result of this experience, the following enhancements to the logstash "ACK" messaging to beats have been requested:
(1) Where Beats is sending to multiple Logstash processes, modify the ACK protocol so that only the Beats worker threads connected to the blocked Logstash process are blocked, while the worker threads connected to non-blocked Logstash processes can continue working.
(2) Issue explicit messages in the Beats log file explaining that specific threads are blocked waiting for Logstash to complete processing of previously sent events.
Developer Comments From Support Ticket
This [problem] was somewhat relaxed in Beats 6.x by making the publisher asynchronous and removing the spool as it was. Still, Filebeat needs processing ACKs in order to keep the registry file in a sane state; that is, even Filebeat 6.x might suffer from similar issues here.
Filebeat MUST keep order of events when writing the registry file. As filebeat 6.x supports out-of-order batch publishing, all state updates need to be kept in memory in filebeat.
If a batch is never ACKed, Filebeat would have to accumulate all state updates in memory, eventually going OOM. A limited amount of buffering is already supported, but at some point the queue of state updates fills up. This is where even Filebeat 6.x would start to block.
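To make that concrete, here is a minimal sketch of the ordering constraint, written from the description above rather than taken from the Beats code base (the `ackTracker` type, its `limit`, and the log line are invented for illustration): state updates can only be flushed to the registry over the contiguous prefix of ACKed batches, so a single batch that is never ACKed pins the flush point until the bounded in-memory buffer fills and publishing has to block.

```go
// Minimal illustration (not the real Beats code): state updates are buffered
// per batch and may only be flushed to the registry in publish order. A batch
// that is never ACKed pins the flush point; once the bounded buffer of pending
// updates is full, Push reports that the producer would have to block.
package main

import "fmt"

type ackTracker struct {
	pending map[int]bool // batch ID -> ACKed yet?
	next    int          // oldest batch not yet flushed to the registry
	limit   int          // max batches we are willing to hold in memory
}

// Push registers a newly published batch. It returns false when the in-memory
// buffer is full, i.e. the point at which a real pipeline would block.
func (t *ackTracker) Push(batchID int) bool {
	if len(t.pending) >= t.limit {
		return false // too many unflushed state updates: block the producer
	}
	t.pending[batchID] = false
	return true
}

// Ack marks a batch as ACKed and flushes the contiguous ACKed prefix, which is
// what keeps the registry file in a sane, ordered state.
func (t *ackTracker) Ack(batchID int) {
	t.pending[batchID] = true
	for done, ok := t.pending[t.next]; ok && done; done, ok = t.pending[t.next] {
		fmt.Println("registry: persist offsets up to batch", t.next)
		delete(t.pending, t.next)
		t.next++
	}
}

func main() {
	t := &ackTracker{pending: map[int]bool{}, limit: 3}
	t.Push(0) // sent to the stalled Logstash, never ACKed
	t.Push(1)
	t.Push(2)
	t.Ack(1) // later batches ACK just fine...
	t.Ack(2)
	fmt.Println("can publish batch 3:", t.Push(3)) // ...but batch 0 pins everything: false
}
```

In this toy run, batches 1 and 2 are ACKed but cannot be persisted because batch 0 is still outstanding, and the next publish attempt is refused once the buffer limit is reached.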
One way to resolve this is:
Add a timeout to output workers, duplicating batches to other idle workers once the timeout kicks in. Once a batch is ACKed by one output worker, the other outputs are cancelled/reset, stopping processing of the current batch (a rough sketch follows the list of log events below).
The change in the LB algorithm creates a few events that must be logged:
- resend timeout kicks in
- one worker finally ACKing a duplicate batch
- Batch being cancelled
- follow-up event: cancellation success or ACK received (duplicate events).
- The logger/workers are already aware of the endpoint, so the actual worker+endpoint will be included in these log messages.
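Here is the rough sketch referenced above, purely illustrative and not taken from the Beats code base (the endpoints, the timeout value, and the log wording are all invented): a batch is handed to one output worker, and if no ACK arrives before the resend timeout it is duplicated to another worker; the first ACK wins, the losing attempt is cancelled, and each of the events listed above is logged together with the worker/endpoint.

```go
// Rough sketch of the proposed resend-timeout load balancing (illustrative
// only). A batch is sent to one worker; if the resend timeout expires it is
// duplicated to another worker, the first ACK wins, and the losing attempt is
// cancelled. Every decision is logged with the worker/endpoint involved.
package main

import (
	"context"
	"log"
	"math/rand"
	"time"
)

const resendTimeout = 2 * time.Second

// sendBatch simulates one output worker publishing a batch to its endpoint:
// it "ACKs" after a random latency, or gives up if the context is cancelled.
func sendBatch(ctx context.Context, endpoint string, batch int, acks chan<- string) {
	latency := time.Duration(rand.Intn(5)) * time.Second // pretend endpoint latency
	select {
	case <-time.After(latency):
		acks <- endpoint
	case <-ctx.Done():
		log.Printf("worker %s: batch %d cancelled", endpoint, batch)
	}
}

// publish sends one batch using the resend-timeout strategy described above.
func publish(batch int, endpoints []string) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // first ACK wins; cancel any duplicate still in flight

	acks := make(chan string, len(endpoints))
	go sendBatch(ctx, endpoints[0], batch, acks)

	timer := time.NewTimer(resendTimeout)
	defer timer.Stop()

	next := 1
	for {
		select {
		case ep := <-acks:
			log.Printf("worker %s: batch %d ACKed; duplicates (if any) are being cancelled", ep, batch)
			return
		case <-timer.C:
			if next < len(endpoints) {
				// Resend timeout kicked in: duplicate the batch to an idle worker.
				log.Printf("worker %s: resend timeout for batch %d, duplicating to %s",
					endpoints[0], batch, endpoints[next])
				go sendBatch(ctx, endpoints[next], batch, acks)
				next++
				timer.Reset(resendTimeout)
			}
		}
	}
}

func main() {
	publish(1, []string{"logstash-a:5044", "logstash-b:5044"})
	time.Sleep(100 * time.Millisecond) // give a cancelled duplicate a moment to log
}
```

Depending on the simulated latencies, a run either ACKs on the first worker before the timeout, or logs the resend timeout, the winning ACK from the duplicate, and the cancellation of the slower attempt.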
This strategy guarantees progress and is often used to maximize throughput in dynamically load-balanced systems where some service sees a slowdown.
The disadvantage is the potential for duplicates, but Filebeat already has at-least-once semantics, so this is no change in semantics at all. If an output misbehaves or an IO error occurs, we have to send again anyway.
Related Issues/Enhancement Requests
https://github.com/elastic/apm-server/issues/1298
https://github.com/elastic/beats/issues/8080
https://github.com/elastic/beats/pull/7925
👍 for updating this. We experience this issue in a slightly different way.
For us, the slowdown of a single Logstash node can reduce our overall throughput by far more than what that single node could normally handle on its own. We see this problem whenever backpressure builds to the point of filling up one Logstash node's queues.
When this happens, Filebeat has to wait for the ACK from that Logstash instance on all workers. So, despite the fact that we are running 12 Logstash nodes, we don't get the throughput of all 12. This causes a domino effect in which that one Logstash node's queue remains full, perpetuating the problem until the total volume of input from Filebeat falls significantly below the output rate of Logstash. The sequence of events typically happens like this:
We essentially lose the throughput of all 12 Logstash nodes. When this happens, the total input rate from Filebeat to Logstash also drops, pushing the backpressure from Logstash into Filebeat.
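As a back-of-the-envelope illustration of the comment above (all numbers are hypothetical, not measurements from this ticket): losing one of 12 nodes should only cost roughly 1/12 of the capacity, but if every Filebeat worker ends up gated on the slow node's ACKs, throughput during a stall collapses toward whatever that one backed-up node can still ACK.

```go
// Back-of-the-envelope numbers for the scenario above (all hypothetical):
// 12 Logstash nodes at ~10k events/s each, one of them backed up and only
// ACKing ~500 events/s.
package main

import "fmt"

func main() {
	const nodes, perNode = 12, 10_000.0 // healthy per-node rate (made up)
	const slowNode = 500.0              // rate the backed-up node still ACKs (made up)

	fmt.Printf("nominal cluster throughput:          %.0f events/s\n", nodes*perNode)
	fmt.Printf("expected with one degraded node:     %.0f events/s\n", (nodes-1)*perNode+slowNode)
	fmt.Printf("when every worker waits on its ACKs: ~%.0f events/s\n", slowNode)
}
```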