Beats: Feature: CPU and resource throttling on Beats

Created on 14 Oct 2016 · 4 Comments · Source: elastic/beats

To be able to install Beats as a lightweight shipper in any host environment, I would like to know whether resource throttling could be added to the Beats platform, so that a Beat cannot consume all of the host machine's resources.

The proposal is to do this natively in Beats, without any third-party tool or configuration, so that anyone can install Beats as an agent without worrying about resource consumption.

I would also like to ask:

Is this doable in the Beats platform?
What's the effort for this?

discuss enhancement


gmoskovicz

Most helpful comment

We've debated this again, and I still think OS-level solutions (cgroups and/or nice) are better suited to this task, because the kernel has better data on resource usage and needs and can apply the limits more effectively. This means the OS-level solution is a lot less likely to introduce negative side effects.

Happy to reconsider if the OS level solutions don't prove to be applicable or to productize them if they do prove applicable.

tsg on 26 May 2017

👍2

All 4 comments

Something that works today is to limit the Beats to a single CPU core, via the max_procs setting. Of course, that can still be too much, so I understand the feature request.

I guess there are two ways we could go about this (just dumping my thoughts to start a discussion on this):

1. Add a variable sleep where the data is generated (i.e. line reading in Filebeat or packet sniffing in Packetbeat) that we increase or decrease depending on the current CPU usage.
2. Use OS tools to limit the CPU (e.g. cgroups on Linux) but hide that in our programs, so the user doesn't even have to know we're using OS features. This could be done, for example, as a feature of the init script or during the daemonization phase.

Notes:

Option 1 has the advantage that it works on all OSes. Option 2 probably doesn't exist on macOS, and I'm not sure about Windows.
Option 2 has the advantage that it works for all Beats (e.g. not sure how to apply option 1 on Metricbeat)
In Option 1 the feedback loop could be quite long, causing the feature to seem broken. For example, we know that multiline events add a significant CPU overhead (because we need to apply regexps). If these expensive events show from time to time, they will make the CPU usage go above the limit for a while. We will adjust the sleeps, but by the time we applied the new sleeps the expensive events have passed, so now we're way below the limit. I can see this coming back at us as a bug report.
Another example where Option 1 will seem broken: if the Beat needs to do heavy processing that's not synchronous to the events (lines in Filebeat, packets in Packetbeat), for example garbage collection, increasing the sleeps will eventually bring the Beat to a complete halt without reducing the CPU time below the limit.

tsg on 14 Oct 2016

@tsg I'm +1 on Option 1, but I see its disadvantages, and there are probably further technical problems that your notes don't cover. That said, automatic cgroups configuration is a good idea, and we can take a look at Job Objects (https://msdn.microsoft.com/en-us/library/windows/desktop/ms684161(v=vs.85).aspx) as a possible solution for Windows environments.

I believe that hiding this outside the process (managing this at the OS level) is better and prevents possible issues afterwards.

gmoskovicz on 15 Oct 2016

👍2


I created a new meta-issue to track this. Closing this issue for now: https://github.com/elastic/beats/issues/17716