Beats: [Auditbeat] Avoid having Linux wait on clearing a backlog

Created on 22 May 2018 · 10 Comments · Source: elastic/beats

Back-pressure from Auditbeat is propagated to the kernel via the unicast netlink socket buffer and can cause delays in the kernel. The propagation of back-pressure was implemented with the assumption that the kernel drops messages when the backlog queue is full. This assumption is true, but it has an unwanted side effect. When the backlog queue is full, the kernel will wait for the queue to "drain a little" before providing a buffer to the waiting auditable syscall. If the queue doesn't free up, the kernel will log a warning and continue with the syscall.

The waiting period is defined by the audit_backlog_wait_time variable. Prior to v3.14 the variable was not configurable. Then in v3.14 a commit was made to make this configurable through the audit system.

We need to make two changes for Auditbeat:

- https://github.com/elastic/go-libaudit/issues/34 - For Linux 3.14+, set the backlog_wait_time to 0 by default to ensure that Auditbeat doesn't cause any blocking.
- Modify Auditbeat such that the socket-reading goroutine does not block when the output is blocked (e.g. by back-pressure from the publisher pipeline, which can be mitigated by spooling to disk) or when processing of events cannot keep up with the rate from the kernel.

For confirmed bugs, please report:

Version: 6.2.x - 6.3.0
Operating System: Linux
Discuss Forum URL: https://discuss.elastic.co/t/auditbeat-impacting-system-performance/131290
Steps to Reproduce:
- Enable the auditd module in unicast mode.
- Audit some high volume syscalls.
- Block the output in some way (bring down LS) or suspend the Auditbeat process.
- Wait for the kernel's audit_backlog_limit to be exceeded. (Messages will start showing up in the kernel log with "audit: backlog limit exceeded". The message is rate limited.)
- Syscalls that are auditable will wait for the audit_backlog_wait_time period.

Workarounds:

If you have kernel v3.14 or newer and the auditd package installed then you can manually set the audit_backlog_wait_time to 0 with sudo auditctl --backlog_wait_time 0.
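
Whether this workaround applies depends on the kernel version. As a rough sketch, a tool could check the release string (as printed by `uname -r`) before attempting to set the value. The function name `supportsBacklogWaitTime` is hypothetical, and the check only compares version numbers, so it cannot detect whether a distribution has backported (or omitted) the feature.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// supportsBacklogWaitTime reports whether a kernel release string such as
// "3.10.0-862.el7.x86_64" is version 3.14 or newer. It is a hypothetical
// helper that looks only at the major and minor version numbers.
func supportsBacklogWaitTime(release string) bool {
	parts := strings.SplitN(release, ".", 3)
	if len(parts) < 2 {
		return false
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false
	}
	// The minor field may carry a suffix (e.g. "14-rc1"); keep leading digits only.
	minorDigits := parts[1]
	notDigit := func(r rune) bool { return r < '0' || r > '9' }
	if i := strings.IndexFunc(minorDigits, notDigit); i >= 0 {
		minorDigits = minorDigits[:i]
	}
	minor, err := strconv.Atoi(minorDigits)
	if err != nil {
		return false
	}
	return major > 3 || (major == 3 && minor >= 14)
}

func main() {
	releases := []string{
		"2.6.32-754.el6.x86_64",
		"3.10.0-862.el7.x86_64",
		"4.15.0-20-generic",
	}
	for _, r := range releases {
		fmt.Printf("%s: %v\n", r, supportsBacklogWaitTime(r))
	}
}
```

The example release strings are illustrative; only the last one is reported as new enough.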

Auditbeat bug

andrewkroh

👍4

All 10 comments

https://bugzilla.redhat.com/show_bug.cgi?id=1437426 deep sigh - not supported even on CentOS 7 :(

~~Remains to be tested, but I found some references in Red Hat docs for RHEL 7 to this option: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-defining_audit_rules_and_controls~~

~~CentOS 7 servers we have are on 3.10 kernel, so I think this might have been back-ported.~~

Edit: after further investigation, this seems like it has been ... discontinued?

dilchenko on 23 May 2018

😕1

@dilchenko Even the latest CentOS version 7.5 doesn't have this feature.

kholia on 23 May 2018

On Ubuntu 18.04 LTS the default value of the backlog_wait_time option is 15000.

kholia on 23 May 2018

I’ve hit exactly this issue in a production environment today. Backpressure in Logstash caused Auditbeat to stop reading its buffer, which made the kernel on multiple machines grind to a halt. Because Auditbeat was also running on the Logstash receiver box, it caused a cascading failure: our Logstash box became unresponsive as well. Manual intervention was required to get the Logstash box up and running again, after which everything recovered.

praseodym on 23 May 2018

Also: ~I haven’t tested this yet, but maybe~ using socket_type: multicast could be another workaround? Update: this seems to be working.

praseodym on 23 May 2018

The reason I researched the RHEL7 status is that this is the latest version of a major distribution, and it does not have support for backlog_wait_time. That means auditbeat is currently capable of basically breaking things in production. In retrospect, I am glad we noticed it right away on a busy box with a specific syscall volume/pattern. If this had happened during, say, a traffic spike, I would have been seriously confused: based on the [fairly extensive] telemetry we collect, I would have had a hard time connecting the ill effects of back-pressure to their cause (the audit framework being blocking-ish).

To put it differently: we are offering a product that is guaranteed to break production systems for any customer not running a 3.14+ kernel with that setting tuned. I just want to re-emphasize the importance of this issue.

Some time soon, I will get to testing this on our systems with higher limits for rate/backlog. But we won't be able to roll auditbeat out without a workaround for the waiting issue because we would be running a chance of breaking production. The workaround needs to be suitable for use on 2.6 or, at least, 3.10.xxx versions of the kernel. AFAIU, RHEL7 will stick to the 3.10 kernel, so the best case is that they backport the backlog_wait_time support.

dilchenko on 23 May 2018

👍2

RHEL/CentOS 6 is not EOL until November 2020, so we'll be stuck with kernel 2.6 for another while as well.

Audit netlink multicast is supported since kernel 3.16 so that's probably not in RHEL7 either.

praseodym on 23 May 2018

Can we get this cherry picked to 6.x as well?

praseodym on 5 Jun 2018

👍3

Noting for posterity: This was released with auditbeat 6.4.0.

jordansissel on 10 Jan 2019

Just curious. Although this was closed, I wonder what the best approach is for systems that don't support audit_backlog_wait_time (basically all RHEL/CentOS 7 versions). Is dropping events in userspace the recommended approach? Also, considering how widely used RHEL/CentOS 7 are, wouldn't it be preferable if some in-memory or on-disk temporary cache was added as an option to handle this scenario?

ossie-git on 12 Sep 2019
