fail2ban-client flushlogs using 100% cpu?

Created on 21 Jul 2017 · 18 comments · Source: fail2ban/fail2ban


Environment:


  • Fail2Ban version (including any possible distribution suffixes): 0.9.3
  • OS, including release name/version: Ubuntu Linux 16.04.2
  • [YES] Fail2Ban installed via OS/distribution mechanisms
  • [NO] Additional foreign patches applied to the codebase
  • [NONE] Customizations done to the configuration (details below if so)

The issue:

I have Fail2ban 0.9.3 running on 6 Ubuntu 16.04 servers. Only on one of them (installed and configured the same as all the others) do I often see CPU usage in Webmin showing 100%. When I look at what is eating up the CPU, I see:

| PID | User | CPU | Command |
| -- | -- | -- | -- |
| 26591 | root | 99.1 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 5979 | root | 98.7 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 16943 | root | 98.2 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 5875 | root | 96.7 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 16039 | root | 80.2 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
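
For reference, the same view Webmin shows can be obtained from a shell; a minimal sketch (not part of the original report):

$ ps -eo pid,user,%cpu,cmd --sort=-%cpu | head   # top CPU consumers, similar to the Webmin process list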


Steps to reproduce

Unknown

Expected behavior

fail2ban-client flushlogs completes quickly without consuming noticeable CPU.

Observed behavior

Several fail2ban-client flushlogs processes using up 100% CPU.

Every morning at 8:25 am I get 3 emails from Fail2ban saying that:

ssh-ddos was started
ssh was started
sshd was started

Any additional information

sudo service fail2ban restart has no effect.

Configuration, dump and another helpful excerpts

Excerpt from fail2ban.log showing some weird errors

2017-07-21 10:33:04,742 fail2ban.server         [16051]: INFO    Stopping all jails
2017-07-21 10:33:05,255 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- stdout: b''
2017-07-21 10:33:05,256 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- stderr: b'iptables: Too many links.\n'
2017-07-21 10:33:05,257 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- returned 1
2017-07-21 10:33:05,257 fail2ban.actions        [16051]: ERROR   Failed to stop jail 'sshd' action 'iptables-multiport': Error stopping action
2017-07-21 10:33:05,257 fail2ban.jail           [16051]: INFO    Jail 'sshd' stopped
2017-07-21 10:33:05,928 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- stdout: b''
2017-07-21 10:33:05,928 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- stderr: b'iptables: Too many links.\n'
2017-07-21 10:33:05,929 fail2ban.action         [16051]: ERROR   iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- returned 1
2017-07-21 10:33:05,929 fail2ban.actions        [16051]: ERROR   Failed to stop jail 'ssh' action 'iptables-multiport': Error stopping action
2017-07-21 10:33:05,929 fail2ban.jail           [16051]: INFO    Jail 'ssh' stopped

_(Screenshot: screen shot 2017-07-21 at 12 32 08 pm)_


All 18 comments

Update: I think I might have found the problem. I forgot, on this one server, to change the logrotate settings for fail2ban to keep logs for 13 months and not compress them. It seems one of my jails was having an issue reading the .gz files. I deleted them and changed my fail2ban logrotate.d settings to:

/var/log/fail2ban.log {
    missingok
    notifempty
    monthly
    rotate 13
    create 640 root adm
    postrotate
        /usr/bin/fail2ban-client set logtarget /var/log/fail2ban.log 2> /dev/null || true
    endscript
}

Will monitor over the next 24 hours and confirm whether this was the cause.
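
A quick way to check the postrotate hook without waiting for the next monthly rotation is to force one; a minimal sketch, assuming the snippet above is saved as /etc/logrotate.d/fail2ban (the path is an assumption):

$ sudo logrotate --force /etc/logrotate.d/fail2ban   # rotate now and run the postrotate script
$ sudo tail -n 5 /var/log/fail2ban.log               # the freshly created log should receive new entries again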

iptables: Too many links.

If you see this error message when stopping a jail, it means that this chain exists at least twice (or is referenced twice somewhere). This can occur, for example, if fail2ban was not stopped properly (killed).
You should then stop fail2ban and clean up iptables before you start it again (e.g. iptables-save > /tmp/iptab.txt, remove all the entries starting with f2b- there, and then iptables-restore < /tmp/iptab.txt).
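
A minimal sketch of that cleanup (the temporary file name is arbitrary, and the sed call assumes no unrelated rules contain the string "f2b-"):

$ sudo service fail2ban stop
$ sudo iptables-save > /tmp/iptab.txt
$ sed -i '/f2b-/d' /tmp/iptab.txt            # drop every rule, chain definition and INPUT jump mentioning f2b-
$ sudo iptables-restore < /tmp/iptab.txt
$ sudo service fail2ban start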

Thanks @sebres. I checked iptables -S and can confirm there are no duplicate jails. I did look through fail2ban.log and saw it was running into issues reading .gz files which were compressed through logrotate. I unzipped all my auth and fail2ban logs and changed their logrotate settings so that fail2ban can read them all. I keep all auth and fail2ban logs for 13 months uncompressed so that I can pick up repeat offenders over a 13 month period with my blacklist filter.

That's the only difference I could spot between this one server and the other 5: I had never modified my logrotate settings for auth and fail2ban on it.

I will monitor and update you.
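
As a side note, rotated logs (compressed or not) can also be scanned for repeat offenders outside of fail2ban; a rough sketch, assuming the default log format where the banned IP is the last field of a "Ban" line:

$ zgrep -h ' Ban ' /var/log/fail2ban.log* | awk '{print $NF}' | sort | uniq -c | sort -rn | head   # top repeat offenders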

@sebres I can confirm this was the case. Fail2ban was failing while trying to read .gz log files. My fix was to gunzip them all and change the logrotate settings for fail2ban.log and auth.log so they are not compressed on rollover. This morning Webmin shows a lovely 0% CPU usage, just idling along as it should be. 👍

Why did you close this?! It seems to me that if fail2ban cannot handle .gz files, it needs to be fixed one way or the other... either it should ignore those files or it should be fixed to support them properly. Because using 100% CPU is not a good result!

fail2ban cannot handle .gz files, it needs to be fixed

Says who? IMHO it is misconfiguration...

  1. If you configured f2b to parse a binary file with a regex (which expects strings, short lines, etc.), 100% CPU is inevitable.
  2. We already have enough issues about handling the log files, so we plan to rewrite the whole handling (see gh-1379).
  3. Handling old (rotated) files is not really preferable for fail2ban (it almost always overflows findtime).
  4. The submitter of the issue confirmed it was resolved.

If you have some idea or solution to handle this, you are welcome to submit a PR; if not, let us decide which issues can be closed and which should get higher priority.

To be sure this is really not an issue, I've tested it (with both major branches, 0.9 (master) and 0.10):

  • all tests were run with 2 variants:

    • 3 jails and the server in HEAVYDEBUG mode (this guarantees continuous writing into fail2ban.log);

    • 3 jails + the recidive jail enabled (server at INFO log level, because DEBUG is not possible while observing recidive)

  • compress enabled in the logrotate script
  • logrotate forced 100 times with compress (and flushlogs in postrotate); see the sketch after this list.
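
Roughly, the forced-rotation part of that test looks like this (the path to the logrotate snippet is an assumption):

$ for i in $(seq 1 100); do sudo logrotate --force /etc/logrotate.d/fail2ban; done   # 100 forced rotations with compression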

As a result: no errors (and no hangs with a busy loop in flushlogs) and a successful rotation:

$ ls /var/log/fail2ban.*
/var/log/fail2ban.log  /var/log/fail2ban.log.1.gz /var/log/fail2ban.log.2.gz ...

Thus my conclusion ATM: no issue (at least in current releases without distro customizations).
BTW, I don't know of any line of code which could cause similar behaviour, whether the logs are compressed or not. Normally the only file of interest to fail2ban is "/var/log/fail2ban.log", as log target (if the file log handler is used) and as input log (if the recidive jail is enabled).
Additionally, I can't imagine how compression could affect fail2ban (server or client) and cause this kind of issue (logrotate also renames the file before compressing it).

@yarikoptic, @fail2ban/maintainers do you know of any distro-related changes (to fail2ban or its configs) that could eventually cause such an issue (Debian, especially Ubuntu Linux 16.04.2)?

In good old 0.9.3 in Ubuntu? We could investigate (not sure how and why .gz files were added to be monitored), but then what? It is the job of an Ubuntu maintainer to fix it in Ubuntu. 0.9.x backports are available for that Ubuntu from NeuroDebian, if someone has the desire. If those show that behaviour, then it might be more reasonable to look 😄

Looking closer, the paths to the logs do not include the .gz files in my case, but I do get the same behavior: 100% CPU usage. One of my VMs ran the flushlogs command for over 500 hours... That's 21 days in a row. I'll be monitoring the computers that run fail2ban to see if it happens again. What should I be looking at to get an idea of where the command is blocked?
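
A couple of generic things to look at while such a process is stuck (a sketch; 26591 stands in for the PID of the stuck fail2ban-client):

$ sudo strace -f -p 26591      # which syscall, if any, the process is looping on
$ sudo ls -l /proc/26591/fd    # which files and sockets it currently has open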

The one I assumed would be wrong is correct. It only uses the last .log file, no .gz involved:

[recidive]
logpath   = /var/log/fail2ban.log
banaction = iptables-allports
bantime   = 604800  ; 1 week
findtime  = 86400   ; 1 day
maxretry  = 5

So what I'm saying is that it could have been blocked on any one of the files, and not specifically on the recidive jail (which accesses the fail2ban.log files specifically).

Could there be a problem if a log file gets rotated while fail2ban-client flushlogs is running?

One of my VMs ran the flushlogs command for over 500 hours...

As I said above - not reproducible at all (with current versions).

Could there be a problem if a log file gets rotated while fail2ban-client flushlogs is running?

Maybe, but I've tried that also... And normally fail2ban-client flushlogs takes a few milliseconds.

Which fail2ban version and distribution do you have? Also 0.9.3 / Ubuntu?

Yes, I'm on the same version of Ubuntu as mitchellkrogza.

It also did not happen last night. I have two clusters with a total of 15 computers plus my own box, and I only saw the problem happen on 2 machines (although I did not verify every single machine, and they have been rebooted since then...).

Thank you for taking some time to look into the matter. I'll investigate further once it happens again (well, hopefully it does not, but since it happened before...)

@AlexisWilke I do agree. Fail2ban used to read .gz files perfectly OK, and I would personally like to see Fail2ban being able to do that again.

Fail2ban being able to do that again

For which purposes would that be good? I mean, why should fail2ban read old (rotated) log files?

@sebres

As Mitchell Krogza mentioned in this comment:

https://github.com/fail2ban/fail2ban/issues/1843#issuecomment-316978699

He wanted to find repeat offenders. That means accessing possibly many fail2ban.log files. It could also be done with any other type of log.

However, I ran into the problem and I do not touch the .gz files, so there is something that can make fail2ban go to 100% CPU after logrotate. I just do not know what it is at this point. But for me it was not the .gz files...

I have also had this 100% CPU usage problem. I am using the latest 0.9 fail2ban. I came across issue #1903 and thought it might be part of the problem. My logrotate script runs "fail2ban-client flushlogs 1>/dev/null". Is that correct, or should I change it to "fail2ban-client flushlogs >/dev/null"? I had to kill the fail2ban-client flushlogs process to temporarily fix it, but it will inevitably arise again at the next rotation (weekly).

@Bulmer, note that the "latest" on Ubuntu and the latest available version of fail2ban are not quite the same. The author says that this is possibly fixed in 0.10.

The other day, I saw one of my computers with both logrotate and fail2ban stuck while _working_ on some file. So needless to say, that computer was slow, with 2 processes using 100% of the two CPUs on that server... that's equivalent to having no CPU resources left! Ouch.

The >/dev/null and 1>/dev/null syntaxes are 100% equivalent so that would make no difference.

However, it should not happen each time. On my end it takes about a month to happen and I logrotate / fail2ban flush on a daily basis.
