I have Fail2ban 0.9.3 running on 6 Ubuntu 16.04 servers. Only on one of them (installed and configured the same as all the others) do I often see CPU usage in Webmin at 100%. When I look at what is eating up the CPU, I see:
| PID | User | CPU | Command |
| --- | --- | --- | --- |
| 26591 | root | 99.1 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 5979 | root | 98.7 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 16943 | root | 98.2 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 5875 | root | 96.7 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
| 16039 | root | 80.2 % | /usr/bin/python3 /usr/bin/fail2ban-client flushlogs |
Summary: `fail2ban-client flushlogs` processes using up 100% CPU; cause unknown.
Every morning at 8:25 am I get three emails from Fail2ban saying that:
- ssh-ddos was started
- ssh was started
- sshd was started

`sudo service fail2ban restart` has no effect.
Excerpt from fail2ban.log showing some weird errors:
```
2017-07-21 10:33:04,742 fail2ban.server [16051]: INFO Stopping all jails
2017-07-21 10:33:05,255 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- stdout: b''
2017-07-21 10:33:05,256 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- stderr: b'iptables: Too many links.\n'
2017-07-21 10:33:05,257 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-sshd
iptables -w -F f2b-sshd
iptables -w -X f2b-sshd -- returned 1
2017-07-21 10:33:05,257 fail2ban.actions [16051]: ERROR Failed to stop jail 'sshd' action 'iptables-multiport': Error stopping action
2017-07-21 10:33:05,257 fail2ban.jail [16051]: INFO Jail 'sshd' stopped
2017-07-21 10:33:05,928 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- stdout: b''
2017-07-21 10:33:05,928 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- stderr: b'iptables: Too many links.\n'
2017-07-21 10:33:05,929 fail2ban.action [16051]: ERROR iptables -w -D INPUT -p tcp -m multiport --dports ssh -j f2b-ssh
iptables -w -F f2b-ssh
iptables -w -X f2b-ssh -- returned 1
2017-07-21 10:33:05,929 fail2ban.actions [16051]: ERROR Failed to stop jail 'ssh' action 'iptables-multiport': Error stopping action
2017-07-21 10:33:05,929 fail2ban.jail [16051]: INFO Jail 'ssh' stopped
```
Update: I think I might have found the problem. On this one server I forgot to change the logrotate settings for fail2ban to keep logs for 13 months and not compress them. It seems one of my jails was having an issue reading the .gz files. I deleted them and changed my fail2ban logrotate.d settings to:
```
/var/log/fail2ban.log {
    missingok
    notifempty
    monthly
    rotate 13
    create 640 root adm
    postrotate
        /usr/bin/fail2ban-client set logtarget /var/log/fail2ban.log 2> /dev/null || true
    endscript
}
```
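To test the new settings without waiting for the monthly rotation, logrotate can be forced against just this snippet (a sketch only; the path /etc/logrotate.d/fail2ban is my assumption):

```sh
# Force a rotation using only this config file.
sudo logrotate -f /etc/logrotate.d/fail2ban
# The rotated file should now appear uncompressed...
ls -l /var/log/fail2ban.log*
# ...and the server should keep writing to the fresh log.
sudo tail -n 5 /var/log/fail2ban.log
```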
I will monitor over the next 24 hours and confirm whether this was the cause.
> iptables: Too many links.

If you see this error message when stopping a jail, it means that this chain exists at least twice (or rather is referenced twice somewhere). This can occur, for example, if fail2ban was not properly stopped (killed).

You should then stop fail2ban and clean up iptables before you start it again (e.g. `iptables-save > /tmp/iptab.txt`, remove all the entries starting with `f2b-` there, and then `iptables-restore < /tmp/iptab.txt`).
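As a concrete sketch of that cleanup, assuming the default `f2b-` chain prefix and a SysV-style service name:

```sh
# Stop the server first so it does not recreate chains mid-cleanup.
sudo service fail2ban stop
sudo iptables-save > /tmp/iptab.txt
# Remove every rule and chain definition that references an f2b- chain.
grep -v 'f2b-' /tmp/iptab.txt > /tmp/iptab.clean.txt
sudo iptables-restore < /tmp/iptab.clean.txt
sudo service fail2ban start
```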
Thanks @sebres. I checked `iptables -S` and can confirm there are no duplicate chains. I did look through fail2ban.log and saw it was running into issues reading .gz files which were compressed by logrotate. I unzipped all my auth and fail2ban logs and changed their logrotate settings so that fail2ban can read them all. I keep all auth and fail2ban logs for 13 months, uncompressed, so that I can pick up repeat offenders over a 13-month period with my blacklist filter.
That's the only difference I could spot on this one server versus the other five: I had never modified its logrotate settings for auth and fail2ban.
I will monitor and update you.
@sebres I can confirm this was the case. Fail2ban was failing while trying to read .gz log files. My fix was to gunzip them all and change the logrotate settings for fail2ban.log and auth.log so they are not compressed on rollover. This morning Webmin shows a lovely 0% CPU usage, just idling along as it should be. 👍
Why did you close this?! It seems to me that if fail2ban cannot handle .gz files, it needs to be fixed one way or the other... either it should ignore those files or it should be fixed to support them properly. Because using 100% CPU is not a good result!
> fail2ban cannot handle .gz files, it needs to be fixed

Says who? IMHO it is a misconfiguration... entries in rotated logs are older than `findtime` anyway. If you have some idea or solution to handle this, a PR is welcome; if not, let us decide which issues can be closed and which should get higher priority.
To be sure this is really not an issue, I've tested it (with both major branches, 0.9 (master) and 0.10).
The result: no error (and no hangs with a busy loop in flushlogs), and a successful rotate:

```
$ ls /var/log/fail2ban.*
/var/log/fail2ban.log  /var/log/fail2ban.log.1.gz  /var/log/fail2ban.log.2.gz ...
```

Thus my conclusion ATM: no issue (at least in current releases without distro customizations).
BTW I don't know of any line of code that could cause similar behaviour, whether the logs are compressed or not. Normally the only log interesting to fail2ban is /var/log/fail2ban.log, as log target (if the file log handler is used) and as input log (if the recidive jail is enabled).
Additionally, I can't imagine how compression could affect fail2ban (server or client) in a way that causes this kind of issue (logrotate also renames the file before compressing it).
@yarikoptic, @fail2ban/maintainers, do you know of any distro-related changes (to fail2ban or its configs) that could eventually cause such an issue (Debian, especially Ubuntu Linux 16.04.2)?
In good old 0.9.3 in Ubuntu? We could investigate (not sure how and why .gz files were added to be monitored), but then what? It is the job of an Ubuntu maintainer to fix it in Ubuntu. 0.9.x backports for that Ubuntu are available from NeuroDebian, if someone has the desire. If those show the same behaviour, then it might be more reasonable to look 😀
Looking closer, in my case the log paths do not include the .gz files, but I do get the same behavior: 100% CPU usage. One of my VMs ran the `flushlogs` command for over 500 hours... That's 21 days in a row. I'll be monitoring the computers that run fail2ban to see if it happens again. What should I be looking at to get an idea of where the command is blocked?
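(A sketch of what one could look at while it is stuck, assuming strace is installed; the PID here is 26591 from the process listing above:)

```sh
# Attach to the spinning client to see which syscall, if any, it loops on.
sudo strace -f -p 26591
# List the files and sockets the process currently holds open.
sudo ls -l /proc/26591/fd
```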
The jail I assumed would be wrong is actually correct; it only uses the current .log file, no .gz involved:
```ini
[recidive]
logpath   = /var/log/fail2ban.log
banaction = iptables-allports
bantime   = 604800 ; 1 week
findtime  = 86400  ; 1 day
maxretry  = 5
```
So what I'm saying is that it could have been blocked on any one of the files, and not specifically on the recidive jail (which reads the fail2ban.log file specifically).
Could there be a problem if a log file gets rotated while `fail2ban-client flushlogs` is running?
> One of my VMs ran the flushlogs command for over 500 hours...

As I said above: not reproducible at all (with current versions).

> Could there be a problem if a log file gets rotated while fail2ban-client flushlogs is running?

Maybe, but I've tried that as well... And normally `fail2ban-client flushlogs` takes a few milliseconds.
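For context, `flushlogs` only asks the running server to close and reopen its log target after rotation; as far as I know a postrotate script can use either of these forms (treat the exact equivalence as my assumption):

```sh
# Ask the server to reopen its log target.
fail2ban-client flushlogs 1>/dev/null
# Roughly the same effect for a file target: re-set the same target,
# which also closes and reopens the file.
fail2ban-client set logtarget /var/log/fail2ban.log
```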
Which fail2ban version and distribution do you have? Also 0.9.3 / Ubuntu?
Yes, I'm on the same version of Ubuntu as mitchellkrogza.
It also did not happen last night. I have two clusters with a total of 15 computers, plus my own box, and I only saw the problem on 2 machines (although I did not verify every single machine, and they have been rebooted since then...).
Thank you for taking some time to look into the matter. I'll investigate further once it happens again (well, hopefully it does not, but since it happened before...)
@AlexisWilke I do agree. Fail2ban used to read .gz files perfectly OK, and I would personally like to see Fail2ban being able to do that again.

> Fail2ban being able to do that again

For which purposes would that be good? I mean, why should fail2ban read old (rotated) log files?
@sebres As Mitchell Krogza mentioned in this comment:
https://github.com/fail2ban/fail2ban/issues/1843#issuecomment-316978699
He wants to find repeat offenders. That means accessing possibly many fail2ban.log files. It could also be done with any other type of log. A sketch of a jail in that spirit follows below.
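For illustration, such a long-window jail might look like this (a sketch only: the jail name, times, and maxretry are my assumptions, not Mitchell's actual config, and it requires the rotated logs to stay uncompressed):

```ini
[repeat-offender]
enabled   = true
filter    = recidive
# Current log plus uncompressed rotations; the glob is expanded at jail start.
logpath   = /var/log/fail2ban.log
            /var/log/fail2ban.log.*
banaction = iptables-allports
findtime  = 34128000 ; ~395 days, i.e. roughly 13 months
bantime   = 34128000
maxretry  = 10
```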
However, I ran into the problem and I do not hit the .gz files, so there is something else that can make fail2ban go to 100% CPU after logrotate. I just do not know what it is at this point. But for me it was not the .gz files...
I have also had this 100% CPU usage problem. I am using the latest 0.9 fail2ban. I came across issue #1903 and thought it may be part of the problem. My logrotate line is `fail2ban-client flushlogs 1>/dev/null`. Is that correct, or should I change it to `fail2ban-client flushlogs >/dev/null`? I had to kill the fail2ban-client flushlogs process to temporarily fix it, but it will inevitably arise again at the next rotation (weekly).
@Bulmer, note that the "latest" on Ubuntu and the latest available version of fail2ban are not quite the same. The author says this is possibly fixed in 0.10.
The other day, I saw one of my computers with both logrotate and fail2ban stuck while _working_ on some file. Needless to say, that computer was slow, with two processes using 100% of the two CPUs on that server... it's equivalent to having no CPU resources left! Ouch.
The `>/dev/null` and `1>/dev/null` syntaxes are 100% equivalent, so that would make no difference.
However, it does not happen on every rotation. On my end it takes about a month to happen, and I run logrotate / the fail2ban flush on a daily basis.
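For the record, the equivalence is plain shell redirection syntax: a bare `>` defaults to file descriptor 1 (stdout), so the two spellings are identical:

```sh
fail2ban-client flushlogs  >/dev/null   # fd 1 implied
fail2ban-client flushlogs 1>/dev/null   # fd 1 explicit
```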