fail2ban slow shutdown (flush, forcestop resp. stop_without_unban)

Created on 23 Oct 2016 · 29 Comments · Source: fail2ban/fail2ban

On high-traffic sites (1000s of requests per minute) fail2ban chokes during shutdown.

Blocking bots on high-traffic sites can generate 1000s of iptables entries, which cannot be deleted individually in a timely fashion. This creates several problems.

1) service fail2ban stop/restart - can require minutes to complete.

2) service fail2ban stop/restart - fails due to timeout expiration on command

3) host reboots - hang for many minutes while many containers are hung in fail2ban, as it attempts to delete each chain sequentially

3 is a real killer, because other services shut down quickly, so connectivity to the websites running in each container is lost while fail2ban waits to delete each iptables rule individually.

Fix is simple.

For all fail2ban chains (beginning with "f2b-") flush the chain, then delete the chain.

This logic change means fail2ban completes its entire teardown/shutdown sequence in seconds rather than minutes.
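
A minimal sketch of that flush-then-delete idea (this is not the Perl script mentioned in this ticket). It assumes the chains carry the `f2b-` prefix and that `iptables -S` output is fed on stdin; the function name `f2b_teardown_cmds` is made up for illustration. It only prints commands, so the plan can be reviewed before anything is executed. Because iptables refuses to delete a chain that is still referenced, the jump rules pointing at `f2b-` chains are removed first.

```shell
#!/bin/sh
# Sketch: print the commands that tear down every f2b-* chain.
# Reads `iptables -S` output on stdin, writes commands to stdout.
# $1 = command name to emit (assumed default: iptables).
f2b_teardown_cmds() {
    awk -v cmd="${1:-iptables}" '
        # A rule in a non-f2b chain that jumps into an f2b chain:
        # turn "-A ..." into "-D ..." so the reference is dropped.
        $1 == "-A" && $2 !~ /^f2b-/ && $0 ~ / -j f2b-[^ ]+$/ {
            sub(/^-A/, "-D")
            print cmd " " $0
            next
        }
        # Remember every f2b chain declared with -N.
        $1 == "-N" && $2 ~ /^f2b-/ { chains[++n] = $2 }
        END {
            # Flush each chain in one call, then delete the empty chain.
            for (i = 1; i <= n; i++) {
                print cmd " -F " chains[i]
                print cmd " -X " chains[i]
            }
        }
    '
}
```

Running `iptables -S | f2b_teardown_cmds` as root prints the plan; appending `| sh` applies it. For IPv6 rules the same pass would be `ip6tables -S | f2b_teardown_cmds ip6tables`.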

The simple Perl script I use for fail2ban teardown usually executes in under 5 seconds, where normal fail2ban teardown can take 5-10+ minutes.

Imagine 64 containers on a machine, where each fail2ban takes 5 minutes to shut down. Since shutdown time is CPU-bound (running iptables), each CPU tends to peg at 100%. If there are 16 threads on the machine, then roughly 100% of the CPU will be eaten up for around...

20 minutes == 5 minutes/fail2ban * 4 (64 fail2ban instances @ 100% CPU / 16 threads).
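
The same back-of-the-envelope estimate as a small shell function. It is only a sketch: it assumes the teardown is purely CPU-bound and spreads evenly over all hardware threads, and the function name `f2b_shutdown_estimate` is made up for illustration.

```shell
#!/bin/sh
# Rough wall-clock estimate (in minutes) for many fail2ban instances
# shutting down at once. Assumes CPU-bound work that parallelises evenly.
# Arguments: number of instances, minutes per instance, hardware threads.
f2b_shutdown_estimate() {
    instances="$1"; minutes_each="$2"; threads="$3"
    # Total CPU-minutes divided by the number of threads, rounded up.
    est=$(( (instances * minutes_each + threads - 1) / threads ))
    # A single instance cannot finish faster than its own teardown time.
    if [ "$est" -lt "$minutes_each" ]; then
        est="$minutes_each"
    fi
    echo "$est"
}

f2b_shutdown_estimate 64 5 16   # prints 20
```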

I've begun wiring my script into systemd, so fail2ban-fast-stop (script) runs before normal fail2ban shutdown.

Happy to provide this script to anyone who would like it. I'll place my fail2ban-fast-stop script somewhere and update this ticket shortly.

enhancement

All 29 comments

This also creates another problem.

Since fail2ban errors out before all rules are deleted, actually removing all rules requires manual intervention: flush the chain, then delete the chain.

I'm not against the suggested solution (it has already been discussed several times, and in theory we can extend our action logic, in particular the iptables actions, with additional parameters like stop_without_unban or forcestop, and extend their actionstop to flush chains before deleting them), but as regards the subject mentioned above (slow shutdown)...

We've completely revised the logic of our ban manager in the 0.10 branch, so it is extremely fast now.
E.g. unbanning 1000 IPs takes about 10-20 seconds for me (ban/unban takes a few ms per IP now).

Again, I'm not against flushing at all, but it requires a lot of changes in the ban manager and action modules (e.g. unban must be prevented from executing when some actions shut down, etc.), so as an interim solution, would a switch to 0.10 be possible?

I tried switching to fail2ban 0.10 once, and the installation polluted the OS-level fail2ban installation. It took me hours to recover.

If there's an installation sequence where I can install in completely separate directory structure... never touching any OS level files... I'm happy to test, otherwise I'll wait till Debian/Ubuntu packages are available.

If there's a 0.10 Ubuntu PPA, drop the link here.

Or installation instructions to install in separate directory structure... including all the python modules, which is what overwrote my OS packages... let me know...

Thanks.

> If there's an installation sequence where I can install in completely separate directory structure... never touching any OS level files...

I'll tell you more - to test it, you do not need to install fail2ban at all.
You can create a standalone test instance parallel to your stock fail2ban (fail2ban works without installation). See https://github.com/sebres/fail2ban/wiki/How-to-test-newer-fail2ban-version-resp.-use-fail2ban-standalone-instance

> like stop_without_unban

Now I like that idea. I do not have many IPs and it still takes my test VMs a good 30 seconds to shut down (although it may be better with 0.10, as you mentioned).

It does not seem useful to unban (unless you are testing.) An unban should only occur when the ban for a host times out in the database. I fail to see why it would be required otherwise.

> I fail to see why it would be required otherwise.

Because the fail2ban API currently requires it, because there are many actions other than iptables that may require an unban (or not allow a flush), and for 10 other reasons...

Just wondering... could some sort of ban cause problems for an upgrade?

Because at this time, if I run an upgrade of fail2ban on my machine, fail2ban stops, removing all the bans, then once the upgrade is done, it restarts. (Assuming the process works without quirks... the slowness may break the stop putting the system in an intermediate state.)

That means, while upgrading, all the bans are gone for _some time_...

Another security related issue because of that bug, just so you have a trace of it...

When someone restarts their computer, it will not have all those IPs in their firewall for about the same amount of time. So if it took something like 20min. to shutdown, it will take 20min. to restart. That means if you start using the computer immediately, you do so without all the IPs you normally have blocked. In most cases, it is probably benign, though. (unless those were for brute force attacks of a login, then for 20 min. they may be free to run their attack.)

Ubuntu Trusty, Fail2ban 0.8.11, 200 banned IPs: it took 1 second to stop the daemon.
Ubuntu Xenial, Fail2ban 0.9.3, 200 banned IPs: it took 40 seconds to stop the daemon because it unbans every banned IP in the chain.

So +1 for stop_without_unban

> We've completely revised the logic of our ban manager in the 0.10 branch, so it is extremely fast now.
> E.g. unbanning 1000 IPs takes about 10-20 seconds for me (ban/unban takes a few ms per IP now).

@sebres I tried the latest 0.10 branch and when starting up, it's still slow -- restoring bans for 2-3 ips per second. Our server load is low, and no swap activities. I even stopped web server so f2b can just concentrate on restoring old bans.

With 100k or even 50k ips to ban, it'll take a very long time. Any ideas? Thanks.

> restoring bans for 2-3 ips per second

Any evidence? :) I mean a log excerpt...

Is the initially proposed solution being implemented? It seems like the smartest and easiest way to go and that's what I end up doing anyway after systemctl fails on timeout. Flush and delete the chains that is. Even as a configuration option it would be nice. I'd be happy to beta test it

@davidfavor please could you share your script?

My fail2ban installation (v0.9.2 as distributed with Plesk) is currently taking about 3hrs to stop / start and growing!

NB. This also happens every time an IP is manually unbanned or added to the trusted IP list (I have tight rules and occasionally need to trust certain client IPs that I know are simply having finger trouble with email clients for example). It then takes 3hrs+ (and '000s of notification emails later) before fail2ban is ready again...

@MadAdaM3 I'm afraid you'll have to download the latest (0.10) and use that until it becomes a full release in your distro.

@AlexisWilke That's not an option as it will not be compatible with the Plesk front end integration. In any event, reading the above it isn't certain that the 0.10 enhancements will solve the issue.

@davidfavor had offered to share his script for fail2ban-fast-stop which should be possible to integrate into the Plesk management scripts, hence I was hoping he could share it with the wider audience?

implemented in #1743, please test (and possibly a PR for more actions:)

@MadAdaM3 Hey bud, were you able to resolve this issue with fail2ban in Plesk? I am having the exact same issue and it's becoming EXTREMELY inconvenient. Not only did Plesk recently decide to increase the license price, but they still haven't taken the time to fix important issues such as this. Smh

Please advise, thanks.

@jxs714 Not exactly. My Plesk still has fail2ban 0.9.6 so we're a way off seeing any performance improvements! What I did to mitigate the issue, was to remove "sendmail-whois-lines" from the jails that have the biggest list of long-term IPs (like recidive) and just use "sendmail" instead - the reverse DNS lookup for each IP was taking a lot of the time. Finally, you can remove the sendmail action on start/stop altogether which makes a huge difference: https://serverfault.com/questions/257439/stop-fail2ban-stop-start-notifications (see the post by mivk):

"In my config, the output is 'sendmail-whois-lines', so that is the file to edit. Assuming your config is under /etc/fail2ban, the full file name is /etc/fail2ban/action.d/sendmail-whois-lines.conf.

However, as Rabin mentions, do not edit that file directly, because it will be overwritten during updates. Instead, create /etc/fail2ban/action.d/sendmail-whois-lines.local (or whatever action.d/file-name.local is right in your config) and add these lines:
[Definition]
actionstart =
actionstop =
"

Hope that helps some.

@MadAdaM3 Hey bud,

Thanks for the response! Weirdly I am not seeing anything mentioning "sendmail-whois-lines" within any of the jails or the fail2ban folder in general. I did create the sendmail-whois-lines.conf and it seems to have not made a difference at all since I cannot start fail2ban after doing so. I am using CentOS 6.9.

Any advice? Please advise, thanks.

@jxs714 The same applies to any "sendmail" or "sendmail-whois" entries in your jails. Are you receiving lots of email whenever you perform a server restart? If you have a "sendmail" entry, you can create a "sendmail.local" file with the 3 lines above to stop getting all the emails every start/stop - and will speed up the process.
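
A small sketch of creating that override from a shell. The three lines written are the ones quoted above and the default directory `/etc/fail2ban/action.d` comes from the quoted post; the function name `f2b_mute_startstop_mail` and its optional second argument are hypothetical. fail2ban only picks the file up after it has been restarted or reloaded.

```shell
#!/bin/sh
# Sketch: write <action>.local with empty actionstart/actionstop so that
# no mail is sent when fail2ban starts or stops.
# $1 = action name (e.g. sendmail or sendmail-whois-lines)
# $2 = action.d directory (assumed default: /etc/fail2ban/action.d)
f2b_mute_startstop_mail() {
    action="$1"
    dir="${2:-/etc/fail2ban/action.d}"
    target="$dir/$action.local"
    # Never clobber an override that already exists.
    if [ -e "$target" ]; then
        echo "refusing to overwrite $target" >&2
        return 1
    fi
    printf '%s\n' '[Definition]' 'actionstart =' 'actionstop =' > "$target"
}
```

For example, `f2b_mute_startstop_mail sendmail` (as root) would create `/etc/fail2ban/action.d/sendmail.local` and leave the shipped `sendmail.conf` untouched.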

@MadAdaM3 Hey bud. Nope, I receive very few email updates. It just seems to be very slow at starting and stopping, as well as at adding banned/trusted IPs. Over time it gets so slow and unresponsive that you can no longer start it. Not sure whether the cause is a huge log file that needs to be cleared, the fail2ban database, or both. Very weird behavior, and very disappointing from a security standpoint.

Please advise, thanks.

Performance Testing

  • 5000 IP addresses banned in jail 'sshd'
  • Try to unban them all and reload the jail
  • VPS with 2 cores and 1 GB RAM

| Version | Unban Method | Clear Action | Time |
| --- | --- | --- | --- |
| 0.9.7 | SQLite command | No - By default | 1069s |
| 0.9.7 (workaround) | SQLite command | Yes | 40s |
| 0.10.1 | reload --unban | No - By default | 78s |

Workaround

  • Any version of fail2ban
  • Using 'iptables-xxxx' action
    ### Step 1 Delete from SQLite database
    > sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 "delete from bans where jail='sshd'"
    ### Step 2 Set actioncheck/actionunban/actionstop to empty
    > fail2ban-client set 'sshd' action 'iptables-allports' actioncheck ''
    > fail2ban-client set 'sshd' action 'iptables-allports' actionunban ''
    > fail2ban-client set 'sshd' action 'iptables-allports' actionstop ''
    ### Step 3 Clear iptables chains
    > iptables -D INPUT -p tcp -j f2b-sshd
    > iptables -F f2b-sshd
    > iptables -X f2b-sshd
    ### Step 4 Reload jail
    > fail2ban-client reload 'sshd'

@cnrat My results deviate a lot from your tests: the 0.10 branch is many times faster than 0.9 in my case (both without actions).
I've now implemented the flush in the database (in addition to the flush in actions) in 2c69c0e7e5cfa6646afea032eb10278d75cab6fc.
Could you please repeat the test with the latest 0.10 in your environment?

@sebres My apologies. The test above included the same step of moving all 5000 IPs into a new iptables chain. I repeated the test with the latest 0.10.1 release.

| Version | Time of reloading only |
| --- | --- |
| 0.9.7 with workaround | 0.856s |
| 0.10.1 with --unban | 1.123s |

1s vs 78s :) it looks like a drastic performance increase 😄

Nice! Take a look... installed successfully on CentOS 6:

yum install python-devel

wget http://download.scopserv.com/dist6/packages/fail2ban/fail2ban-0.10.3.1-1.scopserv.src.rpm

rpmbuild --rebuild fail2ban-0.10.3.1-1.scopserv.src.rpm

/etc/init.d/fail2ban stop

rpm -Uhv /root/rpmbuild/RPMS/noarch/fail2ban-0.10.3.1-1.scopserv.noarch.rpm

/etc/init.d/fail2ban start

Just by the way: don't know for which purposes you would need python-devel here.
IMHO, python (and some modules if needed, like python-systemd or pyinotify) would be enough.

> Just by the way: don't know for which purposes you would need python-devel here.
> IMHO, python (and some modules if needed, like python-systemd or pyinotify) would be enough.

CentOS 7 uses systemd, CentOS 6 doesn't.
