Fail2ban: extremely high memory usage due to a very large IP list (500K - 1M IPs)

Created on 26 Apr 2018 · 32 comments · Source: fail2ban/fail2ban

I'm trying to set up a fail2ban configuration with an expected 500,000 - 1M banned IPs. Unfortunately, every version available for download so far simply explodes in RAM usage on a box with 8 GB of RAM. So I tried to find some solutions and came across several posts mentioning new options like:

nametoip_cache=
iptoname_cache=
garbage=

Unfortunately there is zero documentation on how to use these. Consequently, the only thing I get is:

ERROR: File contains no section headers.
file: /etc/fail2ban/fail2ban.local, line: 1
'nametoip_cache=\n'
 Init of command line failed

I was not able to find a demo configuration with ALL parameters at least named in their corresponding sections. What does "threshold" mean in terms of garbage?
The DB usage also seems quite flaky; my db grew to:
-rw------- 1 root root 49871447040 Apr 25 12:37 fail2ban.sqlite3
which I'd say is odd. Even if it contained 1M IPs and some dates, it should be a LOT smaller. Is there a way to strip the db input down to an absolute minimum?
And what about these options:

# use memory for the transaction logging:
journal_mode = memory
# temporary tables and indices are kept in memory:
temp_store = memory

How to tell _not_ to use memory?

All 32 comments

500,000 - 1M banned IPs

Persistent or a very long ban-time? Do you really need it? Normally, if some IPs get banned for several days, they never or only very rarely come back.

nametoip_cache and iptoname_cache ...

Neither exists anymore, but you don't need to worry about them, because both caches could store at most 1000 entries (and hold entries for the last 5 minutes). They are interim caches, just to avoid multiple DNS lookups for the same IP on each failure.

garbage

It will be called explicitly (and auto-garbage collection is also enabled in the latest versions of 0.10/0.11).

The DB usage seems also quite flaky, my db grew to 49 GB.

Well, it looks like the dbpurge issue (#1267); see https://github.com/fail2ban/fail2ban/issues/2045#issuecomment-364564285 for how you could manually start the purge process.
If your log-rotate works fine and the logs are not too large, you could also delete this file (the latest log-position of each jail is stored there as well).
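
Before deleting anything, it can help to look at what the database actually holds. A minimal sketch, assuming the default db path and the stock bans table (adjust the path to your dbfile setting):

# show file size and per-jail row counts:
ls -lh /var/lib/fail2ban/fail2ban.sqlite3
sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 \
  'SELECT jail, COUNT(*) FROM bans GROUP BY jail;'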

Even if it contained 1M IPs and some dates it should be a LOT smaller

It holds not only the IPs, but all the info of each failure (log-line, time of ban, ban-time, etc.).
And if dbpurge does not work, you have all the tickets there since fail2ban first started.

journal_mode and temp_store
How to tell not to use memory?

These are currently still not implemented (#1437), so it's impossible to disable them; but our transactions are very short, and SQLite does this while observing current memory usage, so it will flush often if necessary.
Anyway, I think it would be the wrong knob for your case...

Because ATM fail2ban holds all the failures (matched log-lines) together with other IP-related information in its fail-manager (each failure for findtime seconds) and ban-manager lists (as long as the ban continues).
This could rather be the reason for such memory consumption (and I would like to fix it sooner rather than later).

As an interim solution: try 0.11 with bantime.increment = true (and without persistent bans).
You'll see that the count of IPs will be reduced distinctly.

Thank you very much for such a quick answer. Let me explain the idea behind the banning:

We are observing some IPs, and some ports on every one of them. The jails should explain it (only the important fields are shown):

[honeypot]
logpath = /log/honeypot.log
maxretry = 1
bantime = 7200
findtime = 7200

[honeypot-repeater]
enabled = true
logpath = /log/fail2ban/fail2ban.log
maxretry = 2
bantime = 86400
findtime = 86400

[honeypot-dead]
logpath = /log/fail2ban/fail2ban.log
maxretry = 3
bantime = 7776000
findtime = 7776000

Which means:
First strike: banned for 7200 secs (2 hours).
Second time: banned for 86400 secs (1 day).
Third time: banned for quite a long time (7776000 secs = 90 days) :-)

So we need a long-lasting db, but we are not interested at all in the logs that led to the ban, because we know why.
How can we shorten the db in this use case and minimise RAM usage?

We cannot find any hints about this. If you want us to test 0.11, where can we download that?

Regards

How can we shorten the db in this use case and minimise RAM usage?

dbpath='/var/lib/fail2ban/fail2ban.sqlite3'
# purge:
?sudo? python -c "dbpath='$dbpath'; import sys, logging; logging.basicConfig(stream=sys.stdout, level=logging.DEBUG); from fail2ban.server.database import Fail2BanDb; db = Fail2BanDb(dbpath); db.purge()"
# vacuum to shrink size:
?sudo? sqlite3 "$dbpath" 'VACUUM;'

This will remove all old entries.

If you want us to test 0.11, where can we download that?

0.11.zip or 0.11.tar.gz

Note:

[honeypot]
bantime = 2h
# incremental enabled:
bantime.increment = true
# factor 12 means the ban time grows 12-fold with each next ban, so 2h, 1d, 12d, ... (sketched below)
bantime.factor = 12
# ban time grows up to 30 days:
bantime.maxtime = 30d
...
  • you could also increase the initial value of maxretry (to avoid sporadic bans of some "good" IPs) - for IPs already known as "bad", fail2ban shrinks maxretry with each new attempt (after each ban);
  • by default, the fail2ban version downloaded from GitHub is Debian-based, so you should check/adjust your configs (compare distribution-related paths); at least the following should be changed to the proper value for your distribution:
    https://github.com/fail2ban/fail2ban/blob/ffd6b9f6de3b94a255d145017ed6efb3a5c9e79c/config/jail.conf#L36
  • if you start with this new version, I would preferably start with an empty database (to avoid a possibly too-large db-migration), or at least purge the database (and vacuum).
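
To sanity-check the factor-12 progression above, here is a quick sketch (assuming plain geometric growth of the ban time by bantime.factor per ban, capped at bantime.maxtime, as the config comments describe):

# ban-time progression: bantime * factor^n, capped at maxtime
bantime=7200; factor=12; maxtime=$((30*86400))
t=$bantime
for n in 1 2 3 4; do
  [ "$t" -gt "$maxtime" ] && t=$maxtime
  echo "ban #$n: $t s ($((t/3600)) h)"
  t=$((t*factor))
done
# -> 7200 s (2 h), 86400 s (24 h), 1036800 s (288 h = 12 d), 2592000 s (720 h = 30 d, capped)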

  • you should set bantime.increment = true in your jail.local;
  • you don't need both recidive jails repeater and dead anymore; simply use the config values from the [honeypot] example above;

Ok, let me explain this point: we do not use fail2ban to actually ban anything. It only puts the IPs into ipsets, and only the repeater and dead ipsets are then exported as simple ASCII IP lists and provided to the banning routers.

I've already built similar things.

only the repeater and dead ipsets are then exported as simple ASCII IP lists and provided to the banning routers.

The question is why you can't do it directly in the honeypot jail (just initially after 3 attempts), see:

[honeypot]
maxretry = 3
bantime = 2h
# incremental enabled:
bantime.increment = true
# factor 12 means the ban time grows 12-fold with each next ban, so 2h, 1d, 12d, ...
bantime.factor = 12
# ban time grows up to 30 days:
bantime.maxtime = 30d

In this case the intruder visited the honeypot 3 times (comparable with your previous config) and gets banned (blacklisted/exported/whatever)...
In contrast to your previous config/version, it will put a "bad" IP into the ban earlier.
And you can just use a longer time for your "banning routers", so the IP will be unbanned in fail2ban via ipset (but never reach it again while still "banned" in your routers).

Also, you'll have another opportunity to accumulate all "bad" IPs for your banning routers. Something like this:

# get "bad" IPs (banned more as once):
dbpath='/var/lib/fail2ban/fail2ban.sqlite3'
jail='honeypot'
minbancount=1
fail2ban-python -c "
dbpath='$dbpath'; 
import sys, logging;logging.basicConfig(stream=sys.stderr, level=logging.ERROR);
from fail2ban.server.database import Fail2BanDb; db = Fail2BanDb(dbpath); 
rows = db._db.cursor().execute('''select ip from bips 
  where jail = ? 
  and timeofban + bantime > cast(strftime('%s', 'now', 'localtime') AS int) 
  and bancount > ?''', 
  ('$jail', $minbancount))
print('\n'.join(map(lambda r: r[0], rows)))
" > /tmp/ips4ban-router.txt
echo IPs found: $(wc -l /tmp/ips4ban-router.txt)
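
If your routers can consume ipsets directly, the exported list could also be bulk-loaded into a set. A sketch (the set name ban-router and the maxelem value are illustrative; the default maxelem of 65536 is too small for 500K+ entries):

# create a sufficiently large set if missing, then bulk-load the export:
ipset create ban-router hash:ip maxelem 2000000 -exist
sed 's/^/add ban-router /' /tmp/ips4ban-router.txt | ipset restore -exist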

The thing is this: the honeypot gives an idea of the true number of hits overall. Nevertheless, we cannot use it for banning, because it may be fed with false positives via spoofing.
On the other hand, spoofing becomes less significant the more retries there are.

Ok, this is the maximum CPU-time-burning version you suggest. It is nowhere near "ipset list", which needs almost no CPU time at all.
Come on ...
We want to be able to run it at all in terms of RAM usage. We don't want an additional problem regarding CPU time.
The logs you write into the database are useless most of the time (i.e. all of the time in our case). Is there some way to not put them into the db file at all?
Really, everything we are looking for is the IP, the ban time, and the number of retries. That's about it. In that case the db can easily be held in RAM by the system via buffering.

Ok, this is the maximum CPU-time-burning version you suggest.

I didn't suggest something you didn't mean - you yourself wrote that you use ipset for the honeypot jail, or am I wrong?
If I understood correctly, you do the following:

  • the honeypot jail finds 1 attempt from an IP and "bans" it (you said ipset);
  • if it gets banned a second/third time (repeater and dead), it is exported as simple ASCII to your banning routers;

If so, then I don't understand why my suggested solution should be more CPU-intensive. Can you explain what all the jails are really doing now?

The logs you write into the database are useless most of the time

You could simply disable the database: dbfile = None in the Definition section of your fail2ban.local
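
For reference, a minimal sketch of that override (section name per the stock fail2ban.conf; note this overwrites an existing fail2ban.local, so merge by hand if you already have one):

cat > /etc/fail2ban/fail2ban.local <<'EOF'
[Definition]
# disable the persistent database completely:
dbfile = None
EOF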

We want to be able to run it at all in terms of RAM usage.

As I already wrote, currently there is an "issue" with the failure info kept in the fail/ban-managers (which eats memory), so it's almost impossible with mainstream fail2ban as-is (not customized).
E.g. one thing you can do is to set self.maxEntries = 1 (ATM there is no config value for this) here:
https://github.com/fail2ban/fail2ban/blob/d3442742716e8253ca592831ae61b51305571fb1/fail2ban/server/failmanager.py#L46
Etc.
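
Since there is no config option for it yet, the change has to be patched into the source. A hedged one-liner (the install path is illustrative; locate failmanager.py for your installation first):

# find the module path, then keep only the last matched log-line per ticket:
fail2ban-python -c 'import fail2ban.server.failmanager as m; print(m.__file__)'
sed -i 's/self.maxEntries = 50/self.maxEntries = 1/' \
  /usr/lib/python2.7/dist-packages/fail2ban/server/failmanager.py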

I have my own forks which I use for similar purposes as you (honeypot, tarpit, etc.). So I can try to back-port all these features (configurable/optional) to mainstream fail2ban. Unfortunately, I permanently lack the time.

I didn't suggest something you didn't mean - you yourself wrote that you use ipset for the honeypot jail, or am I wrong?

The thing is: whereas we get the IP list from an "ipset list" command, you get it from inside your database (file) via a select. That is for sure quite a lot more complex in processing.

If I understood correctly, you do the following:

  • the honeypot jail finds 1 attempt from an IP and "bans" it (you said ipset);

Correct.

  • if it gets banned a second/third time (repeater and dead), it is exported as simple ASCII to your banning routers;

They are put into two ipsets (rep and dead) and exported every 15 mins by "ipset list" (you do know the ipset tool?)

If so, then I don't understand why my suggested solution should be more CPU-intensive. Can you explain what all the jails are really doing now?

Exactly the above.

The logs you write into the database are useless most of the time

You could simply disable the database: dbfile = None in the Definition section of your fail2ban.local

But this way we would lose almost all information in case of a host failure. Remember that the history may go back 90 days ...
The db in itself is ok; only the information put into it is way too much.

We want to be able to run it at all in terms of RAM usage.

As I already wrote, currently there is an "issue" with the failure info kept in the fail/ban-managers (which eats memory). E.g. one thing you can do is to set self.maxEntries = 1 (ATM there is no config value for this).

What does this mean? maxEntries = 1 compared to maxEntries = 50 as it is currently?

Really, as far as I understand the db, all we need is to not write any log-lines from the regex matches to it...
Or not?

you get it from inside your database (file) via a select

This was just an example (of what you could also do)... BTW, a single select over 1M rows every 15 minutes is not what I'd call CPU-time burning...

But OK, I understand now what you meant...

What does this mean? maxEntries = 1 compared to maxEntries = 50 as it is currently?

maxEntries = 1 means that the ticket (the info about the failure, which the fail-/ban-manager holds) contains only the last failure entry (matched log-line); 50 means up to 50. It's currently not possible to disable it completely (set it to 0).

Really, as far as I understand the db, all we need is to not write any log-lines from the regex matches to it...

No, because the active ticket also contains the current failures (matched log entries) of the IP.
This should first be made deactivatable in order to save memory.
The DB plays a secondary role.

Ok, what I don't understand is: why do you drag the whole matched log-line along with the ticket, and not only the match time/date?

No, because the active ticket also contains the current failures (matched log entries) of the IP.
This should first be made deactivatable in order to save memory.

I guess it is exactly what we would need ...

Thanks for listening to our explanations. Hopefully the above can be made available ...

I tried your idea to shrink the database, and it turned out as I expected:

INFO:fail2ban.database:Connected to fail2ban persistent database '/log/fail2ban/fail2ban.sqlite3'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.6/site-packages/fail2ban/server/database.py", line 95, in wrapper
    return f(self, self._db.cursor(), *args, **kwargs)
  File "/usr/local/lib/python2.6/site-packages/fail2ban/server/database.py", line 628, in purge
    (MyTime.time() - self._purgeAge, ))
MemoryError
SQL error: disk I/O error

Which means the code is not able to handle a database of that size at all.
I declare the whole database code completely broken by design, up to the tried 0.10 version. You may be able to handle 5 fails a day on an ssh port, but for serious usage it is useless. One can indeed wonder why you are using a database at all for cases where the whole _needed_ data would fit into RAM anyway. I guess the whole idea of your database needs a deep rethink.

Please:

  • drop the log lines completely from the db
  • purge all unneeded (i.e. outdated) data from the db on a configurable timeline

Thank you.

Just to comment on 'Normally, if some IPs get banned for several days, they never or only very rarely come back.' :)
I banned offenders for several days (5+); we had peaks of 500+ jailed IPs.
And this morning, after two weeks of intermittent problems, I implemented a permanent ban, because users had weird behaviors in a critical web app again.
Just looked at the sshd jail after a few hours of permanent ban today.
More than 13,000 entries in the sshd jail. Oops.
We don't have control over the firewall to change the ssh port, and alas ssh access is required for support issues. So far so good (0.9 something), no excessive memory consumption, and if more is required we can change the configuration of the virtual server accordingly.
The next step, if we can't contain this, will be to move away from the default port, but that may not be a bullet-proof solution depending on the firewall in place.

It sucks... several IPs were from India, where I suspect many computers are not properly protected, and given the size of IT businesses over there, recruitment of bots is probably relatively easy.
(Sigh)

And? What are you trying to tell us?

More than 13,000 entries in the sshd jail.

Well, it is indeed a lot, but your idea of persistent banning actually makes it worse - this number will grow with each IP going into a persistent ban (and the list will never get smaller unless manually unbanned or flushed).
Whereas with an incremental ban, only the "bad" IPs would stay in this list in the long term (all others disappear if they don't repeat new attempts).

Just by the way.
The issue "high memory consumption" is still there and would be resolved soon. Be patient.

I disagree; after 2 weeks we never caught enough intruders to get rid of this DoS.
I initially set 'detention' at 1 day, then 3 days. Still having problems.
With a one-day jail time it peaked at 450 IPs; with 3 days, at 1,300.
And now with the permanent ban, it peaked at more than 13,000 in less than a day.
There are fewer than 5 individuals allowed to use SSH on that server...
The size of the army of bots probably increased to try to overcome the shorter jail time.

How do you expect to discourage such attacks if you don't grab all the bots at once?
We experienced severe denials of service for two weeks because of the sheer size of this 'army'.
I am not convinced that increasing this to 5 days would have changed the issue much.

Sorry if that was already discussed:
among those 13,000 IPs, how many /24 networks do they span? Any overlap with legit IPs? I just wonder if maybe banning whole subnets would be OK for you?

That would require a significant effort to analyze. I don't think the IP list will increase much more.
So far it's been stable around 13,800 public IP addresses.
Either they will launch another division in the following weeks, or they'll stop.
If that number doesn't increase in the next month, it will most probably be the end of the DoS.
My sole goal is to discourage them... permanently :)
My customer offers some potential for stealing intellectual property. The bots chose a bad entry point to try to break in, but they are not managed intelligently. I hope that in a month the perpetrators will realize that this server is impenetrable.

Now, is 13,800 a huge 'army' of bots? I have no idea. As I see it, aside from a permanent ban, there's not much you can do to:

a) size the swarm trying to break in;
b) limit the effects of the DoS by jailing the bots as fast as possible (kind of a Blitzkrieg strategy) and not allowing them to come back to the front to fight after a few days.

This could be going on for weeks/months otherwise. You can't just hope that the infected systems will be sanitized by support people all over the place, especially if PCs are infected in areas where anti-viruses are not popular, or if they run older versions of Windows without updates.

My use case is easy; the service being attacked is restricted to a very small user population. Dealing with attacks on a service used by a broad group of geographically dispersed users is something else.

Obviously, half a million banned addresses is a lot to manage and can put stress on the resources used by fail2ban.
I am not there yet, but I suspect that this DoS will not be the last one; we had a few small ones where the defaults were just fine.
I just hope that the next one can't mobilize hundreds of thousands of bots.
This stupid brute-force approach should have a maximum lifespan, but it depends entirely on the axxhole(s) driving the attacks.

Don't ban complete subnets. You harm people in dialup nets where a few low-skilled people are online.
Anyway, this thread is about big numbers of bans, far beyond 50,000 to 100,000. This is where the real troubles begin with fail2ban. Ban times of 90 days or more are usual. Most IPs are part of botnets and have high retry rates.
Under these circumstances the database is useless, because it is filled mostly with garbage, so it has to be turned off. Next, the socket-close problem arises (mentioned in another thread). Very likely the memory map for the banned IPs has problems, too: either it eats up mem, or it is filled with unneeded info that is not useful in such situations. Those are the real-life experiences up to now.
Maybe 0.11 will help and fix it?

Thank you for the hints. I'll have a basis to start with if a huge pile of s..t bots ends up hammering one of the servers I maintain, as if my job were not complicated enough by itself :(
Suddenly I feel lucky today. 13,800 bots? Bof, piece of cake :)

That would require a significant effort to analyze

Not really, if all the logs are available... Pretty much everything could be done with a set of grep/awk commands. But that was just an idea.

Another, more permanent measure I use on some servers is knocking, if there is some other external service I could monitor, e.g. a website. I then ask users to go to some "secret" URL first; fail2ban listens on the Apache log and opens the ssh port for that IP as its "banning" action ;-)

I will keep that idea of a secret URL handy.
Thank you.

Don't ban complete subnets. You harm people in dialup nets where a few low-skilled people are online.

I do appreciate the danger here, and in some cases it wouldn't be appropriate.
But if the legit user pool is not too broad geographically, I think it might be interesting to see how many small /24 subnets constitute the botnet. 13,000 IPs is a tiny fraction of even the IPv4 space. So if there are any clusters within the botnet (e.g. there are fewer than 4K unique /24 networks), I would consider banning those instead, to reduce fail2ban and firewall load 3-fold. I really don't think there would be any overlap between botnet subnetworks and legit users.
On some servers I just permanently ban even /16s where I know that my users would never come from those (eg .can) networks. But my use cases aren't large deployments, to be honest.

I'm not sure if this is the right issue to comment on or if I should start a new one. I'm setting up fail2ban because of a large number of attacks against one of my servers. However, in doing this I've noticed that the profile of attacks has changed since I last reviewed it.

An older attack might have looked like this (for example):

for every IP in a very large list
    for every username in a common list of names
        try to connect to IP as username with blank password

I'm now seeing evidence of this:

for every username in a common list of names
    for every IP in a very large list    
        try to connect to IP as username with blank password

It's the same attack with the same risk, costing the same resources, but the "brute force" is happening over a number of months, not minutes. I'm seeing around 500 of these requests a day, but each IP is touching my server maybe once per week.

I think it's relevant to this ticket because, if I want to use fail2ban to assist with blocking these, I'm going to need some very long bans and long timeouts on filters. If fail2ban caches too much in memory (including fail tickets), then the hackers have successfully found a way to circumvent fail2ban's usefulness.
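
For reference, the kind of jail tuning this implies, as a sketch (0.10/0.11 syntax; the values are illustrative - the long findtime is what lets fail2ban correlate attempts that arrive a week apart):

# append to jail.local; merge with an existing [sshd] section if you have one:
cat >> /etc/fail2ban/jail.local <<'EOF'
[sshd]
enabled = true
# two hits within two weeks are enough to ban:
maxretry = 2
findtime = 14d
# long initial ban, growing with repeats:
bantime = 90d
bantime.increment = true
EOF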

It is relevant, indeed.
And the issue will be processed (I'm not ignoring it, just permanently busy, with no time to finish).

Something like this just goes into logrotate, every day:

sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 "DELETE FROM bans WHERE date('now', '-2 day') > datetime(timeofban, 'unixepoch'); VACUUM;"
systemctl restart fail2ban
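
The same trim can also run as a daily cron job instead of being tied to logrotate. A sketch (paths are illustrative):

cat > /etc/cron.daily/fail2ban-db-trim <<'EOF'
#!/bin/sh
# drop ban rows older than 2 days, then compact the db file:
sqlite3 /var/lib/fail2ban/fail2ban.sqlite3 \
  "DELETE FROM bans WHERE date('now', '-2 day') > datetime(timeofban, 'unixepoch'); VACUUM;"
systemctl restart fail2ban
EOF
chmod +x /etc/cron.daily/fail2ban-db-trim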

@couling

"I'm seeing around 500 of these requests a day, but each IP is touching my server maybe once per week."

What's the point of brute-forcing then?

'hackers have successfully found a way to circumvent fail2ban's usefulness'

With 1 attempt at bruting the password from each IP per week?

@reardenlife it's exactly the same attack with the same chance of success and the same load on the system. An attempt per week from many different sources, accumulating to 500 per day, carries almost exactly the same risk and potential damage to my system as 500 per day from the same source. It's just that blocking it is so much harder.

@couling

"An attempt per week from many different sources accumulating to 500 per day carrys with it almost exactly the same risk and potential damage to my system as 500 per day from the same source."

Theoretically? Yes.
But the practical goal is not to lower the chances of the password being brute-forced at all costs, but to lower them well enough that the password will not be brute-forced in the real world.

Likewise, theoretically, smoking tobacco increases the chances of getting lung cancer, but practically, a few cigarettes per week will not do any harm. :)

The purpose of fail2ban is to lower the pace of brute-force attempts from one IP. It does its job well, and there is no point in building something on top of it.

@reardenlife I am not asking to build something on top of it. Please be careful to read my comments in the context of the issue they are on.

I have simply tuned fail2ban to ban based on the profile of attacks present in the real world and discovered that this will blow up memory usage.

If you truly believe that an attack conducted over years is less dangerous than the same attack conducted over hours, then good luck to you.

@couling

I have simply tuned fail2ban to ban based on the profile of attacks present in the real world and discovered that this will blow up memory usage.

I do agree that the author has to fix the memory-usage issue.
I happened to notice such a memory leak purely by accident. Yet I wonder what would happen if I left the server alone for a few months? fail2ban would probably grow until a webserver or DB crashes due to insufficient memory. This is sad and it has to be fixed. No question about that.

"If you truly believe that an attack conducted over years is less dangerous than the same attack conducted over hours then good luck to you."

What I am talking about is the probability of the password being brute-forced, provided an attacker uses 1 brute attempt per IP per week. I calculated that it would probably take about 2,000 years with a botnet of 1 billion machines, provided the target password is about 8 characters long.

So it doesn't matter whether he conducts the attack within hours or over years; he just will not get in.

Likewise, one can smoke a few cigarettes per week without getting lung cancer, simply because one will die sooner from causes other than lung cancer. :)

@reardenlife this math is deeply flawed, and so is your health advice. Password entropy isn't the issue; people would just move to 10-character passwords if it were. And frankly, lung cancer is a horrible way to go; the increased risk from one per week is much higher than you calculate. With that, I see no reason to continue this conversation further.

@reardenlife
you really have not understood the profile of the attack being talked about. We are not talking about an attack type with one try per week; we are talking about a distributed attack with a low repetition rate of attacking IPs. Which means the _same_ attack is using a lot of different IPs, but with very low re-use of any single attacking IP. Understand?
Generally, fail2ban was probably designed with the idea that someone with one IP tries to hack your ssh or some other service. But this is not the major use case these days. Our filters, fed by fail2ban configs, drop 35,000 packets a _minute_, coming from around 250,000 different IPs. The setup peaked at 214,000 drops a minute this week. We are not interested in a history of log-lines per IP in some database. We only need the last occurrence and probably something like an occurrences-per-month rating.

I'll close this as fixed by #2402 (to avoid large memory consumption one can set both options to 0, see the PR for a config example).
It is now merged in 0.10 (and I'll still merge it into 0.11 today).
Although it is only a first shot, it should work. I have another WIP branch which is more resource-sparing but still in development (because it completely rewrites several modules of fail2ban, it has, for example, miserable test coverage and even still fails some tests, etc.). I'll try to provide it before 0.11 gets released.
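
For reference, a sketch of setting both options to 0 (the option names maxmatches and dbmaxmatches are as introduced for this fix; verify them against the config example in the PR):

# jail.local - keep no matched log-lines in the in-memory tickets:
cat >> /etc/fail2ban/jail.local <<'EOF'
[DEFAULT]
maxmatches = 0
EOF
# fail2ban.local - store no matched log-lines in the database
# (merge by hand if a [Definition] section already exists):
cat >> /etc/fail2ban/fail2ban.local <<'EOF'
[Definition]
dbmaxmatches = 0
EOF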

Merged in 0.11 too.
Please test; any feedback is welcome...
