I noticed that badbots isn't catching much. Digging into why, I see that it's because it is using data from 2013.
https://github.com/fail2ban/fail2ban/blob/0.11/config/filter.d/apache-badbots.conf#L22
Are there any plans to update badbots (i.e. re-run gen_badbots)?
Looks like gen_badbots itself needs updating. Running it, I see that badbots.tmp ends up as:
Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|8484 Boston Project v 1\.0|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|8484 Boston Project v 1\.0|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|8484 Boston Project v 1\.0|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B|8484 Boston Project v 1\.0|Atomic_Email_Hunter/4\.0|atSpider/1\.0|autoemailspider|bwh3_user_agent|China Local Browse 2\.6|ContactBot/0\.2|ContentSmartz|DataCha0s/2\.0|DBrowse 1\.4b|DBrowse 1\.4d|Demo Bot DOT 16b|Demo Bot Z 16b|DSurf15a 01|DSurf15a 71|DSurf15a 81|DSurf15a VA|EBrowse 1\.4b|Educate Search VxB|EmailSiphon|EmailSpider|EmailWolf 1\.00|ESurf15a 15|ExtractorPro|Franklin Locator 1\.8|FSurf15a 01|Full Web Bot 0416B|Full Web Bot 0516B|Full Web Bot 2816B
It repeats the same data 5 times, and the last copy is cut off early.
Looking at user-agents.org, the site itself says it was last updated years ago: "Updated: 10/29/2014 22:38h".
So it may not even be worth fixing the scraper for user-agents.org.
It might be better to look for something that is actually updated.
Found this; it has all the user agents I've been seeing that I want blocked:
https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/blob/master/_generator_lists/bad-user-agents.list
This generates a new badbots line (gen_badbots could be updated to do the same):
wget -q -O- "https://raw.githubusercontent.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/master/_generator_lists/bad-user-agents.list" | uniq | sed -e 's/\\ / /g' | sed -e 's/\([.\:|()+]\)/\\\1/g' | tr '\n' '|' | sed -e 's/|$//g'
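For anyone who wants to use the generated line without patching the shipped filter, here is a minimal sketch of a local override (standard fail2ban .local convention; paste the pipeline output in place of the placeholder):

# /etc/fail2ban/filter.d/apache-badbots.local -- overrides values from apache-badbots.conf
[Definition]
# paste the alternation produced by the wget pipeline above in place of the placeholder
badbots = <output of the wget pipeline above>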
I'll let you know in a few days how it works out, but I suspect it will do much better than the old badbots list.
@thezoggy, @mitchellkrogza
Interested in making a PR?
Thanks @sebres and @thezoggy 👍 I've put extensive work into maintaining that list, going on 16+ months now, and it's used in two projects: Nginx Bad Bot Blocker and Apache Bad Bot Blocker.
@sebres I'll look into a PR for this unless @thezoggy gets to it before me; I have a busy week ahead as it is 😁
Noticed that the updated list didn't work last night after the changes, because the exact user agent in the failregex is anchored to the end. Using a word boundary now so it can match anywhere in the user agent and take care of more variants.
Before, "sysscan" doesn't match in:
www.testing.info:80 155.94.88.58 - - [30/Oct/2017:14:57:33 +0000] "GET / HTTP/1.0" 200 183482 "-" "sysscan/1.0 (https://github.com/robertdavidgraham/sysscan)"
because the user agent is "sysscan/1.0 (https://github.com/robertdavidgraham/sysscan)", which ends in "sysscan)" rather than exactly "sysscan", and the regex anchors the badbot pattern to the end of the quoted user agent:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
Also found out that the list isn't very useful with a case-sensitive match, so let's fix that as well:
failregex = (?i)<HOST> -.*"(GET|POST|HEAD).*HTTP.*(?:%(badbotscustom)s|%(badbots)s).*"$
Using fail2ban-regex for testing; so far so good. Working with Mitchell to get his list optimized so we both benefit from it: https://github.com/mitchellkrogza/apache-ultimate-bad-bot-blocker/issues/50
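For reference, the pattern can be sanity-checked against a single log line, e.g. the sysscan one above, by passing the line directly to fail2ban-regex (path assumes the stock filter location):

fail2ban-regex 'www.testing.info:80 155.94.88.58 - - [30/Oct/2017:14:57:33 +0000] "GET / HTTP/1.0" 200 183482 "-" "sysscan/1.0 (https://github.com/robertdavidgraham/sysscan)"' /etc/fail2ban/filter.d/apache-badbots.conf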
Will report back in a few days once I've tested things a bit.
May I ask where this is at, as far as a drop-in replacement for fail2ban apache-badbots?
doh, forgot all about this on my quest to purify my system from badbots. let me circle back to this
Cool!
Also found out that the list isn't very useful with a case-sensitive match, so let's fix that as well:
failregex = (?i)<HOST> -.*"(GET|POST|HEAD).*HTTP.*(?:%(badbotscustom)s|%(badbots)s).*"$
@thezoggy This regex takes AGES to process a 100 MB log file. More than 30 minutes!
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*(?:%(badbotscustom)s|%(badbots)s).*"$
= 1 min
(?i) just makes it case-insensitive.
Normally you are processing the file in real time, one line at a time, so the time difference is pretty tiny.
But let's see how bad it is on a large file, as you noted.
Using a low-powered VPS:
System Info
-----------
Processor : Intel(R) Xeon(R) CPU E3-1240 v3 @ 3.40GHz
CPU Cores : 4
Frequency : 800.000 MHz
Memory : 5120 MB
Swap : MB
Uptime : 76 days, 19:16,
OS : Ubuntu 16.04.6 LTS
Arch : x86_64 (64 Bit)
Kernel : 2.6.32-042stab120.11
Grabbed an Apache log, trimmed down to 500 MB, with about 3 million lines:
$ wc -l apache.log
2974566 apache.log
$ ls -alh apache.log
-rw-rw-r-- 1 zoggy zoggy 500M May 11 16:17 apache.log
Now, processing with (?i) (case-insensitive):
$ time sudo fail2ban-regex --print-all-matched apache.log /etc/fail2ban/filter.d/apache-badbots.conf
Results
=======
Failregex: 234 total
|- #) [# of hits] regular expression
| 1) [234] (?i)<HOST>.....
Ignoreregex: 39 total
|- #) [# of hits] regular expression
| 1) [39] .*(\/search\?q=).*
`-
Lines: 2974566 lines, 39 ignored, 234 matched, 2974293 missed
[processed in 7045.35 sec]
real 117m26.368s
user 117m1.730s
sys 0m2.598s
And now again, case-sensitive, with (?i) removed:
$ time sudo fail2ban-regex --print-all-matched apache.log /etc/fail2ban/filter.d/apache-badbots.conf
Results
=======
Failregex: 232 total
|- #) [# of hits] regular expression
| 1) [232] <HOST>...
`-
Ignoreregex: 39 total
|- #) [# of hits] regular expression
| 1) [39] .*(\/search\?q=).*
`-
Lines: 2974566 lines, 39 ignored, 232 matched, 2974295 missed
[processed in 4265.03 sec]
real 71m7.111s
user 70m29.744s
sys 0m8.610s
So you can see it's roughly 65% slower (7045 s vs 4265 s) when matching case-insensitively.
https://stackoverflow.com/questions/32010/is-regex-case-insensitivity-slower
Hey guys, I would love to jump into this with you all. I've optimized the regex a bit, so it should be quite a bit faster. Can you check it against the logs you used before?
failregex = ^<HOST> - - \[\] "(GET|POST) /(\S+?)? HTTP/(1\.[01]|2)" \d{3} \S+? "\S+?" ".*(?i)(%(badbots)s|%(badbotscustom)s)
EDIT: I hope no one went live just yet. The final part of the regex matched the bot list against the referrer string and hit some false positives. Corrected now.
bump. Why is semrushbot not included in badbots definition by default? How about MJ12bot? Or [anything]bot.com in the headers? That is an attack! It should be blocked by default!
Can we get an updated list, here in this thread, and also into a PR so that fail2ban actually starts, you know, banning modern bot attacks?
I think this should not only match the exact user agent, e.g. the full string
"Mozilla/5.0 [en] (X11, U; OpenVAS-VT 9.0.3)"
but also a substring like "OpenVAS".
Put the 'exact' ones you want to match in the badbotscustom then.
User agents are so easy to change that you should match on something unique, yet vague enough to be effective. If you were overly specific, your regex list would be much longer and require more resources.
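For example, substring entries for the agents already mentioned in this thread could go straight into badbotscustom (a sketch, not an exhaustive list; escape any regex metacharacters if you add others):

badbotscustom = OpenVAS|MJ12bot|SemrushBot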
Put the 'exact' ones you want to match in the badbotscustom then.
That sounds strange... A humongous unmaintainable conf-file would be the result.
I would NOT like to match 'exact', but the current failregex does exactly that.
I changed it:
# failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"(?:%(badbots)s|%(badbotscustom)s)"$
failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*(?:%(badbotscustom)s|%(badbots)s).*"$
which does the job for me, but I don't understand why this is not the default.
I created a custom "apache-adminrsc" filter to catch all the hits on my web server to things that I do not host, so for example I would add "wiki" to it if I noticed recurring hits on that (one-offs I don't bother with).
Together with badbots, it seems to cut down a lot of stuff. The key is to have a short fuse ("apache-adminrsc" is a one-hit ban), and a lengthy ban (ditto, 24-hours).
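For anyone curious what that looks like in practice, a rough sketch (the failregex and log path below are my guesses for illustration, not the actual files; only the "wiki" example comes from the comment above):

# /etc/fail2ban/filter.d/apache-adminrsc.conf -- hypothetical reconstruction
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD) /wiki.*HTTP.*"$
ignoreregex =

# jail.local entry -- one hit bans, ban lasts 24 hours
[apache-adminrsc]
enabled  = true
port     = http,https
filter   = apache-adminrsc
logpath  = /var/log/apache2/access.log
maxretry = 1
bantime  = 86400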
I had a problem with this filter (apache-badbots.conf) failing to filter a newly added bot called zgrab after the attacker had increased the request size.
There were three fixes that I implemented to stop this:
i) add the user agent to the _apache-badbots.conf_ file _exactly_ as found in the log file, _i.e._: Mozilla/5.0 zgrab/0.x - even escaping the slashes will invalidate the pattern.
fail2ban-regex --print-all-matched /var/log/httpd/access_log /etc/fail2ban/filter.d/apache-badbots.conf
...will tell you whether or not the filter is working or, at least, what it is actually hitting on.
ii) try modifying the _apache-badbots.conf_ failregex to something more along the lines of:
failregex = ^<HOST> -.*"(GET|POST|HEAD).*\/.*HTTP\/1\.1".*(\d{3}).*(\d{3}|\d{4}).*"-".*"(?:%(badbots)s|%(badbotscustom)s)"$
The above 'fix' matches the request size regardless of whether it is 3 digit (usually blocked) or 4 digit (never blocked).
For checking the regular expression, do not use the usual site that comes up when searching for "_fail2ban regular expression checker_"; https://regexr.com/ is a brilliant, well-recommended (and working) alternative.
iii) add a rewrite conditional entry to _httpd.conf_ in the Apache configuration files, but ensure that this is entered under the host, or virtual host, entry or entries.
The following is a two-part solution that blocks HTTP 1.0 requests and also serves to completely block zgrab user agent requests:
RewriteEngine On
RewriteCond %{THE_REQUEST} !HTTP/1.1$ [OR]
RewriteCond %{HTTP_USER_AGENT} zgrab [NC]
RewriteRule .* - [F]