uAssets: Bad regex in filters.txt

Created on 27 Nov 2020 · 23 Comments · Source: uBlockOrigin/uAssets

There is a regex filter on line 24380 of filters.txt. It's not clear what it is meant to do, but it is blocking legit scripts, seemingly at random, on different websites that I visit.

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p

There is no comment explaining what it does or why it is there.

I propose that it be removed.

Please advise.
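For anyone trying to work out what the pattern accepts, here is a minimal Python sketch that splits the regex into commented parts and tries it on two URLs. The first URL is an ad script quoted in this thread; the second is a made-up URL (the host `examplewidgets.com` and its path are assumptions, not a real request) that merely has the same shape. Python's `re` is used for illustration only, and the `$script,3p` options (third-party script requests only) are not modelled.

```python
import re

# The filter's regex, split into commented parts (re.VERBOSE ignores the whitespace).
PATTERN = re.compile(
    r"""
    ^https?://            # scheme
    [a-z]{7,15}           # hostname label: 7 to 15 letters, no dots or digits
    \.(?:com|pw)          # only .com or .pw
    /[0-9a-zA-Z]{11,17}   # one path segment of 11 to 17 alphanumerics
    /\d{4,5}$             # final segment of 4 or 5 digits, then end of URL
    """,
    re.VERBOSE,
)


def is_blocked(url: str) -> bool:
    """Return True if the URL has the shape the regex targets."""
    return PATTERN.search(url) is not None


# Ad script URL quoted in this thread.
print(is_blocked("http://hubmaydaybrow.com/fEzlGv0Im0sb/27969"))
# Hypothetical legitimate widget URL with the same shape (assumption).
print(is_blocked("https://examplewidgets.com/widgetloader/12345"))
```

Both calls return True: nothing in the pattern separates a random-looking ad token from an ordinary path segment of similar length.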

All 23 comments

> but it is randomly blocking legit scripts on different websites that I visit.

Example?

reason: https://github.com/easylist/easylist/issues/6476

Provide ALL the pages you found where legitimate stuff is blocked

Okay, here's one where there should be an ecard widget displaying on the page but it's being blocked arbitrarily by this mess of a regex. There will be many unintended consequences for a regex like this.

Here's the site:
https://americanpatriotsunsung.com/thank-a-vet/

@Yuki2718 first FP (false positive)

@timsayshey Sincere apologies for the inconvenience. It was added to beat revolving ad scripts, but I now see it should be adjusted more carefully.
@mapx- Can you turn it off temporarily?

@timsayshey Hi, can you give us some more examples of breakage? Fixing the regex to avoid the breakage on americanpatriotsunsung is possible, but I want to ensure the fix doesn't break other pages. The thing is, the ad servers this filter targets change so frequently that chasing them is not trivial.

@Yuki2718 test this one:

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/([a-z]+[0-9]+[A-Z]).{11,17}\/\d{4,5}/$script,3p

@mapx- Sorry, I have a working solution:

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/(?=[0-9A-Z]{0,16}[a-z])(?=[0-9a-z]{0,16}[A-Z])[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p

It's better because it doesn't care in which order the characters appear, and it also covers the case where the token contains no number. I'm just waiting for the OP's examples of other breakage.

Yours does not work:

[image]

@mapx- You're right, and apparently the reason is that uBO's regex matching is case-insensitive. I also observed this case insensitivity in https://github.com/easylist/easylist/issues/6537. Pinging @gorhill for a possible bug.
As evidence, the following filter (which requires at least one number instead of a capital letter) correctly allows the script:

/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/(?=[0-9A-Z]{0,16}[a-z])(?=[a-zA-Z]{0,16}[0-9])[0-9a-zA-Z]{11,17}\/\d{4,5}$/$script,3p
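The effect described above can be reproduced outside uBO. The sketch below is an illustration in Python, not uBO's code: it compiles both lookahead variants with and without `re.IGNORECASE`, on the assumption that the flag approximates uBO's case-insensitive matching. The "legit" URL is hypothetical (an all-lowercase path on a made-up host); the ad URL is one quoted in this thread.

```python
import re

PREFIX = r"^https?://[a-z]{7,15}\.(?:com|pw)/"
SUFFIX = r"[0-9a-zA-Z]{11,17}/\d{4,5}$"

# Requires at least one lowercase and one uppercase character in the token.
NEEDS_UPPER = PREFIX + r"(?=[0-9A-Z]{0,16}[a-z])(?=[0-9a-z]{0,16}[A-Z])" + SUFFIX
# Requires at least one lowercase character and one digit instead.
NEEDS_DIGIT = PREFIX + r"(?=[0-9A-Z]{0,16}[a-z])(?=[a-zA-Z]{0,16}[0-9])" + SUFFIX

AD_URL = "http://hubmaydaybrow.com/fEzlGv0Im0sb/27969"       # from this thread
LEGIT_URL = "https://examplewidgets.com/widgetloader/12345"  # hypothetical


def blocks(pattern: str, url: str, ignore_case: bool) -> bool:
    """Return True if the pattern matches the URL under the chosen case mode."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, url, flags) is not None


for label, pattern in (("needs upper", NEEDS_UPPER), ("needs digit", NEEDS_DIGIT)):
    for ignore_case in (False, True):
        print(label, "ignore_case =", ignore_case,
              "| ad:", blocks(pattern, AD_URL, ignore_case),
              "| legit:", blocks(pattern, LEGIT_URL, ignore_case))
```

With case-sensitive matching the uppercase variant blocks only the ad URL. Once case is ignored, `[A-Z]` also accepts lowercase letters, so the all-lowercase path is blocked too. The digit variant still lets the all-lowercase path through in both modes, because digits are unaffected by case folding.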

So, again, did you test my filter?

Yes, but that in turn misses some ad scripts, e.g. on http://tamilyogimovie.co/ it misses http://hubmaydaybrow.com/fEzlGv0Im0sb/27969.
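A quick way to see the miss is to run the "test this" candidate against two ad URLs quoted in this thread. This is a sketch in Python with `re.IGNORECASE` standing in for uBO's case-insensitive matching (an assumption for illustration, not uBO's actual engine).

```python
import re

# The candidate filter's regex, compiled case-insensitively.
CANDIDATE = re.compile(
    r"^https?://[a-z]{7,15}\.(?:com|pw)/([a-z]+[0-9]+[A-Z]).{11,17}/\d{4,5}",
    re.IGNORECASE,
)

MISSED = "http://hubmaydaybrow.com/fEzlGv0Im0sb/27969"
CAUGHT = "https://beakedpissod.com/r8lEgmDzyqTglNtG/17604"

for url in (MISSED, CAUGHT):
    print(bool(CANDIDATE.search(url)), url)
```

The first URL is not matched: the group has to consume `fEzlGv0I` (letters, then a digit, then a letter), which leaves only `m0sb` before the next slash, while `.{11,17}` still needs at least 11 more characters ahead of the final `/` and digits. The second URL is matched because its digit comes early in the token. The candidate therefore depends on where the first digit sits, not only on the mix of characters.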

It seems yours is good and there is some bug in uBO: I tested yours on regex101 and it's OK. See the internal discussion.

> for a possible bug

It's not a bug, that's by design. All matching is meant to be case insensitive. There used to be a match-case filter option, which uBO never implemented, but it was also dropped by ABP eventually.

So what is the purpose of that rather broad filter?

If it's meant to address something on tamilyogimovie.co, then please narrow it to that site.

There are legit scripts (all chars lowercase) and "bad" ones containing lowercase, uppercase and numeric chars (there are some cases with only lowercase and uppercase). The logger is presenting the real situation.

The filter above is correct for the real URLs, but not once upper and lower case are treated as the same (as in uBO). When, for example, a bad script uses only lowercase and uppercase chars, we have no means to distinguish it from legit scripts (all lowercase).

> and "bad" ones containing lowercase

But which sites are using these bad ones? Surely we shouldn't assume they can be present on all sites?

It's a large category of such crap:
https://github.com/easylist/easylist/issues/6476

13x4.com/view/Rsj9ZdxbHk => crap: https://beakedpissod.com/r8lEgmDzyqTglNtG/17604
http://tamilyogimovie.co/ => http://hulkflugarb.com/rWiSeTtqQ9d/27967 and http://hubmaydaybrow.com/fEzlGv0Im0sb/27969
https://uptomega.me/49rf3ow09i12 => https://ledmophemp.com/rnkHzsMoBCNP5k/12790

https://1filmy4wap.com => https://emolapnay.com/rlnzLzvjGRKBY/20600

The last case has no numeric chars.
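The listed URLs can be checked mechanically. The following Python sketch runs the original regex (case-sensitive here, purely for illustration) over the five ad URLs above and reports which character classes appear in each token; the helper name `token_classes` is made up for this example.

```python
import re

ORIGINAL = re.compile(r"^https?://[a-z]{7,15}\.(?:com|pw)/[0-9a-zA-Z]{11,17}/\d{4,5}$")

AD_URLS = [
    "https://beakedpissod.com/r8lEgmDzyqTglNtG/17604",
    "http://hulkflugarb.com/rWiSeTtqQ9d/27967",
    "http://hubmaydaybrow.com/fEzlGv0Im0sb/27969",
    "https://ledmophemp.com/rnkHzsMoBCNP5k/12790",
    "https://emolapnay.com/rlnzLzvjGRKBY/20600",
]


def token_classes(url: str) -> set:
    """Return the character classes present in the first path segment."""
    token = url.split("/")[3]  # ['https:', '', host, token, digits]
    classes = set()
    if any(c.islower() for c in token):
        classes.add("lower")
    if any(c.isupper() for c in token):
        classes.add("upper")
    if any(c.isdigit() for c in token):
        classes.add("digit")
    return classes


for url in AD_URLS:
    print(bool(ORIGINAL.search(url)), sorted(token_classes(url)), url)
```

All five URLs match the original regex. Four tokens mix lowercase, uppercase and digits, while the emolapnay.com token has only lowercase and uppercase letters, so a rule that requires a digit would miss it.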

It's still a limited set of sites; there are filters in EasyList whose domain= option lists hundreds of hostnames. Such risky broad filters meant to apply on a limited set of sites should be narrowed to those sites only. I consider it worse to risk breaking a lot of legitimate sites than to miss blocking on a limited set of fishy sites. The kind of breakage reported here makes uBO look bad and makes it more difficult to argue that uBO is an install-and-forget blocker.

Now regarding the case insensitivity issue, if it's something _really_ needed, then this should go into a new issue -- I may choose to support a match-case option but that would be only for regex-based filters.
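As a rough illustration of the narrowing suggested above, this Python sketch assembles a filter line that keeps the original regex and options and appends a `domain=` list built from the sites named in this thread. The helper `scoped_filter` is hypothetical and the resulting line is only an example, not a filter that was actually committed.

```python
# Sites named in this thread as loading the ad scripts.
AFFECTED_SITES = ["13x4.com", "tamilyogimovie.co", "uptomega.me", "1filmy4wap.com"]

# The original regex, in filter syntax (delimited by slashes).
REGEX = r"/^https?:\/\/[a-z]{7,15}\.(?:com|pw)\/[0-9a-zA-Z]{11,17}\/\d{4,5}$/"


def scoped_filter(regex: str, sites: list) -> str:
    """Append the existing options plus a domain= list separated by '|'."""
    return f"{regex}$script,3p,domain={'|'.join(sites)}"


print(scoped_filter(REGEX, AFFECTED_SITES))
```

The list would have to be maintained as sites come and go, which is the trade-off against a broad filter that applies everywhere.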

> but it was also dropped by ABP eventually

Source: https://issues.adblockplus.org/ticket/7318/

But I see now that it seems they never went ahead with this change, so apparently ABP still supports match-case.

I won't open an issue, at least not for this: since there's no guarantee that a case-sensitive filter won't cause any FPs, I think the suggested approach of listing all the domains where the filter is useful will be safer. But then we have denyallow, so there is probably no need for a regex. It was worth trying though, as these sites come and go one after another.

However, EasyList issue 6537 is worth remembering. TL;DR: an EL filter
/^https?:\/\/.*\/.*sw[0-9._].*/$domain=spankwire.com
wrongly trapped a benign request which happened to include sW. It was fixed by narrowing the scope to $script,xhr, but such a case can occur even within the right scope. All filter authors know matching is case-insensitive but probably tend to assume regex is an exception; in fact, EL and regional lists often put A-Z alongside a-z. Problems like this can still be prevented if regexes are written carefully, though.
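The kind of accidental match described above is easy to reproduce. In this Python sketch the URL is invented (a made-up host and a random-looking path segment containing `sW9`), and `re.IGNORECASE` is assumed to approximate how the blocker matches.

```python
import re

# Regex part of the EasyList filter discussed above.
EL_PATTERN = r"^https?://.*/.*sw[0-9._].*"

# Hypothetical benign request whose path happens to contain "sW9".
BENIGN_URL = "https://cdn.example.com/static/aBsW9x/app.js"

matched_ignoring_case = re.search(EL_PATTERN, BENIGN_URL, re.IGNORECASE) is not None
matched_case_sensitive = re.search(EL_PATTERN, BENIGN_URL) is not None

print("ignoring case: ", matched_ignoring_case)
print("case-sensitive:", matched_case_sensitive)
```

The URL is caught only when case is ignored, because `sw` then also matches `sW`. Any short literal inside a broad regex carries this risk once random mixed-case tokens appear in paths.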

I decided I will add match-case, but it will be meaningful only for regex-based filters.
