Https-everywhere: Create `fetch-whitelist.txt`

Created on 8 Jun 2017 · 24Comments · Source: EFForg/https-everywhere

Currently, ruleset-coverage-whitelist.txt is used for both the fetch test and the rules (coverage, formatting) test.

There are some instances where you might expect the fetch test to fail, but the rules test should still pass. For instance, https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/Bitly_branded_short_domains.xml contains so many domains that the probability of momentary failure for any one of them causing a failure in the fetch test itself is high. Another instance is domains contained in country-specific rulesets, not accessible via our testing infrastructure.

This raises a problem. Conceptually, the two tests should have separate whitelists. Practically, when we go and run the comprehensive https://github.com/EFForg/https-everywhere-full-fetch-test, we will be disabling rules which are not whitelisted because they shouldn't bypass the coverage test.

The way to fix this is, for right now, simply copying the current ruleset-coverage-whitelist.txt to a new file, fetch-whitelist.txt and and appending those files that should be bypassing the fetch test to it.

ruleset-testing

Source

Hainish

All 24 comments

I'm having trouble wrapping my head around the problem you're trying to solve and the solution you're using to solve it.

For the Bitly-type situation, how about make it so that any ruleset with more than some N number of rules cannot be disabled by https://github.com/EFForg/https-everywhere-full-fetch-test?

The problematic country-specific rulesets are AFAIK all in China and these are already default_off with some appropriate comment.

I feel like the whitelist should be considered a temporary hack to prevent failures on old, complicated, unreviewed rulesets, and we should work on eliminating the list rather than doubling down on it.

jeremyn on 8 Jun 2017

@jeremyn there are edge-cases (as noted here) where a fetch test will suffer a failure, even though we know the rule works, due to discrepancies between curl and the functionality of the browsers.

Ideally, we'd be able to perform these tests with an actual browser, driven by selenium. Practically, this is difficult because the time required to test via selenium is siginificantly higher than curl. The instances of expected failure are relatively rare, and slowing down our tests for the sake of ensuring in these rare cases the fetch test passes is impractical.

Given this situation, it's best to enumerate the instances where these edge cases occur, rather than to merge them without any form of accounting. That would cause problems down the road, especially when we go and do the comprehensive https://github.com/EFForg/https-everywhere-full-fetch-test disabling of rules.

Hainish on 9 Jun 2017

Surely, this applies to a very low number of rules. I would much rather start with a empty whitelist that we slowly populate. Otherwise, we end up getting stuck with huge whitelists that take forever to be clear from false positives (like ruleset-coverage-whitelist.txt).

Also, how do you plan on cleaning this whitelist? I fail to think of an automated process for this. If this has to be manual, it's way more reasonable to start with a much smaller, more manageable list

Even if we start with an empty whitelist, I think it wouldn't be too hard to populate it quickly with the relevant rulesets. Such rulesets have often been causing trouble and discussions during the review process so contributors and reviewers probably remember them.

Bisaloo on 10 Jun 2017

@Hainish Thanks for your explanation.

Instead of having multiple lists, the existing list could become something like this:

File name,hash,auto-pass rules test,auto-pass fetch test
Example.com.xml,<hash>,X,X
Example.net.xml,<hash>,X,

An entry with no X is removed from the file. To start this CSV file, we take the existing ruleset-coverage-whitelist.txt and put X in both auto-pass columns.

The combination ,,X should never happen because passing the rules test is a prerequisite to attempting the fetch test. This means the auto-pass rules test column is redundant and we could remove it, however I think the extra column makes the file easier to reason about, even if it is less elegant.

I agree with @Bisaloo in that we need to minimize long lists of anything wherever possible. Lists are maintenance burdens.

(As a side-note, I've realized that the whitelist file is actually a blacklist. The logic is "run the tests on everything except the specific items we've enumerated", which is blacklisting. Or from another angle, the items in the list are known to be bad, not good.)

jeremyn on 10 Jun 2017

👍1

@cschanaj Thanks, and I agree that my earlier logic was flawed.

Neither rules nor fetch are prerequisites for each other. A rules failure might be something like insufficient tests, so fetch can still be expected to work. A fetch failure might be grammatically correct but have the curl problem described earlier, or a very large number of domains such that a connection problem on one of them is likely. So all three of these are possible:

File name,hash,auto-pass rules test,auto-pass fetch test
Example.com.xml,<hash>,X,X
Example.net.xml,<hash>,X,
Example.org.xml,<hash>,,X

jeremyn on 11 Jun 2017

👍1

@jeremyn Personally, I prefer the whitelist formated like

hash, auto-pass rules, auto-pass fetch, filename
<hash1>, Y, Y, Example.com.xml
<hash2>, Y, N, Example.net-expired.xml
<hash3>, N, Y, Example.org-wildcards.xml

for better readability and easier manual editing (though not strictly necessary). What do you think?

cschanaj on 11 Jun 2017

How about this:

filename,auto-pass rules test,auto-pass fetch test,hash
Example.com.xml,1,1,<hash1>
Example.net-expired.xml,1,0,<hash2>
Example.org-wildcards.xml,0,1,<hash3>

Reasons:

The file name should go first, it's the most important piece of information. Also, having it first makes the file easier to sort with simple tools.

The hash should go last, it's the least important piece of information.

I want to enforce the flag fields as booleans as much as possible. "Y" and "N" are too "string-like" for me. On the other hand leaving it blank raises the question, is it deliberately false or did someone forget the value? Enforcing a "0" or "1" is less ambiguous.

jeremyn on 11 Jun 2017

I prefer @cschanaj order because it doesn't mess up with alignment (hashes are always the same length, as opposed to file names). So the "raw" csv file (as in opened in a plain text editor) will be easier to read and update.

I agree booleans are better than strings or blanks.

Bisaloo on 11 Jun 2017

@jeremyn I agree that using 0 and 1 is less ambiguous. My argument for https://github.com/EFForg/https-everywhere/issues/10227#issuecomment-307642940 is basically the same as @Bisaloo's https://github.com/EFForg/https-everywhere/issues/10227#issuecomment-307646858.

Besides, how about merging utils/duplicate-whitelist.txt and utils/downgrade-whitelist.txt into the new _whitelist_ we proposed, such that we keep only 1 whitelist?

cschanaj on 12 Jun 2017

I'm okay with the order hash,auto-pass rules test,auto-pass fetch test,filename, given @Bisaloo 's point about hashes being the same length.

jeremyn on 12 Jun 2017

👍1

This is slightly tangential but the end goal for ruleset-coverage-whitelist.txt is to be removed, isn't it? As far as I understand, it is only here for legacy purposes and I see no actual case where we would need to use it.

Bisaloo on 16 Jun 2017

I assumed that this issue implied a re-run of rules and fetch test on all entries in the ruleset-coverage-whitelist.txt, this should help to remove false positive mentioned in https://github.com/EFForg/https-everywhere/issues/10227#issuecomment-307553666. I also expected that a new ruleset-coverage-whitelist-cleanup.sh (maybe another script ruleset-coverage-whitelist-deep-cleanup.sh) to perform the tests too.

@Hainish can you clarify?

cschanaj on 16 Jun 2017

Thanks for all your input. I agree, consolidating the whitelists seems like a good idea, and I'm in favor of hash,auto-pass rules test,auto-pass fetch test,filename with explicit 0s and 1s for values. This will also involve consolidating and changing the cleanup scripts, as well.

@Bisaloo in the old schema, we wished to remove ruleset-coverage-whitelist.txt altogether, since those files contained in the list were bypassing validation just by the merit of their legacy status. The new schema is much the same, except for the fact that combining the whitelists means that we'd have to get rid of multiple categories of bypasses completely in order to achieve this goal. In an ideal world, curl would behave almost exactly like the browser for the purposes of our testing. In reality, it doesn't, so it may be impractical to completely remove the whitelist in the future. We may be able to remove a few columns from the csv, though.

Hainish on 17 Jun 2017

👍1

CAcert.xml probably needs to be whitelisted as well.

Bisaloo on 24 Jun 2017

This is a problematic ruleset example where 404 error is expected but means it cannot pass the testing if there is no whitelist: #10708 (Red_Hat.xml)

ghost on 26 Jun 2017

@Bisaloo also each platform="cacert" ruleset.

ghost on 26 Jun 2017

All rulesets such as https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/CAcert.xml that are signed by the CACert pseudo-CA should be disabled with default_off='cacert'. This will cause it to pass the fetch test. The problem isn't that it's self-signed, the problem is that it is not a valid CA in the browsers by default.

(.venv)user@mutt:~$ echo | openssl s_client -connect cacert.org:443 | openssl x509 -noout -text | head
depth=2 O = Root CA, OU = http://www.cacert.org, CN = CA Cert Signing Authority, emailAddress = [email protected]
verify error:num=19:self signed certificate in certificate chain
verify return:0
DONE
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 167975 (0x29027)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: O=CAcert Inc., OU=http://www.CAcert.org, CN=CAcert Class 3 Root
        Validity
            Not Before: Apr 21 10:44:32 2016 GMT
            Not After : Apr 21 10:44:32 2018 GMT
        Subject: C=AU, ST=NSW, L=Sydney, O=CAcert Inc., CN=www.cacert.org
(.venv)user@mutt:~$

Hainish on 26 Jun 2017

I changed the reason of CACert ruleset being disabled from "failed ruleset test" to "cacert".

ghost on 26 Jun 2017

@Hainish there is 37 ruleset with platform="cacert", are we going to do the same thing as in https://github.com/EFForg/https-everywhere/pull/10733#issuecomment-311223390 with all of them?

Now related: #9906

cschanaj on 27 Jun 2017

That's what I was thinking. Then again it may be more useful just to change the fetch tests to auto-pass platform='cacert' rules, or (better yet) for these rules to include the cacert root, which you can provide to the underlying pycurl library.

On June 26, 2017 6:15:10 PM PDT, Chan Chak Shing notifications@github.com wrote:

@Hainish there is 37 ruleset with platform="cacert", are we going to
do the same thing as in
https://github.com/EFForg/https-everywhere/pull/10733#issuecomment-311223390
with all of them?

Now related: #9906

--
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/EFForg/https-everywhere/issues/10227#issuecomment-311225069

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Hainish on 27 Jun 2017

I prefer to modify the fetch test since I want to avoid adding new default_off attribute. I will look into the code later. Thanks!

cschanaj on 27 Jun 2017

Bisaloo on 27 Jun 2017

Fixed in #10900

Hainish on 7 Jul 2017

11527

cschanaj on 30 Jul 2017

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Travis is not running

cschanaj · 3Comments

Travis CI changes

J0WI · 4Comments

Certain domain (pitt.edu) breaks when HTTPS everywhere chrome extension is active

Cull-Methi · 4Comments

HTTPS no longer works on j.mp - disable the ruleset

anonsubmitter · 5Comments

how diverse is https-everywhere

00h-i-r-a00 · 4Comments