Https-everywhere: [Discussion] Trivial domains list/radix trie

Created on 19 Sep 2017 · 11 Comments · Source: EFForg/https-everywhere

ping @Hainish @cowlicks @RReverser @cschanaj @J0WI @gloomy-ghost.

Trivial rulesets will no longer be stored as XML files; instead, they will be stored as a domain list.

A ruleset is considered trivial if all of the following are true:

Ruleset doesn't have platform or default_off attributes.
Ruleset doesn't have any exclusions.
Ruleset has only one rule, which redirects all HTTP requests to HTTPS.

A trivial ruleset's XML file is deleted, and all its targets are added to the domain list.
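The three criteria above can be sketched as a small check. This is only an illustration, not the project's actual tooling: the function names `is_trivial` and `targets` are hypothetical, and the sketch assumes the usual ruleset layout (a `<ruleset>` root with `<target>`, `<rule>`, and `<exclusion>` children) and that the "redirect everything" rule is written as `from="^http:" to="https:"`.

```python
import xml.etree.ElementTree as ET


def is_trivial(ruleset_xml):
    """Return True if the ruleset meets the three criteria listed above."""
    root = ET.fromstring(ruleset_xml)
    # 1. No platform or default_off attributes on the ruleset.
    if "platform" in root.attrib or "default_off" in root.attrib:
        return False
    # 2. No exclusions.
    if root.find("exclusion") is not None:
        return False
    # 3. Exactly one rule, and it is the plain HTTP -> HTTPS rewrite
    #    (assumed here to be from="^http:" to="https:").
    rules = root.findall("rule")
    if len(rules) != 1:
        return False
    return rules[0].get("from") == "^http:" and rules[0].get("to") == "https:"


def targets(ruleset_xml):
    """List the target hosts that would move into the domain list."""
    root = ET.fromstring(ruleset_xml)
    return [t.get("host") for t in root.findall("target")]
```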

The domain list is stored as a CSV file in the repository. Empty lines and lines starting with # are ignored. Each domain is followed by a comma and then true or false, depending on whether its cookies must be secured.

The domain list starts with the header line host,secureCookies.
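A minimal sketch of reading such a file, assuming only what is described above (a `host,secureCookies` header, `true`/`false` values, blank lines and `#` lines skipped). The function name `parse_domain_list` is made up for illustration.

```python
import csv
import io


def parse_domain_list(text):
    """Map each host to its secureCookies flag, skipping blanks and # comments."""
    kept = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    # The first remaining line is the header: host,secureCookies
    reader = csv.DictReader(io.StringIO("\n".join(kept)))
    return {
        row["host"].strip(): row["secureCookies"].strip().lower() == "true"
        for row in reader
    }
```

For example, a file containing the header, a comment, and the rows `example.com,true` and `www.example.com,false` would yield a two-entry dictionary keyed by host.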

The domain list is packaged in form of a radix trie. Radix trie format is:

{
  "com": {
    "example": {
      "^": true,
      "www": {
        "^": true
      }
    }
  }
}

^ signifies termination of a branch and is true if cookies need to be secured and false otherwise.
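To make the format concrete, here is a rough sketch of building and querying such a structure. It follows the JSON shown above, where each level is keyed by one DNS label read from right to left; a stricter radix trie would also merge single-child chains, which this sketch does not attempt. `build_trie` and `lookup` are hypothetical names.

```python
def build_trie(domains):
    """Build the nested-dict trie from a {host: secureCookies} mapping."""
    trie = {}
    for host, secure_cookies in domains.items():
        node = trie
        # Walk labels right to left: "www.example.com" -> com, example, www
        for label in reversed(host.split(".")):
            node = node.setdefault(label, {})
        node["^"] = secure_cookies  # "^" terminates the branch
    return trie


def lookup(trie, host):
    """Return the secureCookies flag if host is listed, else None."""
    node = trie
    for label in reversed(host.split(".")):
        node = node.get(label)
        if not isinstance(node, dict):
            return None
    return node.get("^")
```

A lookup costs one dictionary access per label in the host name, regardless of how many domains the trie holds.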

While a JSON radix trie is larger than a plain domain list, it compresses to a smaller size than the plain domain list.
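One way to check the size claim against a given domain list is to gzip both representations and compare. The sketch below only measures; it does not assume which side wins, since that depends on the actual list. `compressed_sizes` is a hypothetical helper, and the plain list is serialized as `host,flag` lines as described earlier.

```python
import gzip
import json


def compressed_sizes(domains):
    """Return (plain_list_bytes, trie_json_bytes) after gzip compression."""
    plain = "\n".join(
        f"{host},{str(flag).lower()}" for host, flag in sorted(domains.items())
    )
    # Build the label-per-level trie in the format shown above.
    trie = {}
    for host, flag in domains.items():
        node = trie
        for label in reversed(host.split(".")):
            node = node.setdefault(label, {})
        node["^"] = flag
    trie_json = json.dumps(trie, separators=(",", ":"), sort_keys=True)
    return (
        len(gzip.compress(plain.encode("utf-8"))),
        len(gzip.compress(trie_json.encode("utf-8"))),
    )
```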

ghost

Most helpful comment

I'm not arguing that we should keep trivial XML rulesets forever. What I am saying is that I see possible confusion and edge cases if we have two different ways of adding rules. It's worth creating the documentation first to make sure the end result is going to be easier for users than what we have now.

As just one example, imagine a domain with nine subdomains that are simple and one subdomain that is not. Some people are going to put all ten into an XML file. Other people are going to put the nine easy ones into the CSV file (or whatever) and the tenth into an XML file. Yet other people are going to just add the nine easy ones and ignore the tenth. There may be comments in the XML file that describe problems in the CSV file. People may create an XML file that is entirely a big comment for stuff in the CSV file. There need to be rules for all of this, ideally enforced in the code.

Understand that gray area in the documentation is very tedious for a maintainer. It means everyone comes up with their own way of wanting things done and then we have endless arguments. Having a process that is unambiguous and well documented is preferable to a process that seems more elegant but requires constant oversight.

jeremyn on 22 Sep 2017

👍2

All 11 comments

Currently about 6129 rulesets are trivial. Deleting them could save 3.58 MB. The resulting gzipped trie is 252 KB. A radix trie is also much more efficient to search.

ghost on 19 Sep 2017

I've been playing with combining trivial rulesets too, but it breaks the "disable particular rulesets from extension menu" functionality, which I imagine is quite important for some users, or when rewrites are simply incorrect and you need a quick fix.

RReverser on 20 Sep 2017

@RReverser A better way to do that is to allow the user to whitelist any URL from the rewriting (but probably not from HTTP blocking, we need a separate whitelist for blocking HTTP).

ghost on 20 Sep 2017

Could you not generate a ruleset name and checkbox based on starting at the matching host, and searching up the tree until you hit a public suffix, i.e. if you're at some.host.example.com, the extension popup would display a checkbox for example.com?
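A rough sketch of that idea: walk up from the matching host until the parent is a public suffix, and use the name at that point for the checkbox. The suffix set here is a tiny hypothetical stand-in for the real Public Suffix List (which also has wildcard and exception rules that this ignores), and `pseudo_ruleset_name` is an invented name.

```python
# Hypothetical stand-in for the Public Suffix List; illustration only.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def pseudo_ruleset_name(host, suffixes=PUBLIC_SUFFIXES):
    """Return the name to show in the popup for host, or None if unknown."""
    labels = host.split(".")
    # Drop leftmost labels one at a time until what remains sits
    # directly under a public suffix.
    for i in range(len(labels)):
        parent = ".".join(labels[i + 1:])
        if parent in suffixes:
            return ".".join(labels[i:])
    return None
```

With this suffix set, `some.host.example.com` maps to `example.com`, matching the example in the comment.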

bardiharborow on 20 Sep 2017

@bardiharborow ~~These "pseudo-rulesets" would negate all memory savings.~~

ghost on 20 Sep 2017

@koops76 not if they are only generated when a currently open tab is using the affected targets (or maybe even only when the popup is opened), and garbage collected after. If the user disables any rulesets, then that is stored as a delta from the primary radix trie.
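One possible reading of "stored as a delta" is a small set of user-disabled names kept beside the read-only packaged trie and consulted before each lookup. The sketch below assumes exactly that; the thread does not specify the storage format, and `should_rewrite` and the `disabled` set are hypothetical.

```python
def should_rewrite(host, trie, disabled):
    """Decide whether host is rewritten, honoring user-disabled names.

    trie: the packaged trie, never modified at runtime.
    disabled: set of names the user turned off; an entry is assumed to
    cover that name and all of its subdomains.
    """
    labels = host.split(".")
    # Check the delta first: the host itself and every parent name.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in disabled:
            return False
    # Fall back to the primary trie.
    node = trie
    for label in reversed(labels):
        node = node.get(label)
        if not isinstance(node, dict):
            return False
    return "^" in node
```

Keeping the delta separate means the packaged trie can stay shared and immutable, and only the (usually small) set of disabled names costs extra memory.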

bardiharborow on 20 Sep 2017

There are two issues being discussed here, right? One is how trivial rulesets are represented in the code that's actually run by the browser (trivial-domains-trie.json), and the other is how trivial rulesets should be represented in the source code and manipulated by contributors (trivial-domains.csv).

I have no concerns at all with the first item, simplifying the internal representation of the rules in the running code. If moving simple rules into something that looks like Chrome's HSTS preload transport_security_state_static.json improves performance and reduces the file size of the distributed extension, awesome.

How the rulesets are represented in the source code is another matter. Efficiency there is irrelevant since they can be manipulated into any other format at build time. Instead we need to think of the XML files as the UI for working with HTTPS Everywhere's rulesets. So, the only relevant question is whether using trivial-domains.csv for some domains is an improved UI for contributors.

After working with a lot of these ruleset files, I think the existing XML file approach is a terrible UI. However, it's the UI that's been used for years and there is a lot of documentation and knowledge and norms that have been established around it. trivial-domains.csv essentially adds a second way of doing things while keeping the first. New contributors will have to figure out which method to use, and maintainers will have to spend time educating contributors on both systems.

I think the way to go here is to separate work on the two issues of internal representation/trivial-domains-trie.json vs contributor UI/trivial-domains.csv. They don't need to be tied together. The UI portion should also be user-experience driven: at the very least, user documentation should be created and discussed before any technical work is done. I don't mean anything fancy here; just a PR to CONTRIBUTING.md describing the new system would be enough to get started.

jeremyn on 22 Sep 2017

👍2

@jeremyn Ours is even more efficient, a radix trie instead of a simple list.

ghost on 22 Sep 2017

I disagree that we should keep XML for trivial rulesets.

ghost on 22 Sep 2017

We can leave XML for current rulesets though.

ghost on 22 Sep 2017

jeremyn on 22 Sep 2017

👍2
