ping @Hainish @cowlicks @RReverser @cschanaj @J0WI @gloomy-ghost.
Trivial rulesets will no longer be stored as XML files; instead, they will be stored as a domain list.
A ruleset is considered trivial if all of the following are true:
platform or default_off attributes.
A trivial ruleset's XML file is deleted, and all of its targets are added to the domain list.
The domain list is stored as a CSV file in the repository. Empty lines and lines starting with # are ignored. Each domain is followed by a comma and true or false, depending on whether its cookies must be secured.
The domain list starts with the header line host,secureCookies.
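To make the CSV rules above concrete, here is a minimal sketch of a parser. The function name `parse_domain_list` and the sample data are made up for illustration and are not part of the actual build scripts; it only assumes the format described above (header line, `#` comments, empty lines skipped, `true`/`false` in the second column).

```python
import csv

def parse_domain_list(text):
    """Parse the trivial-domain CSV text into a {host: secure_cookies} dict.

    Hypothetical helper: assumes the format described in this proposal.
    """
    # Drop empty lines and comment lines before handing the rest to csv.
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    reader = csv.reader(lines)
    header = next(reader)
    if header != ["host", "secureCookies"]:
        raise ValueError(f"unexpected header: {header!r}")
    domains = {}
    for host, secure in reader:
        if secure not in ("true", "false"):
            raise ValueError(f"bad secureCookies value for {host}: {secure!r}")
        domains[host] = secure == "true"
    return domains

# Made-up sample input, not taken from the real list.
sample = """host,secureCookies
# this line is a comment
example.com,true

www.example.com,false
"""
domains = parse_domain_list(sample)
```

Rejecting anything other than `true` or `false` is a choice made here for the sketch, so that a typo in the list fails loudly instead of silently disabling cookie securing.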
The domain list is packaged in the form of a radix trie. The radix trie format is:
{
  "com": {
    "example": {
      "^": true,
      "www": {
        "^": true
      }
    }
  }
}
^ signifies termination of a branch and is true if cookies need to be secured and false otherwise.
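As a rough sketch of how such a structure could be built and searched: the functions `build_trie` and `lookup` below are hypothetical names, and the code assumes one trie level per DNS label, stored right to left, as in the example above. It does not merge single-child chains, so strictly speaking it is a plain label trie rather than a compressed radix trie.

```python
def build_trie(domains):
    """Build the nested-dict trie from a {host: secure_cookies} dict."""
    root = {}
    for host, secure in domains.items():
        node = root
        # Walk labels right to left: "www.example.com" -> com, example, www.
        for label in reversed(host.split(".")):
            node = node.setdefault(label, {})
        # "^" marks the end of a listed host and carries the cookie flag.
        node["^"] = secure
    return root

def lookup(trie, host):
    """Return the secureCookies flag for host, or None if it is not listed."""
    node = trie
    for label in reversed(host.split(".")):
        node = node.get(label)
        if not isinstance(node, dict):
            return None
    return node.get("^")

trie = build_trie({"example.com": True, "www.example.com": True})
```

A lookup touches at most one dict per label of the host, so its cost depends on the depth of the hostname rather than on the number of listed domains.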
While a JSON radix trie is larger than a plain domain list, it compresses to a smaller size than the plain domain list.
Currently about 6129 rulesets are trivial. Deleting them could save 3.58 MB. The resulting gzipped trie is 252 KB. A radix trie is also much more efficient to search.
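For anyone who wants to check the size claim against their own copy of the list, a comparison could look roughly like this. `gzipped_sizes` is a hypothetical helper; it assumes compact JSON serialization and default gzip settings, which may differ from how the extension is actually packaged.

```python
import gzip
import json

def gzipped_sizes(domains):
    """Return (csv_bytes, trie_bytes): gzipped sizes of both representations.

    Hypothetical helper; domains is a {host: secure_cookies} dict.
    """
    # Plain CSV form, one host per line, with the header line.
    csv_text = "host,secureCookies\n" + "".join(
        f"{host},{'true' if secure else 'false'}\n"
        for host, secure in sorted(domains.items())
    )
    # Trie form: one nested dict level per DNS label, right to left.
    trie = {}
    for host, secure in domains.items():
        node = trie
        for label in reversed(host.split(".")):
            node = node.setdefault(label, {})
        node["^"] = secure
    trie_text = json.dumps(trie, separators=(",", ":"))
    return (
        len(gzip.compress(csv_text.encode("utf-8"))),
        len(gzip.compress(trie_text.encode("utf-8"))),
    )

csv_size, trie_size = gzipped_sizes({"example.com": True, "www.example.com": True})
```

With only a handful of hosts the numbers say little; the comparison is only meaningful on the full list.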
I've been playing with combining trivial rulesets too but it breaks "disable particular rulesets from extension menu" functionality which I imagine is quite important for some users or when rewrites are simply incorrect and you need a quick fix.
@RReverser A better way to do that is to allow the user to whitelist any URL from rewriting (but probably not from HTTP blocking; we need a separate whitelist for blocking HTTP).
Could you not generate a ruleset name and checkbox based on starting at the matching host, and searching up the tree until you hit a public suffix, i.e. if you're at some.host.example.com, the extension popup would display a checkbox for example.com?
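A rough sketch of that idea, walking up from the matching host until the remainder is a public suffix. The `PUBLIC_SUFFIXES` set is a tiny made-up stand-in for the real Public Suffix List, `registrable_domain` is a hypothetical name, and the wildcard and exception rules of the real list are ignored.

```python
# Hypothetical stand-in for the real Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}

def registrable_domain(host, suffixes=PUBLIC_SUFFIXES):
    """Return the name one label below the public suffix, or None."""
    labels = host.split(".")
    # Try the longest candidate suffix first, then progressively shorter ones.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the host itself is a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known public suffix matched

checkbox_name = registrable_domain("some.host.example.com")
```

In this sketch the popup would show a single checkbox labelled with `checkbox_name` for every host under the same registrable domain.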
@bardiharborow These "pseudo-rulesets" would negate all memory savings.
@koops76 not if they are only generated when a currently open tab is using the affected targets (or maybe even only when the popup is opened), and garbage collected after. If the user disables any rulesets, then that is stored as a delta from the primary radix trie.
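One way to picture the delta: the trie stays read-only, and user-disabled domains live in a small separate set that is consulted first. The name `should_upgrade` and the rule that disabling a domain also disables its subdomains are assumptions made for this sketch, not something decided in this thread.

```python
def should_upgrade(trie, disabled, host):
    """Return True if host is listed in the trie and not disabled by the user.

    Hypothetical helper: `disabled` is a set of domain names the user has
    turned off, stored separately from the packaged trie.
    """
    labels = host.split(".")
    # Assumption: disabling a domain also disables all of its subdomains.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in disabled:
            return False
    # Otherwise fall back to the unmodified trie, walking labels right to left.
    node = trie
    for label in reversed(labels):
        node = node.get(label)
        if not isinstance(node, dict):
            return False
    return "^" in node

trie = {"com": {"example": {"^": True, "www": {"^": True}}}}
enabled_by_default = should_upgrade(trie, set(), "www.example.com")
after_user_disable = should_upgrade(trie, {"example.com"}, "www.example.com")
```

Because only the disabled names are stored, the memory cost grows with the number of user overrides, not with the size of the domain list.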
There are two issues being discussed here, right? One is how trivial rulesets are represented in the code that's actually run by the browser (trivial-domains-trie.json), and the other is how trivial rulesets should be represented in the source code and manipulated by contributors (trivial-domains.csv).
I have no concerns at all with the first item, simplifying the internal representation of the rules in the running code. If moving simple rules into something that looks like Chrome's HSTS preload transport_security_state_static.json improves performance and reduces the file size of the distributed extension, awesome.
How the rulesets are represented in the source code is another matter. Efficiency there is irrelevant since they can be manipulated into any other format at build time. Instead we need to think of the XML files as the UI for working with HTTPS Everywhere's rulesets. So, the only relevant question is whether using trivial-domains.csv for some domains is an improved UI for contributors.
After working with a lot of these ruleset files, I think the existing XML file approach is a terrible UI. However, it's the UI that's been used for years and there is a lot of documentation and knowledge and norms that have been established around it. trivial-domains.csv essentially adds a second way of doing things while keeping the first. New contributors will have to figure out which method to use, and maintainers will have to spend time educating contributors on both systems.
I think the way to go here is to separate work on the two issues of internal representation/trivial-domains-trie.json vs contributor UI/trivial-domains.csv. They don't need to be tied together. The UI portion should also be user-experience driven, at the very least user documentation should be created and discussed before any technical work is done. I don't mean anything fancy here, like, just a PR to CONTRIBUTING.md describing the new system would be enough to get started.
@jeremyn Ours is even more efficient, a radix trie instead of a simple list.
I disagree that we should keep XML for trivial rulesets.
We can leave XML for current rulesets though.
I'm not arguing that we should keep trivial XML rulesets forever. What I am saying is that I see possible confusion and edge cases if we have two different ways of adding rules. It's worth creating the documentation first to make sure the end result is going to be easier for users than what we have now.
As just one example, imagine a domain with nine subdomains that are simple and one subdomain that is not. Some people are going to put all ten into an XML file. Other people are going to put the nine easy ones into the CSV file (or whatever) and the tenth into an XML file. Yet other people are going to just add the nine easy ones and ignore the tenth. There may be comments in the XML file that describe problems in the CSV file. People may create an XML file that is entirely a big comment for stuff in the CSV file. There need to be rules for all of this, ideally enforced in the code.
Understand that gray area in the documentation is very tedious for a maintainer. It means everyone comes up with their own way of wanting things done and then we have endless arguments. Having a process that is unambiguous and well documented is preferable to a process that seems more elegant but requires constant oversight.