uBlock: [Performance] tokenize URLs to numeric values instead of substring values

Created on 19 May 2017 · 1 comment · Source: gorhill/uBlock

Describe the issue

An idea I have wanted to explore for a while, and given the ongoing refactoring in the static filtering engine, this was a good time to put it to the test.

When a URL needs to be tested by the static filtering engine, this URL needs to be decomposed into "tokens". These tokens allow the engine to efficiently look up _potentially_ matching filters, which then need to be further tested for a real match.

For example, the following URL:

https://a1.nyt.com/assets/homepage/20170518-152243/js/foundation/views/ad-view-manager.js

When uBO receives the above URL from the webRequest API, the static filtering engine must decompose the URL into tokens for efficient subsequent filter matching:

https
a1
nyt
com
assets
homepage
20170518
152243
js
foundation
views
ad
view
manager
js
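
A minimal sketch of this kind of tokenization, assuming a simple alphanumeric regular expression in the spirit of the one in src/js/utils.js (the function name and exact pattern here are illustrative, not the actual uBO code):

```js
// Illustrative only: a simplified tokenizer, not the actual uBO implementation.
const reToken = /[0-9a-z]+/g;

function tokenizeURL(url) {
    const tokens = [];
    const s = url.toLowerCase();
    let match;
    reToken.lastIndex = 0;
    while ((match = reToken.exec(s)) !== null) {
        tokens.push(match[0]);
    }
    return tokens;
}

tokenizeURL('https://a1.nyt.com/assets/homepage/20170518-152243/js/foundation/views/ad-view-manager.js');
// -> [ 'https', 'a1', 'nyt', 'com', 'assets', 'homepage', '20170518', '152243',
//      'js', 'foundation', 'views', 'ad', 'view', 'manager', 'js' ]
```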

For each of these tokens, the static filtering engine will look up a filter (or a filter bucket, when more than one filter is stored at a given token "address").

For instance, an actual EasyList filter:

/ad-view-

When uBO parsed that filter, it extracted a token from it -- ad in the current case -- and stored the filter at the ad token address.

So when the static filtering engine looks up whether there is a filter at the ad slot, it finds that filter, and then proceeds to check whether it is an actual match.
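
A hedged sketch of that store-then-lookup flow, using a plain Map() and made-up helper names (the real engine stores compiled filter objects, not raw filter strings):

```js
// Illustrative sketch: filters are bucketed under the token extracted at parse time.
const buckets = new Map();

function addFilter(token, filter) {
    const entry = buckets.get(token);
    if (entry === undefined) {
        buckets.set(token, filter);          // single filter at this token address
    } else if (Array.isArray(entry)) {
        entry.push(filter);                  // existing bucket
    } else {
        buckets.set(token, [entry, filter]); // promote to a bucket
    }
}

// Parse time: the filter /ad-view- yields the token 'ad'.
addFilter('ad', '/ad-view-');

// Match time: the URL token 'ad' leads straight to the candidate filter(s),
// which are then tested for an actual match against the full URL.
const candidates = buckets.get('ad'); // '/ad-view-'
```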

One can actually see these token "addresses" by entering µBlock.staticNetFilteringEngine.categories.get('0') in uBO's dev console in Chromium, and expanding the result:

[screenshot: the filter buckets in a Map() object, keyed on string tokens]

As you can see above, the filters are stored in a Map() object, and keyed on string values -- the tokens.

For a while now, I have floated the idea that there is no gain in tokenizing beyond the first few characters: as the screenshot above shows, after a few characters the tokens become distinct enough that there is no gain to be had from longer tokens.

Another problem with tokens-as-strings is that, due to how modern JavaScript engines work, these tokens are more than likely slices of a larger string in memory. Each of these slices requires a memory structure to describe it, and this is even more of a concern when tokenizing on the fly every single URL which reaches the static filtering engine (all URLs reach this point unless one uses dynamic filtering in some default-deny mode).

This is the tokenizing code executed for every URL to be evaluated in the current stable version: https://github.com/gorhill/uBlock/blob/a4e20ae3ad869a18f66eeb940273a625d342578e/src/js/utils.js#L74. As can be seen, a regular expression is involved to cut out the tokens from the URL.

The idea of limiting the number of characters in a token opens the possibility of integer-based tokens rather than string-based ones. If we accept that a maximum of 8 characters is enough for token values, then it becomes possible to create integer-based tokens and use these integers as keys in our Map(), which means memory savings, and speed gains since no regular expression is needed to extract tokens.
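
A sketch of what that could look like, assuming the 8-character cap suggested above; the hash function and helper names below are purely illustrative, not necessarily what uBO would ship:

```js
// Illustrative only: hash at most the first 8 characters of a token into a
// safe integer, so the filter Map() can be keyed on numbers instead of on
// string slices.
function tokenHash(url, beg, end) {
    const len = Math.min(end - beg, 8);
    let hash = 0;
    for (let i = 0; i < len; i++) {
        hash = hash * 64 + (url.charCodeAt(beg + i) - 0x20);
    }
    return hash; // stays well below Number.MAX_SAFE_INTEGER for 8 characters
}

function isTokenChar(c) {
    return (c >= 0x30 && c <= 0x39) ||  // 0-9
           (c >= 0x61 && c <= 0x7a);    // a-z (URL assumed already lowercased)
}

// Tokenize directly to integers: no regular expression, no substring allocations.
function tokenHashesFromURL(url) {
    const hashes = [];
    const n = url.length;
    let i = 0;
    while (i < n) {
        while (i < n && isTokenChar(url.charCodeAt(i)) === false) { i += 1; }
        const beg = i;
        while (i < n && isTokenChar(url.charCodeAt(i))) { i += 1; }
        if (i !== beg) { hashes.push(tokenHash(url, beg, i)); }
    }
    return hashes;
}

// The Map() of filter buckets is then keyed on these integers:
const intBuckets = new Map();
intBuckets.set(tokenHash('ad', 0, 2), '/ad-view-');
```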

This benchmark supports that hashing the tokens into integers does improve performance: https://gorhill.github.io/obj-vs-set-vs-map/tokenize-to-str-vs-to-int.html.

Most helpful comment

After, same image as above, except that now all Map() keys are safe-integer values:

[screenshot: the same Map(), now keyed on safe-integer values]

