Ublock: [Performance] Mind the potential negative consequences of String.slice()

Created on 9 May 2017  ·  4 Comments  ·  Source: gorhill/uBlock

Consider those two strings:

  1. quebec
  2. québec

Those two strings have the same number of characters; however, the first form will occupy half the memory of the second form, because quebec is made only of ASCII characters, while québec has at least one Unicode (non-ASCII) character.

This is currently how JavaScript engines in modern browsers store strings internally: if a string has at least one character outside the ASCII realm, the engine will use a 2-byte-per-character string, otherwise it will use a 1-byte-per-character string. (See "Slimmer and faster JavaScript strings in Firefox").
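As a rough illustration of the rule above, here is a minimal sketch that estimates the character storage of a string under the model described in this issue. The names `isAsciiOnly` and `estimateCharBytes` are made up for this example, and the numbers ignore per-string headers; the exact one-byte cutoff is an engine detail and may differ from plain ASCII.

```javascript
// Model from this issue: 1 byte per character if the string is
// ASCII-only, 2 bytes per character otherwise. This is an estimate,
// not a measurement of what a given engine actually allocates.
function isAsciiOnly(s) {
    for (let i = 0; i < s.length; i++) {
        if (s.charCodeAt(i) > 0x7F) { return false; }
    }
    return true;
}

function estimateCharBytes(s) {
    return s.length * (isAsciiOnly(s) ? 1 : 2);
}

console.log(estimateCharBytes('quebec'));      // 6
console.log(estimateCharBytes('qu\u00E9bec')); // 12
```

Note that a single non-ASCII character anywhere in the string is enough to double the estimate for the whole string, which is why a few stray characters matter in a multi-megabyte compiled list.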

When uBO compiles a filter list, all the hostnames are converted to punycode, so this eliminates a lot of Unicode characters from the resulting compiled list.
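To show what the punycode conversion does to a hostname, here is a small sketch. It uses the standard `URL` parser as a stand-in, which is an assumption made for this example and not necessarily how uBO performs the conversion; `toAsciiHostname` is a hypothetical helper name.

```javascript
// Assumption: the WHATWG URL parser stands in for uBO's own
// hostname-to-punycode conversion.
function toAsciiHostname(hostname) {
    return new URL('http://' + hostname + '/').hostname;
}

const converted = toAsciiHostname('qu\u00E9bec.example');

// Prints an ASCII-only hostname whose first label starts with "xn--".
console.log(converted);
```

Hostnames that are already ASCII pass through unchanged, so the conversion only costs something for internationalized names.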

However, there may still be Unicode characters in other parts of some filters which can't be easily normalized to ASCII. For example, found in EasyList at time of writing:

  • flashgot.net###head a[target="_blаnk"]
  • flashgot.net##.content a[rel="nofollow"][target="_blаnk"]
  • noscript.net##a[target="_blаnk"][href$="?MT"]

These are not normalized to ASCII by uBO because the Unicode characters are found in the CSS selector part of the filters (the а in _blаnk is actually the Cyrillic letter а, not the Latin a). CSS selectors _could_ be normalized to ASCII by uBO, however this would not work for procedural cosmetic filters.
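One way a plain CSS selector could be normalized to ASCII is with CSS escape sequences (a backslash, the code point in hexadecimal, then a terminating space). The sketch below is only an illustration of that idea; `escapeNonAsciiForCss` is a hypothetical helper and not the change that went into uBO, and as noted above it would not be suitable for procedural cosmetic filters.

```javascript
// Hypothetical helper: replace every non-ASCII code point with a CSS
// escape sequence so the selector text is ASCII-only.
function escapeNonAsciiForCss(selector) {
    let out = '';
    for (const ch of selector) {            // iterates by code point
        const cp = ch.codePointAt(0);
        out += cp > 0x7F ? '\\' + cp.toString(16) + ' ' : ch;
    }
    return out;
}

const selector = 'a[target="_bl\u0430nk"]'; // \u0430 is the Cyrillic letter
console.log(escapeNonAsciiForCss(selector)); // a[target="_bl\430 nk"]
```

The trailing space is part of the escape: it stops the CSS parser from reading the following characters as more hexadecimal digits.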

Now because of these mere three instances of Unicode characters in the whole resulting compiled EasyList file, the memory footprint required to hold the string instance[1] in memory is 6,109,248 bytes (as reported by Chromium):

[Screenshot: memory report from Chromium]

With a quick (short-term) code change to ensure there is no Unicode character in the output compiled list, the memory required to hold the string instance is now roughly halved, at 3,017,232 bytes:

[Screenshot: memory report from Chromium]

The gain is of course better for larger compiled filter lists, such as Fanboy Ultimate.

Since EasyList is selected by default, ensuring that no Unicode character ends up in the compiled form of EasyList would be an easy way to further lower uBO's memory footprint.

[1] The JavaScript engine will hold that one single string instance in memory even when it's not used directly, because all the filters will hold references to substrings in that one string instance.

All 4 comments

Those cases aren't even proper uses of the _blank keyword in the HTML target attribute (the keyword is all-ASCII), so I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

> I have reported it: https://forums.lanik.us/viewtopic.php?f=62&t=36726

The problem is not EasyList, that's how the elements to hide are crafted on flashgot.net and noscript.net sites.

Fixed with 02323826950c7d99b65d93f9edb8fa018159973c.

I am reopening this to investigate a possibly higher-level solution.

Taking a larger view of the issue, the problem is that the large parent string (loaded in memory as a result of loading the compiled filter lists) stays in memory even when no longer explicitly referenced anywhere, but as a result of all the child substrings internally referencing it.

Surely ensuring that this huge parent string is made only of ASCII characters helps to halve the memory needed to hold that large string (assuming it had Unicode characters), but what if that large parent string could be completely flushed out of memory by forcing all child strings to no longer internally reference the parent string?

Benefits:

  • No longer need to worry about whether a compiled filter list holds only ASCII characters.
  • Potentially better memory saving than the previous solution -- especially in the case where a lot of duplicates are detected by uBO.

Challenge: how can uBO prevent the javascript engine from creating substrings with internal reference to the large parent compiled filter list string? (rhetorical, I experimented and found a way).
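The comment does not say which approach was found. For illustration only, here is a sketch of one commonly cited trick for getting a copy of a substring that is unlikely to keep the parent string alive: concatenate and slice again so the engine has to build a new string. `detachFromParent` is a hypothetical name, the behaviour is engine-specific and not guaranteed by the language, and the effect can only be confirmed with a heap snapshot.

```javascript
// Hypothetical helper: returns a string equal to `s`, built in a way
// that is unlikely to leave it as a view into the string `s` was
// sliced from. Nothing in the specification guarantees this.
function detachFromParent(s) {
    return (' ' + s).slice(1);
}

// Stand-in for a large compiled filter list.
const compiled = 'example.com\nexample.org\nexample.net';

// Each entry is re-created so that, ideally, `compiled` can be
// garbage-collected once nothing references it explicitly.
const hostnames = compiled.split('\n').map(detachFromParent);

console.log(hostnames.length); // 3
```

The trade-off is extra work and allocation per substring at load time, in exchange for not pinning the whole parent string in memory.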

Background information which has been enlightening:
