Type: code issue
I personally use xz -9e or 7z with the ppmd algorithm, depending on which compresses better.
Good luck implementing decompression in JS.
There is also brotli: https://github.com/foliojs/brotli.js
But I'm not sure it's worth implementing a whole library just to get smaller files.
LZMA2 is very memory- and CPU-hungry. Not an option for battery-powered resource-constrained devices.
Currently, the compression ratio (compressed size over original size) is around 19.4%. I doubt we can do much better by simply switching the algorithm.
$ cat .git/HEAD
e14027653a97a9eb683e35b674dfc724a9366c47
$ python utils/merge-rulesets.py
$ du -sh rules/default.rulesets
6.7M rules/default.rulesets
$ wget https://www.https-rulesets.org/v1/default.rulesets.1537387579.gz
$ du -sh default.rulesets.1537387579.gz
1.3M default.rulesets.1537387579.gz
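For reference, the 19.4% figure follows from the two rounded `du -sh` sizes above. A minimal Python sketch of the arithmetic (the helper name `compression_ratio` is only illustrative, not something from the repository):

```python
def compression_ratio(compressed: float, original: float) -> float:
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed / original

# Rounded sizes in MiB as reported by `du -sh` in the transcript above.
print(f"{compression_ratio(1.3, 6.7):.1f}%")  # prints "19.4%"
```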
@cschanaj xz -9e should give 1.1 MiB.
Yes. But this is a breaking change to the API at https://www.https-rulesets.org/v1/, which might affect downstream projects. If we change how decompression works within the extension, it also affects how third-party update channels have to provide their files. I am not sure this is worth it.
brotli seems to perform a little bit better than xz. Maybe we could use Content-Encoding? https://caniuse.com/#search=brotli
Maybe we could use Content-Encoding?
This issue was about the current storage format. I mean that it currently stores base64-encoded gzip in JSON; base64 is inefficient, and using JSON also adds to the flash writes. So the fast way to reduce the amount of data written to flash before #16551 is implemented is to just change the compression algorithm.
What you propose is to enable live decompression, which would require recompressing the rulesets before writing them to flash. Alternatively, we could not write them to flash at all and keep them in memory. I guess that would be optimal: to access the sites on the list a user needs an Internet connection anyway, and the list is updated regularly, so we can just download it (or a diff against a stable version bundled with the addon) on browser start and keep it in memory.
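To put a rough number on the base64 point above, here is a minimal Python sketch. The sample data is a made-up stand-in for the rulesets and `stored_sizes` is a hypothetical helper; it only illustrates that base64 emits 4 bytes for every 3 input bytes, so the stored blob is about a third larger than the gzip output.

```python
import base64
import gzip
from typing import Tuple

def stored_sizes(raw: bytes) -> Tuple[int, int]:
    """Return (gzip size, size of the base64-encoded gzip output) in bytes."""
    packed = gzip.compress(raw, compresslevel=9)
    encoded = base64.b64encode(packed)
    return len(packed), len(encoded)

# Hypothetical stand-in data; the real input would be default.rulesets.
sample = b'<rule from="^http:" to="https:"/>\n' * 10000
gz_size, b64_size = stored_sizes(sample)
print(f"gzip: {gz_size} B, base64(gzip): {b64_size} B, "
      f"overhead: {b64_size / gz_size:.2f}x")
```

Storing the gzip bytes directly (where the storage backend allows binary values) would avoid that overhead independently of which compression algorithm is used.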
Given the clarification in https://github.com/EFForg/https-everywhere/issues/16552#issuecomment-423810925, I agree that the rulesets storage can be optimized. Moreover, this can be done without affecting the APIs. Maybe this issue can be merged with #16551?
Just 50 cents about compression algos:
$ ls -go --time-style=+"" -Sr
total 23720
-rw-r--r-- 1 1046871 default.rulesets.7zip.ppmd.zip
-rw-r--r-- 1 1059897 default.rulesets.7zip.PPMD.7z
-rw-r--r-- 1 1059897 default.rulesets.1537387579.7zip.LZMA.7z
-rw-r--r-- 1 1098768 default.rulesets.1537387579.xz
-rw-r--r-- 1 1099516 default.rulesets.1537387579.7zip.xz
-rw-r--r-- 1 1099622 default.rulesets.1537387579.7zip.lzma2.7z
-rw-r--r-- 1 1099697 default.rulesets.1537387579.br
-rw-r--r-- 1 1099777 default.rulesets.1537387579.7zip.lzma.zip
-rw-r--r-- 1 1199801 default.rulesets.1537387579.7zip.bz2
-rw-r--r-- 1 1199963 default.rulesets.1537387579.7zip.bzip2.7z
-rw-r--r-- 1 1199989 default.rulesets.1537387579.7zip.bzip2.zip
-rw-r--r-- 1 1223669 default.rulesets.1537387579.7zip.deflate64.zip
-rw-r--r-- 1 1247129 default.rulesets.1537387579.7zip.gz
-rw-r--r-- 1 1247271 default.rulesets.1537387579.7zip.deflate.zip
-rw-r--r-- 1 1316565 default.rulesets.1537387579.gz
-rw-r--r-- 1 6963578 default.rulesets.1537387579
All compressed with the maximum compression available in each tool. For 7zip, the GUI was used to set the compression parameters. The sizes include file headers, which can cause small differences. Some compression libraries allow omitting some headers and keeping that information elsewhere.
As expected, ppmd is the best, since it is optimized for compressing text.
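A comparison like the one above can be partly reproduced with the Python standard library. This is only a sketch under assumptions: it covers gzip, bzip2 and xz (PPMd and brotli are not in the stdlib), the sample input is a made-up stand-in for the real file, and `compare_codecs` is a hypothetical helper. The sizes will not match the listing above, since the input and container headers differ.

```python
import bz2
import gzip
import lzma
from typing import Dict

def compare_codecs(data: bytes) -> Dict[str, int]:
    """Compress data with each stdlib codec at its strongest preset.

    Returns a mapping from a codec label to the compressed size in bytes.
    """
    return {
        "gzip -9": len(gzip.compress(data, compresslevel=9)),
        "bzip2 -9": len(bz2.compress(data, compresslevel=9)),
        "xz -9e": len(lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)),
    }

# Hypothetical stand-in data; replace with the contents of default.rulesets.
sample = b'<rule from="^http:" to="https:"/>\n' * 20000
sizes = compare_codecs(sample)
# Print smallest first, mirroring the `ls -Sr` ordering used above.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{size:>10}  {name}")
```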