Today’s rust-nightly-x86_64-unknown-linux-gnu.tar.gz is 125MiB in size. I did a make dist-tar-bins, which output the same tarball at only 88MiB, i.e. 70% of what we publish to s3.
I also took the liberty of testing:
I strongly propose that we either migrate to a more modern compression algorithm (xz) or at least investigate why gzip does such a bad job on the build bots.
cc @brson
Apparently our s3 distribution includes docs, which my make dist-tar-bins didn't include for some reason. That's the cause of the discrepancy between the gzip sizes on my system and what's shipped through s3.
I downloaded the tarball we ship and recompressed it for a fairer comparison:
The tarball (2015-03-03) has since increased to 138MiB in size. Recompressing that to xz results in an 82MiB .tar.xz, a 40% gain.
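For anyone who wants to reproduce the comparison, the recompression is straightforward; a minimal sketch (URL and filenames illustrative):

curl -LO https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz
gunzip -k rust-nightly-x86_64-unknown-linux-gnu.tar.gz   # -k keeps the original .tar.gz
xz -9 -k rust-nightly-x86_64-unknown-linux-gnu.tar       # recompress the same bytes with xz
ls -lh rust-nightly-x86_64-unknown-linux-gnu.tar.{gz,xz} # compare sizes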
Do we have any statistics on how many systems can decompress an xz archive? I'm mostly just curious, as I suspect we would continue to produce both, and then systems like multirust could figure out which to download based on the host system.
As I and others discussed on IRC:
On Linux systems some very common packages (with default-ish configure options) depend on liblzma5/xz (gdb and systemd come to mind as examples), so it is very likely to be available on a standard Linux system.
OS X does support xz by default in its 'tar' command (which is bsdtar - not sure exactly when support was introduced, but I think it was in 10.9) and Archive Utility (apparently newly in 10.10). This works via a library rather than the xz command line utility, which is not provided.
Are you sure it is not just shelling out to the xz command? Does 'tar xJf file' work without an xz executable in the path?
EDIT: never mind, apparently bsdtar just links and uses liblzma5 directly. This is very nice; I should consider dropping GNU tar and using bsdtar myself.
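A quick way to test exactly that question (sketch; archive name illustrative):

which xz || echo "no xz binary in PATH"
tar xJf file.tar.xz   # should still succeed if tar links liblzma directly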
Either way, this means that using xz will benefit most users on both Linux and OS X.
The 7-Zip Utility for Windows can decompress xz archives, according to Wikipedia. That's the only archiver I use on Windows. It's free and open source, but it is third-party.
However, isn't the installer the preferred way of getting Rust on Windows? That's what I use. I don't know how the installer does decompression, but xz support would probably have to be implemented for it.
In addition to 7-Zip (which invented the underlying LZMA2 compression format), XZ is also supported by all the other major tools not already mentioned:
Edit: I need to stop forgetting to double-check my memory of what download pages are offering before posting. I've trimmed out some irrelevant bits.
Strong +1 to @alexcrichton's suggestion that we simply provide both. It costs us essentially nothing to construct both artifacts; is there any serious cost to providing them both (e.g. are we worried about connection charges or storage space on our servers)?
An update since the metadata reform: the gzipped tarball is now 107MB and the xzipped one is 78MB. Still an easy 30MB win.
EDIT: docless xz: 75MB, docless gzip: 100MB.
Is this still an issue today?
Recompressing https://static.rust-lang.org/dist/rust-1.4.0-x86_64-unknown-linux-gnu.tar.gz from gzip to xz still takes it from 97MB to 75MB, a win of 22MB.
FYI, nowadays even Busybox supports xz
We have now switched to the full Rust solution for distribution, so switching from tar.gz to tar.xz should be easy there.
I recently had a chance to live on a data-capped tether for 2 weeks, and it hurt very hard when the new stage0 compiler landed. It took me a considerable amount of time to download the new compiler, and it put a noticeable dent into my data allowance. Both of those would have been much more bearable with xz.
Please provide .xz downloads for the source tarballs. I am a packager for Mageia Linux, and downloading the tar.gz and then uploading it to our tarballs server over my slow ADSL upstream is time-consuming. I tried recompressing the tar.gz tarball using xz -9 --extreme and the savings are significant:
shlomif[rpms]:$mageia/rust/SOURCES$ ls -l rustc-1.11.0-src.tar.gz ~/rustc-1.11.0-src.tar.xz
-rw-r--r-- 1 shlomif shlomif 17108400 Sep 8 20:56 /home/shlomif/rustc-1.11.0-src.tar.xz
-rw-r--r-- 1 shlomif shlomif 26126471 Aug 16 13:39 rustc-1.11.0-src.tar.gz
That's a 34% saving.
I'd like to make this happen but it's quite complex to do. I think the basic way to do it is to recompress all the tarballs in one batch job at the same time during final manifest generation. It would be great to do it in a way that isn't conflated with other parts of the build infrastructure, so that it can be developed and tested independently of buildbot. Unfortunately the way the entire set of artifacts is put together is quite complex. I tried to write up a design that somebody else could implement but got pretty discouraged.
But some requirements, I think:
I do want to redesign the entire release build process, and it might be easier to make this happen as part of a redesign.
Compressing just the source tarball can probably be done relatively easily by modifying the build system with a --xz-source-tarball flag, which we could enable on the Linux bots.
If we're counting calories, stripping .rustc from rustc/lib/*.so (but not from rustlib/!) saves about 9MB from the current unpacked nightly dist, and that saving even translates directly to the compressed forms, since .rustc was already compressed.
There are also a few spots of debuginfo, but stripping that saves less.
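A minimal sketch of that stripping step, assuming binutils and the layout described above (worth verifying against a real dist before relying on it):

for so in rustc/lib/*.so; do
    objcopy --remove-section=.rustc "$so"   # drop the already-compressed metadata section
done
strip --strip-debug rustc/lib/*.so          # the smaller debuginfo win mentioned above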
Ok, I think nowadays we're ready to do this! Specifically, I believe the steps would look like:
1. Modify dist.rs, where all distribution-related code lives.
2. Produce the xz tarball from the gz one, e.g. cat $tarball | gunzip | xz > $tarball.xz.
3. Add xz-url = '...' keys next to the existing url keys (along with a hash). @brson or I should be contacted about this.

I did some experiments with the compression by also tuning the order in which files are included in the archive, and it looks like we might get further improvements. This is basically achieved by storing duplicate files one after the other, so that the stream compressor can encode them more efficiently.
The results and the Makefile I used for my experiments are available here.
cat $tarball | gunzip | xz > $tarball.xz reduces the latest tarball from 135 MB to 95 / 90 MB (without / with the -9 flag).
Changing the order in which tar stores the files makes it possible to compress it down to 79 / 58 MB (without / with the -9 flag).
I would like to work on this issue, but I might not be able to do so until the end of the month. If somebody else starts implementing the new system, please ping here :)
@ranma42 holy cow I had no idea we could get such a drastic improvement by reordering files, that's awesome!
FWIW the tarball creation itself is likely buried in rust-installer, which may be difficult to modify, but not impossible! Eventually I'd love to completely rewrite the rust-installer repo in Rust itself (e.g. as src/tools/rust-installer), but that doesn't necessarily have to happen before this.
I've tried to get better results than @ranma42 with both brotli and zstd at maximum compression settings (22 for zstd, 11 for brotli), but they were far behind xz at its default setting (for rust-nightly-x86_64-unknown-linux-gnu.tar.gz):
460M files.gnu.tar
125M files.gnu.tar.bz2
96M files.gnu.tar.xz
92M files.gnu.tar.xz9
103M files.gnu.tar.zst
460M rev-sorted-files.gnu.tar
88M rev-sorted-files.gnu.tar.bro
123M rev-sorted-files.gnu.tar.bz2
79M rev-sorted-files.gnu.tar.xz
58M rev-sorted-files.gnu.tar.xz9
85M rev-sorted-files.gnu.tar.zst
For the rust-src-nightly.tar.gz they were behind as well, just not as far:
181M files.gnu.tar
25M files.gnu.tar.bz2
31M files.gnu.tar.gz
22M files.gnu.tar.xz
21M files.gnu.tar.xz9
23M rev-sorted-files.gnu.tar.bro
32M rev-sorted-files.gnu.tar.gz
22M rev-sorted-files.gnu.tar.xz
22M rev-sorted-files.gnu.tar.xz9
23M rev-sorted-files.gnu.tar.zst
Also note that brotli takes far longer to compress than any other algorithm. In the second listing, you can see that reverse sorting in fact has a tiny negative impact for source code. I too would suggest going with xz at level 9 with reverse ordering, as it's a) far more widespread than zstd/brotli and b) the possibly better decompression speed of zstd/brotli is an unimportant advantage.
I wonder if we could improve the sorting by either using a similarity hash and ordering by hash value, or even using a distance metric and Floyd-Warshall to find the cheapest path through all files.
Then again, that's probably overdoing it.
@llogiq the "reverse name sorting" trick is a cheap approximation of that, because it clusters files with the same extension. In the case of rust object files, it is effectively also sorting them by their hash, ensuring that identical libraries are adjacent in the list.
If we want to squeeze the tarball further, I would suggest investigating the biggest files in the release:
- librustc_llvm.so is 62 MB, but I do not expect it to squeeze easily
- the cargo binary is 37 MB and it looks like it statically links all of its dependencies; maybe it would be possible to dynamically link some of them to reduce its size? (strip can squeeze it down to 9 MB, but this would make the debugging experience worse)

Perhaps we should set up stripped binaries after all, as the savings are substantial. It may allow some people to use Rust who currently cannot afford it.
The difference between fully stripped and not stripped when decompressed is 120MB. Difference when compressed (for sorted files) is 8MB.
Being bold, we could also think of every single function as one "file", reorder those using similarity hashes (or Floyd-Warshall, although I guess the number would be too high for pure Floyd-Warshall), and provide a self-extracting archive or something. That would solve the "cargo links everything statically" problem.
Just in case, I have tested other xz options in search of more size reduction and less decompression memory [1]. (The archive used is the 2017-03-15 nightly, which should be the same as @ranma42's.)
-rw-rw-r-- 1 lifthrasiir 81628064 Mar 26 22:41 rev-sorted-files.bsd.tar.xz
-rw-rw-r-- 1 lifthrasiir 81080832 Mar 26 22:47 rev-sorted-files.bsd.tar.xz6e
-rw-rw-r-- 1 lifthrasiir 74531756 Mar 26 22:50 rev-sorted-files.bsd.tar.xz7
-rw-rw-r-- 1 lifthrasiir 73955700 Mar 26 22:56 rev-sorted-files.bsd.tar.xz7e
-rw-rw-r-- 1 lifthrasiir 74053348 Mar 26 23:00 rev-sorted-files.bsd.tar.xz8
-rw-rw-r-- 1 lifthrasiir 73494812 Mar 26 23:06 rev-sorted-files.bsd.tar.xz8e
-rw-rw-r-- 1 lifthrasiir 60213532 Mar 26 23:10 rev-sorted-files.bsd.tar.xz9
-rw-rw-r-- 1 lifthrasiir 59672056 Mar 26 23:17 rev-sorted-files.bsd.tar.xz9e
The *.xz file corresponds to the default option (-6). I've also tested -6e through -9e, which try to compress more at the expense of compression speed (about 2x slower in my testing); they do have some impact but not much, so I guess -9 is the best option overall, as long as users have enough memory (see the footnote below). Note that the decompression speed differences were insignificant, except that -9/-9e were slightly faster than the others (probably due to less I/O overhead).
[1] All dictionary compression schemes require a certain amount of previously decoded data to be kept in memory. In gzip this is not significant (~64K), but for the costlier options of xz it can be: -9 requires 65 megabytes of memory, for example.
I followed the first steps suggested by @alexcrichton without encountering any significant issue.
@brson, should we start designing the new format of the manifest? What is the best place for doing that?
@ranma42 oh, @brson and I discussed this a long time ago actually, and we were both on board with just adding a new key to the manifest. Right now all artifacts have url = "..." which points to a *.tar.gz, and we'd just add a new key, xz-url = "...", which points to a *.tar.xz (or whatever format we select).
oh, and similar to hash = "..." we'd have xz-hash = "..." for each artifact
@alexcrichton a dash in the field name will prevent Target from being RustcEncodable and will instead require explicit serialization/deserialization, as mentioned here. Is that ok?
Given the proposed approach, I assume that there are no plans to add other formats in the future. Another option might be to add a sources array containing structs that have format, hash and url keys (or a BTreeMap in which the format is the key?). This would make it trivial to add/remove formats from the manifest without changing its schema. The tools (is there any other tool besides rustup consuming the manifest?) could just ignore the formats that are not supported or are disabled for some reason.
Oh, the serde version of toml takes care of that just fine (via serde attributes), and the old rustc-serialize version actually handled it as well (translating a Rust field named foo_bar into reads from the TOML key foo-bar).
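For instance, with serde attributes it would look something like this (a sketch; the exact derive machinery depends on the toml/serde versions in use):

use serde::Deserialize;

#[derive(Deserialize)]
struct Target {
    url: String,
    hash: String,
    // the TOML keys contain dashes, which are not valid Rust identifiers,
    // so serde renames bridge the two spellings
    #[serde(rename = "xz-url")]
    xz_url: Option<String>,
    #[serde(rename = "xz-hash")]
    xz_hash: Option<String>,
}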
I think we're definitely open to new formats in the future, we'd just add more keys. We could support a generic container (like a list) for the formats but it didn't really seem to give much benefit over just listing everything manually. Downloaders will basically always look for an exact one and otherwise fall back to tarballs.
I implemented the changes required to get the xz url and hash here, but I keep getting _ in the field names in the manifests. What should I do to get the fields written as xz-url? Should I use a different version of rustc-serialize?
Oh, ideally we'd switch to serde, but I wouldn't really worry about it; it's not that important. Due to bootstrapping, using serde in the compiler is difficult right now, unfortunately.
Then I will leave the manifest fields as xz_url and xz_hash for the time being and start updating rustup :)
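For concreteness, a target entry in the generated manifest would then look roughly like this (section name and hash values illustrative):

[pkg.rust.target.x86_64-unknown-linux-gnu]
url = "https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz"
hash = "..."
xz_url = "https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.xz"
xz_hash = "..."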
I have opened https://github.com/rust-lang/rust-installer/pull/57 and I was waiting for that to be merged before submitting the PR against rust, to update the submodule to the merge commit. Should I just open the new PR and then update it as needed?
The next version of rustup should include https://github.com/rust-lang-nursery/rustup.rs/pull/1100, hence it should use XZ by default (if available).
And rustup has now shipped!