Today’s rust-nightly-x86_64-unknown-linux-gnu.tar.gz is 125MiB in size. I did a make dist-tar-bins, which output the same tarball at only 88MiB, i.e. 70% of what we publish to s3.
I also took the liberty of testing:
I strongly propose that we either migrate to a more modern compression algorithm (xz) or at least investigate why gzip does such a bad job on the build bots.
cc @brson
Apparently our s3 distribution includes docs, which my make dist-tar-bins didn't include for some reason. That's the cause of the discrepancy between the gzip sizes on my system and what's shipped through s3.
I downloaded the tarball we ship and recompressed it for a fairer comparison:
The tarball (2015-03-03) has since increased to 138MiB in size. Recompressing that to xz results in an 82MiB .tar.xz, a 40% gain.
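For anyone who wants to reproduce the comparison, the recompression is straightforward; a minimal sketch (URL and filenames illustrative):

curl -LO https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz
gunzip -k rust-nightly-x86_64-unknown-linux-gnu.tar.gz   # -k keeps the original .tar.gz
xz -9 -k rust-nightly-x86_64-unknown-linux-gnu.tar       # recompress the same bytes with xz
ls -lh rust-nightly-x86_64-unknown-linux-gnu.tar.{gz,xz} # compare sizes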
Do we have any statistics on how many systems can decompress an xz archive? I'm mostly just curious, as I suspect we would continue to produce both, and then systems like multirust could figure out which to download based on the host system.
As I and others discussed on IRC:
On Linux systems some very common packages (with default-ish configure options) depend on liblzma5/xz (gdb and systemd come to mind as examples), so it is very likely to be available on a standard Linux system.
OS X does support xz by default in its 'tar' command (which is bsdtar - not sure exactly when support was introduced, but I think it was in 10.9) and Archive Utility (apparently newly in 10.10). This works via a library rather than the xz command line utility, which is not provided.
Are you sure it is not just shelling out to the xz command? Does 'tar xJf file' work without an xz executable in the path?
EDIT: never mind, apparently bsdtar just links and uses liblzma5 directly. This is very nice; I should consider dropping GNU tar and using bsdtar myself.
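A quick way to test exactly that question (sketch; archive name illustrative):

which xz || echo "no xz binary in PATH"
tar xJf file.tar.xz   # should still succeed if tar links liblzma directly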
Either way, this means that using xz will benefit most users on both Linux and OS X.
The 7-Zip Utility for Windows can decompress xz archives, according to Wikipedia. That's the only archiver I use on Windows. It's free and open source, but it is third-party.
However, isn't the installer the preferred way of getting Rust on Windows? That's what I use. I don't know how the installer does decompression, but xz support would probably have to be implemented for it.
In addition to 7-Zip (which invented the underlying LZMA2 compression format), XZ is also supported by all the other major tools not already mentioned:
Edit: I need to stop forgetting to double-check my memory of what download pages are offering before posting. I've trimmed out some irrelevant bits.
Strong +1 to @alexcrichton's suggestion that we simply provide both. It costs us essentially nothing to construct both artifacts; is there any serious cost to providing them both (e.g. are we worried about connection charges or storage space on our servers)?
An update since the metadata reform: the gzipped tarball is now 107MB and the xzipped one is 78MB. Still an easy 30MB win.
EDIT: docless xz: 75MB, docless gzip: 100MB.
Is this still an issue today?
Recompressing https://static.rust-lang.org/dist/rust-1.4.0-x86_64-unknown-linux-gnu.tar.gz from gzip to xz still takes it from 97MB to 75MB, a win of 22MB.
FYI, nowadays even Busybox supports xz
We have now switched to the full Rust solution for distribution, so switching from tar.gz to tar.xz should be easy there.
I recently had a chance to live on a data-capped tether for 2 weeks, and it hurt very hard when the new stage0 compiler landed. It took me a considerable amount of time to download the new compiler, and it put a noticeable dent into my data allowance. Both of those would have been much more bearable with xz.
Please provide .xz downloads for the source tarballs. I am a packager for Mageia Linux, and downloading the tar.gz and then uploading it to our tarballs server over my slow ADSL upstream is time-consuming. I tried recompressing the tar.gz tarball using xz -9 --extreme and the savings are significant:
shlomif[rpms]:$mageia/rust/SOURCES$ ls -l rustc-1.11.0-src.tar.gz ~/rustc-1.11.0-src.tar.xz
-rw-r--r-- 1 shlomif shlomif 17108400 Sep 8 20:56 /home/shlomif/rustc-1.11.0-src.tar.xz
-rw-r--r-- 1 shlomif shlomif 26126471 Aug 16 13:39 rustc-1.11.0-src.tar.gz
That's a 34% saving.
I'd like to make this happen but it's quite complex to do. I think the basic way to do it is to recompress all the tarballs in one batch job at the same time during final manifest generation. It would be great to do it in a way that isn't conflated with other parts of the build infrastructure, so that it can be developed and tested independently of buildbot. Unfortunately the way the entire set of artifacts is put together is quite complex. I tried to write up a design that somebody else could implement but got pretty discouraged.
But some requirements, I think:
I do want to redesign the entire release build process, and it might be easier to make this happen as part of a redesign.
Compressing just the source tarball can probably be done relatively easily by modifying the build system with a --xz-source-tarball flag, which we could enable on the Linux bots.
If we're counting calories, stripping .rustc from rustc/lib/*.so (but not from rustlib/!) saves about 9MB from the current unpacked nightly dist, and that saving even translates directly to the compressed forms, since .rustc was already compressed.
There are also a few spots of debuginfo, but stripping that saves less.
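A minimal sketch of that stripping step, assuming binutils and the layout described above (worth verifying against a real dist before relying on it):

for so in rustc/lib/*.so; do
    objcopy --remove-section=.rustc "$so"   # drop the already-compressed metadata section
done
strip --strip-debug rustc/lib/*.so          # the smaller debuginfo win mentioned above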
Ok, I think nowadays we're ready to do this! Specifically, I believe the steps would look like:
1. Modify dist.rs, where all distribution-related code lives.
2. Produce the xz tarball from the gz one, e.g. cat $tarball | gunzip | xz > $tarball.xz.
3. Add xz-url = '...' keys next to the existing url keys (along with a hash). @brson or I should be contacted about this.

I did some experiments with the compression by also tuning the order in which files are included in the archive, and it looks like we might get further improvements. This is basically achieved by storing duplicate files one after the other, so that the stream compressor can encode them more efficiently.
The results and the Makefile I used for my experiments are available here.
cat $tarball | gunzip | xz > $tarball.xz reduces the latest tarball from 135 MB to 95 / 90 MB (without / with the -9 flag).
Changing the order in which tar stores the files makes it possible to compress it down to 79 / 58 MB (without / with the -9 flag).
I would like to work on this issue, but I might not be able to do so until the end of the month. If somebody else starts implementing the new system, please ping here :)
@ranma42 holy cow I had no idea we could get such a drastic improvement by reordering files, that's awesome!
FWIW the tarball creation itself is likely buried in rust-installer, which may be difficult to modify, but not impossible! Eventually I'd love to completely rewrite the rust-installer repo in Rust itself (e.g. as src/tools/rust-installer), but that doesn't necessarily have to happen before this.
I've tried to get better results than @ranma42 with both brotli and zstd at maximum compression settings (22 for zstd, 11 for brotli), but they were far behind xz at its default setting (for rust-nightly-x86_64-unknown-linux-gnu.tar.gz):
460M files.gnu.tar
125M files.gnu.tar.bz2
96M files.gnu.tar.xz
92M files.gnu.tar.xz9
103M files.gnu.tar.zst
460M rev-sorted-files.gnu.tar
88M rev-sorted-files.gnu.tar.bro
123M rev-sorted-files.gnu.tar.bz2
79M rev-sorted-files.gnu.tar.xz
58M rev-sorted-files.gnu.tar.xz9
85M rev-sorted-files.gnu.tar.zst
For the rust-src-nightly.tar.gz they were behind as well, just not as far:
181M files.gnu.tar
25M files.gnu.tar.bz2
31M files.gnu.tar.gz
22M files.gnu.tar.xz
21M files.gnu.tar.xz9
23M rev-sorted-files.gnu.tar.bro
32M rev-sorted-files.gnu.tar.gz
22M rev-sorted-files.gnu.tar.xz
22M rev-sorted-files.gnu.tar.xz9
23M rev-sorted-files.gnu.tar.zst
Also note that brotli takes far longer to compress than any other algorithm. In the second listing, you can see that reverse sorting in fact has a tiny negative impact for source code. I too would suggest going with xz at level 9 with reverse ordering, as it's a) far more widespread than zstd/brotli and b) the possibly better decompression speed of zstd/brotli is an unimportant advantage.
I wonder if we could improve the sorting by either using a similarity hash and ordering by hash value, or even using a distance metric and Floyd-Warshall to find the cheapest path through all files.
Then again, that's probably overdoing it.
@llogiq the "reverse name sorting" trick is a cheap approximation of that, because it clusters files with the same extension. In the case of rust object files, it is effectively also sorting them by their hash, ensuring that identical libraries are adjacent in the list.
If we want to squeeze the tarball further, I would suggest investigating the biggest files in the release:
- librustc_llvm.so is 62 MB, but I do not expect it to squeeze easily
- the cargo binary is 37 MB and it looks like it statically links all of its dependencies; maybe it would be possible to dynamically link some of them to reduce its size? (strip can squeeze it down to 9 MB, but this would make the debugging experience worse)

Perhaps we should set up stripped binaries after all, as the savings are substantial. It may allow some people to use Rust who currently cannot afford it.
The difference between fully stripped and not stripped when decompressed is 120MB. Difference when compressed (for sorted files) is 8MB.
Being bold, we could also think of every single function as one "file", reorder those using similarity hashes (or Floyd-Warshall, although I guess the number would be too high for pure Floyd-Warshall), and provide a self-extracting archive or something. That would solve the "cargo links everything statically" problem.
Just in case, I have tested other xz options in search of more size reduction and less decompression memory [1]. (The archive used is the 2017-03-15 nightly, which should be the same as @ranma42's.)
-rw-rw-r-- 1 lifthrasiir 81628064 Mar 26 22:41 rev-sorted-files.bsd.tar.xz
-rw-rw-r-- 1 lifthrasiir 81080832 Mar 26 22:47 rev-sorted-files.bsd.tar.xz6e
-rw-rw-r-- 1 lifthrasiir 74531756 Mar 26 22:50 rev-sorted-files.bsd.tar.xz7
-rw-rw-r-- 1 lifthrasiir 73955700 Mar 26 22:56 rev-sorted-files.bsd.tar.xz7e
-rw-rw-r-- 1 lifthrasiir 74053348 Mar 26 23:00 rev-sorted-files.bsd.tar.xz8
-rw-rw-r-- 1 lifthrasiir 73494812 Mar 26 23:06 rev-sorted-files.bsd.tar.xz8e
-rw-rw-r-- 1 lifthrasiir 60213532 Mar 26 23:10 rev-sorted-files.bsd.tar.xz9
-rw-rw-r-- 1 lifthrasiir 59672056 Mar 26 23:17 rev-sorted-files.bsd.tar.xz9e
The *.xz file corresponds to the default option (-6). I've also tested -6e through -9e, which try to compress more at the expense of compression speed (about 2x slower in my testing); they do have some impact but not much, so I guess -9 is the best option overall, as long as users have enough memory (see the footnote below). Note that the decompression speed differences were insignificant, except that -9/-9e were slightly faster than the others (probably due to less I/O overhead).
[1] All dictionary compression schemes require a certain amount of previously decoded data to be kept in memory. In gzip this is not significant (~64K), but for the costlier options of xz it can be: -9 requires 65 megabytes of memory, for example.
I followed the first steps suggested by @alexcrichton without encountering any significant issue.
@brson, should we start designing the new format of the manifest? What is the best place for doing that?
@ranma42 oh, @brson and I discussed this a long time ago actually, and we were both on board with just adding a new key to the manifest. Right now all artifacts have url = "..." which points to a *.tar.gz, and we'd just add a new key, xz-url = "...", which points to a *.tar.xz (or whatever format we select).
oh, and similar to hash = "..." we'd have xz-hash = "..." for each artifact
@alexcrichton a dash in the field name will prevent Target from being RustcEncodable and will instead require explicit serialization/deserialization, as mentioned here. Is that ok?
Given the proposed approach, I assume that there are no plans to add other formats in the future. Another option might be to add a sources array containing structs that have format, hash and url keys (or a BTreeMap in which the format is the key?). This would make it trivial to add/remove formats from the manifest without changing its schema. The tools (is there any other tool besides rustup consuming the manifest?) could just ignore the formats that are not supported or are disabled for some reason.
Oh, the serde version of toml takes care of that just fine (via serde attributes), and the old rustc-serialize version actually handled it as well (translating a Rust field named foo_bar into reads from the TOML key foo-bar).
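For instance, with serde attributes it would look something like this (a sketch; the exact derive machinery depends on the toml/serde versions in use):

use serde::Deserialize;

#[derive(Deserialize)]
struct Target {
    url: String,
    hash: String,
    // the TOML keys contain dashes, which are not valid Rust identifiers,
    // so serde renames bridge the two spellings
    #[serde(rename = "xz-url")]
    xz_url: Option<String>,
    #[serde(rename = "xz-hash")]
    xz_hash: Option<String>,
}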
I think we're definitely open to new formats in the future, we'd just add more keys. We could support a generic container (like a list) for the formats but it didn't really seem to give much benefit over just listing everything manually. Downloaders will basically always look for an exact one and otherwise fall back to tarballs.
I implemented the changes required to get the xz url and hash here, but I keep getting _ in the field names in the manifests. What should I do to get the fields written as xz-url? Should I use a different version of rustc-serialize?
Oh, ideally we'd switch to serde, but I wouldn't really worry about it; it's not that important. Due to bootstrapping, using serde in the compiler is difficult right now, unfortunately.
Then I will leave the manifest fields as xz_url and xz_hash for the time being and start updating rustup :)
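For concreteness, a target entry in the generated manifest would then look roughly like this (section name and hash values illustrative):

[pkg.rust.target.x86_64-unknown-linux-gnu]
url = "https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.gz"
hash = "..."
xz_url = "https://static.rust-lang.org/dist/rust-nightly-x86_64-unknown-linux-gnu.tar.xz"
xz_hash = "..."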
I have opened https://github.com/rust-lang/rust-installer/pull/57 and I was waiting for that to be merged before submitting the PR against rust, to update the submodule to the merge commit. Should I just open the new PR and then update it as needed?
The next version of rustup should include https://github.com/rust-lang-nursery/rustup.rs/pull/1100, hence it should use XZ by default (if available).
And rustup has now shipped!