One of the hottest functions in rustc is tdefl_compress, which is called from deflate_bytes. It's used in two places: crate metadata in rlibs, and LLVM bytecode files.
If we simply turned off compression in these two places we would get sizeable speed-ups. The following numbers are for a proof-of-concept patch, doing debug builds with a stage1 compiler.
```
futures-rs-test  4.632s vs  4.588s --> 1.009x faster (variance: 1.013x, 1.012x)
helloworld       0.249s vs  0.250s --> 0.997x faster (variance: 1.014x, 1.015x)
html5ever-2016-  7.967s vs  7.791s --> 1.023x faster (variance: 1.004x, 1.016x)
hyper.0.5.0      5.424s vs  5.177s --> 1.048x faster (variance: 1.004x, 1.006x)
inflate-0.1.0    5.013s vs  4.945s --> 1.014x faster (variance: 1.009x, 1.017x)
issue-32062-equ  0.367s vs  0.364s --> 1.008x faster (variance: 1.013x, 1.017x)
issue-32278-big  1.812s vs  1.810s --> 1.001x faster (variance: 1.007x, 1.008x)
jld-day15-parse  1.638s vs  1.606s --> 1.020x faster (variance: 1.001x, 1.012x)
piston-image-0. 12.522s vs 12.236s --> 1.023x faster (variance: 1.029x, 1.004x)
regex.0.1.30     2.684s vs  2.511s --> 1.069x faster (variance: 1.018x, 1.013x)
rust-encoding-0  2.232s vs  2.134s --> 1.046x faster (variance: 1.008x, 1.010x)
syntex-0.42.2   34.353s vs 33.205s --> 1.035x faster (variance: 1.011x, 1.013x)
syntex-0.42.2-i 18.848s vs 17.033s --> 1.107x faster (variance: 1.004x, 1.035x)
```
regex and syntex-incr are the biggest wins.
The obvious downside is that the size of the relevant files is larger. So we need to decide whether we are happy with this trade-off, accepting larger files for faster compilation.
CC @eddyb, @Mark-Simulacrum
Maybe we should switch to a faster compression algorithm - e.g. lz4?
Something that could be considered is an algorithm that is seekable, so that we also could decompress on-demand (as we use absolute/relative positioning in the metadata blob).
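To illustrate the seekable idea: split the payload into blocks, compress each block independently, and keep an index of block offsets so a reader can locate and decompress only the block it needs. The sketch below is purely illustrative (none of these names exist in rustc), and the identity codec is a stand-in for a real compressor:

```rust
// Sketch of a seekable container: the payload is split into fixed-size
// blocks, each block is compressed on its own, and an index of
// (offset, length) pairs lets a reader jump straight to block i and
// decompress only that block. The identity "codec" here is a placeholder.
struct SeekableBlob {
    index: Vec<(usize, usize)>, // (offset, compressed length) per block
    data: Vec<u8>,              // concatenated compressed blocks
}

impl SeekableBlob {
    fn build(payload: &[u8], block_size: usize) -> SeekableBlob {
        let mut index = Vec::new();
        let mut data = Vec::new();
        for block in payload.chunks(block_size) {
            let compressed = block.to_vec(); // placeholder for a real codec
            index.push((data.len(), compressed.len()));
            data.extend_from_slice(&compressed);
        }
        SeekableBlob { index, data }
    }

    // Decompress a single block on demand, without touching the others.
    fn read_block(&self, i: usize) -> Option<Vec<u8>> {
        let &(off, len) = self.index.get(i)?;
        Some(self.data[off..off + len].to_vec()) // placeholder decompression
    }
}
```

With a real per-block codec, random access costs one block's worth of decompression instead of decompressing the whole blob up to the target offset.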
If we can go back to compressing the metadata blobs in rlibs _without_ noticeable slowdown we could get rid of the leb128 encoding and just write integers in little-endian as they would be in memory.
Then again, it might not be worth it.
we would get sizeable speed-ups
How much larger do the rlibs become, though? 30%?
For reference:
Alright, so given the numbers in #6954, I don't think LZ4 is worth it, and neither does @brson.
from https://github.com/rust-lang/rust/issues/6902#issuecomment-19015982
To be honest, I don't think that the speed gains listed above warrant turning compression off. How hard would it be to compress things on a background thread? Compression could be done in parallel with codegen, maybe?
rlibs are already uncompressed except bitcode. Metadata in dylibs is compressed.
I _think_ we turned it off in rlibs not because of the one-time compression cost, but because of decompression in each user crate (it also means no mmap is possible).
It's used in two places: crate metadata in rlibs, and LLVM bytecode files.
As @eddyb mentioned, rlib metadata isn't compressed, as it's intended to be mmap'd. So I'm also curious where this compression is being called from. A standard compilation should not take a look at the bytecode (e.g. it shouldn't need to decompress it); it's only there for LTO builds. Or that's at least the state of the world as of a few years ago when I implemented LTO...
@nnethercote are you sure that this function is mostly being called from decompression of metadata and/or bytecode? I could imagine this showing up when _compressing_ bytecode but not during a normal compile...
tdefl_compress _is_ compression, i.e. "deflate" ("inflate" being decompression; the "de" in "deflate" might be a bit confusing).
Oh ok I think I misread.
Then yes, I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs). We can likely stomach larger rlibs (the size rarely comes up), but to truly fix this we in theory want to disable bytecode-in-rlib entirely. It's only used for LTO, which is almost never used, so we should arguably require a flag to opt in to LTO-able rlibs, which Cargo passes by default if need be.
Then yes, I think that this is purely being called from compressing bytecode, not the metadata itself (which isn't compressed in rlibs).
As the first comment says, compression occurs in two places. More specifically, here:
- write_metadata, in src/librustc_trans/base.rs
- link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.
we could get rid of the leb128 encoding
leb128 encoding is also hot. E.g. see #37083 where a tiny improvement to read_unsigned_leb128 had a small but noticeable effect.
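For context on why LEB128 decoding shows up in profiles: unsigned LEB128 stores an integer seven bits per byte, using the high bit of each byte as a continuation flag, so reading one involves a data-dependent loop, unlike a single fixed-width little-endian load. A minimal sketch of the scheme (not rustc's actual implementation):

```rust
// Unsigned LEB128: 7 payload bits per byte, high bit set on all bytes
// except the last. Small values take one byte; larger values take more.
fn write_unsigned_leb128(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let mut byte = (value & 0x7f) as u8;
        value >>= 7;
        if value != 0 {
            byte |= 0x80; // continuation bit: more bytes follow
        }
        out.push(byte);
        if value == 0 {
            break;
        }
    }
}

// Returns the decoded value and the position just past it.
fn read_unsigned_leb128(data: &[u8], mut pos: usize) -> (u64, usize) {
    let mut result: u64 = 0;
    let mut shift = 0;
    loop {
        let byte = data[pos];
        pos += 1;
        result |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    (result, pos)
}
```

The trade-off discussed above is exactly this loop versus a plain little-endian copy: LEB128 saves space for small integers at the cost of per-byte branching on every read.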
Any suggestions on how to move forward here? The speed-ups are large, but the disk space trade-off is such that it will be easy for inertia to win here, and I'd like to avoid that. I can take file measurements if someone suggests good examples to measure.
I suppose my personal ideal world would look a little like:
That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. I've found that Cargo build directories in general don't matter as much for size as the Rust installation itself.
I'd prefer to have a future-proof concept before doing anything here. Unless I'm overlooking something, there are only two crates where this makes a difference of more than a second, and that's for debug builds of big crates. In a release build, I would suspect that the difference is well under 1% of total build time. So I don't see an urgent need for action here. (Sorry, @nnethercote, I don't want to put a damper on your enthusiasm. It's great that you are looking into this; I just don't want to needlessly rush things.)
Some questions that I'd like to have answered before proceeding:
…rlibs that have been talked about from time to time. I'm worried about introducing another stable command-line option just to deprecate it again a few months later.
-Os or -Oz?

A relatively easy change that might help would be to simply reduce the compression level. There are some flags that can be changed, see here, or here. Maybe there is a better trade-off that could be found without making massive changes. Another thing that might be worth noting is that the deflate function as it's written now will make miniz allocate memory (I haven't counted the exact size, but it may be close to 1 MB) for the compressor each time it's called, which may not be ideal if it's used a number of times in short succession.
I don't think we want to make LTO opt-in; Rust should support it by default, without needing to rebuild the world with it.
Not sure how much it helps, but deflate can be parallelized.
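One way deflate is commonly parallelized (as in tools like pigz) is to split the input into chunks, compress each chunk independently on its own thread, and concatenate the outputs in order. A rough sketch of that shape using scoped threads; `compress_chunk` is a placeholder for a real codec, not an actual deflate implementation:

```rust
use std::thread;

// Placeholder: a real implementation would run deflate on the chunk here
// (e.g. emitting one independent deflate stream per chunk).
fn compress_chunk(chunk: &[u8]) -> Vec<u8> {
    chunk.to_vec()
}

// Split `data` into `chunk_size` pieces, compress each on its own thread,
// and collect the compressed chunks in their original order.
fn compress_parallel(data: &[u8], chunk_size: usize) -> Vec<Vec<u8>> {
    thread::scope(|scope| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || compress_chunk(chunk)))
            .collect();
        // Joining in spawn order preserves the original chunk order.
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

The cost of this approach is a slightly worse compression ratio, since each chunk starts with an empty dictionary, but the work scales across cores.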
It's only used for LTO, which is almost never used
I disagree that LTO is almost never used. But, also, in the long run it should be used more. So, I'd rather have all libraries support LTO by default.
Rust should support it by default, without needing to rebuild the world with it.
I personally am OK with rebuilding the world to switch from non-LTO to LTO builds, as long as it doesn't require every library to opt into anything manually.
Perhaps, when building an application, Cargo could inspect the configuration and if any configuration enables LTO then it could build every library with LTO support. Otherwise, if the application doesn't enable LTO support, then it wouldn't enable LTO support. I don't know if this would need to inspect just the application's Cargo.toml or if it would need to recursively inspect all the Cargo.tomls, but this seems like just a detail.
Just to point out the obvious: using something different from deflate might help more, and since you are already encoding the data, it might be worth considering doing it once with lz4, lzo, or even zstd and sparing one step.
How hard would it be to plug in any of those?
zstd is supposedly a faster version of zlib with an equal compression rate. Someone should check that.
As the first comment says, compression occurs in two places. More specifically, here:
- write_metadata, in src/librustc_trans/base.rs
- link_rlib, in src/librustc_trans/back/link.rs

Both of them are significant, though how significant varies by benchmark. E.g. the former affects regex more, the latter affects syntex-incr more.
@eddyb worked out that write_metadata is compressing metadata unnecessarily for rlibs! #37267 addresses this. With that change, write_metadata still compresses for dylibs and "proc macro" crates, but AIUI they are much rarer. So that's the first part of this issue addressed.
That still leaves the bytecode compression, which is the larger part of the potential speed-up. I tried disabling bytecode compression and the size of the rlibs for syntex increased from 54508120 to 75265964 bytes, a 1.38x increase, i.e. quite a bit. So unconditionally disabling it probably isn't feasible.
I did a small bit of analysis about compression algorithms and such. I extracted all bytecode from the standard distribution rlibs (e.g. everything we're shipping). I decompressed it and then recompressed it with a bunch of algorithms. Kept track of how long everything took as well as the compression ratios.
The raw data is here where each section is the statistics for one particular piece of bytecode. The final entry is the summation of everything previous. I tested:
Basically what I think this tells me is that zstd is blazingly fast and gets better compression than deflate (what we're using today) at lower levels. Otherwise xz is super slow (surprise surprise) and brotli also isn't faring too well on this data set.
Now that being said, this seems like it's a tiny portion of compiles. The deflate times for any particular bytecode are in the handfuls of milliseconds at most it looks like. If we really want speed though, zstd seems the way to go.
Thank you for the analysis, @alexcrichton. Which parameters define the "fast", "default", and "best" modes for deflate?
Now that being said, this seems like it's a tiny portion of compiles.
For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...
Based on the tests done by @alexcrichton (maybe you could put the test up in a GitHub repo or something?), it seems that lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while a switch to a different compression algorithm is being decided on and, possibly, implemented.
For syntex-incr it's ~10% for a debug build! And that's a particularly interesting example, given that "incremental compilation is coming soon" is the standard response to any complaints about rustc's speed...
Is that still true after metadata is not compressed for rlibs anymore?
I think a good first step towards improvement here would be to allow for LLVM bitcode to be either compressed or not. The way we store bitcode already contains a small header that tells us about the format, so this would be easy to implement in a backwards compatible way.
With that implemented we can just forgo compression in some scenarios (like debug builds or incr. comp.) in a transparent way.
Adding support for zstd would be nice too.
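The backwards-compatible header idea can be sketched concretely: a few magic bytes followed by an encoding tag, so readers can tell whether the embedded bitcode is compressed (and with what) before touching the payload. The magic bytes, enum, and layout below are invented for illustration and are not the format rustc actually uses:

```rust
// Hypothetical self-describing header for bitcode stored in an rlib.
const MAGIC: &[u8; 4] = b"RBCB"; // invented magic, for illustration only

#[derive(Debug, PartialEq)]
enum BytecodeEncoding {
    Uncompressed,
    Deflate,
    // New variants (e.g. Zstd) could be added without breaking old readers,
    // which would simply reject tags they don't recognize.
}

fn write_header(out: &mut Vec<u8>, enc: &BytecodeEncoding) {
    out.extend_from_slice(MAGIC);
    out.push(match enc {
        BytecodeEncoding::Uncompressed => 0,
        BytecodeEncoding::Deflate => 1,
    });
}

// Returns None for a truncated header, wrong magic, or an unknown tag.
fn read_header(data: &[u8]) -> Option<BytecodeEncoding> {
    if data.len() < 5 || &data[..4] != MAGIC {
        return None;
    }
    match data[4] {
        0 => Some(BytecodeEncoding::Uncompressed),
        1 => Some(BytecodeEncoding::Deflate),
        _ => None,
    }
}
```

With a tag like this, debug and incremental builds could write `Uncompressed` while release artifacts keep compression, and consumers dispatch on the tag transparently.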
@nnethercote
The fast/default/best modes correspond to what's in flate2.
@oyvindln
It's pretty janky, but the script I used is here.
@michaelwoerister
Agreed we should support uncompressed bytecode! I'd also be fine with it behind a flag that we disable by default and only enable for our own releases.
That way the Rust installation stays the same size, yet everyone gets the benefit of not having to compress bytecode. I've found that Cargo build directories in general don't matter as much for size as the Rust installation itself.
All said and done, my Visual Studio + Windows SDK setup is ~5GB. I wouldn't blink at the Rust installation doubling in size if it meant compilation was even slightly (a few percent) faster and/or if the toolchain was easier to maintain (i.e. if it didn't need to support any compression at all).
lowering the deflate compression level could provide a nice speedup without a huge loss in compression efficiency. As it would probably only require changing 1-2 lines of code, it may be a good idea to do this while a switch to a different compression algorithm is being decided on and, possibly, implemented.
This is exactly what I've done in #37298. Thank you for the suggestion.