At the moment the compiler binaries that we release are not as fast and optimized as they could be. As of https://github.com/rust-lang/rust/commit/ff227c4a2d8a2fad5abf322f6f1391ae6779197f, they are built with multiple codegen units and ThinLTO again, which makes the compiler around 10% slower than when built with a single CGU per crate. We really should be able to do better here, e.g. by building with `-Ccodegen-units=1` for stable releases. @rust-lang/release @rust-lang/infra, how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.
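As a sketch, the bootstrap side of this could be a config.toml fragment along these lines (the option name is assumed from bootstrap's configuration; check config.toml.example before relying on it):

```toml
# Hypothetical fragment for a dedicated stable-release build.
# `codegen-units` under [rust] controls how rustc itself is compiled;
# 1 CGU trades a longer build for a faster resulting compiler.
[rust]
codegen-units = 1
```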
To set expectations: the 10% figure from perf is not "10% slower", it's "executes 10% more instructions". A change in instruction count is often an indicator that there could be a regression, but it does not translate into a 10% slowdown in literal wall time. For example, the wall-time measurements for that commit show the worst regression, percentage-wise, as 0.49s to 0.56s. Large benchmarks like servo-style-opt got at worst 3.8% slower in a clean build from scratch, going from 75 to 78 seconds.
I point this out because reducing the number of codegen units, PGO, and those sorts of optimizations aren't really silver bullets. They're incredibly expensive optimizations that buy a few seconds here and there, as opposed to major optimizations across the board.
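For concreteness, here is the plain arithmetic behind those percentages (the timings in the comment are rounded, so the computed values are approximate):

```python
# Percentage regressions implied by the figures quoted above.
# Instruction counts and wall times move by different amounts, which is
# why "10% more instructions" does not mean builds take 10% longer.

def pct_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0

# Worst wall-time regression quoted: 0.49s -> 0.56s, a large relative
# change but only 0.07s in absolute terms.
tiny = pct_change(0.49, 0.56)

# Large benchmark (servo-style-opt, clean opt build): 75s -> 78s.
large = pct_change(75.0, 78.0)

print(f"tiny benchmark:  {tiny:+.1f}%")
print(f"servo-style-opt: {large:+.1f}%")
```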
@alexcrichton thanks for clarifying.
@alexcrichton Yes, I know that this won't make the compiler massively faster. On the other hand, it's not uncommon that we spend weeks of developer time on getting a 5% compile time improvement. If there's the opportunity of making the compiler 10% faster by letting a build machine chew on it for a few hours every six weeks, I think we should take it.
That being said, I don't underestimate the complexity of our CI. I just don't want us to disregard the opportunity from the beginning. Maybe there is a simpler solution that would get us 90% of the way.
Moving to `opt-level=3` can speed up the compiler by up to 2%, but it's blocked on a Windows codegen bug. See also: #48204.
@andjo403's comments on gitter have given me the idea that we could also try to build LLVM with PGO. I realize of course that this would require lots of new infrastructure support and isn't something that can be implemented quickly.
Some updates here:
I opened a separate issue for symbol ordering: https://github.com/rust-lang/rust/issues/50655
`windows-gnu` remains the only Tier 1 platform still using GCC instead of Clang to build LLVM.
I decided to take a look at it and the results are:
- `ld`: `dbg!` macros and a few other things cause runtime failures (fixed in Binutils trunk recently).
- `lld` (downloaded from https://llvm.org): `lld` 7.x isn't fully compatible with libraries built by the GNU toolchain and requires rebuilding the sysroot with the LLVM toolchain.
- `lld` trunk: said to be compatible with GNU-based sysroots. I haven't tested it, but it won't be a problem for me to test if there is interest.

@alexcrichton & @nnethercote: Thanks to you we have pipelining now and our bootstrap time should be quite a bit shorter, right? (According to this: https://gistpreview.github.io/?74d799739504232991c49607d5ce748a) Can we switch the compiler back to `-Ccodegen-units=1`? That might be a 10% performance win right there!
We're unfortunately way too close to 4 hours and frequently going over I think today to be able to afford going back to codegen-units=1. Pipelining I think doesn't help us too much on CI since we only have 2 cores currently so we're not getting the advantage of -j28 like that graph shows :)
I am surprised that the simple `rustc_codegen_utils` takes 18s, while the way more complex `rustc_codegen_ssa` takes 24s in @michaelwoerister's timings.
since we only have 2 cores
:scream:
But as there are only 2 cores, are we sure that `codegen-units=1` is not faster?
My understanding is that LLVM is faster at optimizing smaller modules (not altogether non-obvious, I think, though certainly interesting). That means that splitting the same IR into more modules can produce faster builds, even with just one core.
That means that splitting the same IR into more modules can produce faster builds, even with just one core.
On the other hand we'd skip the entire ThinLTO step... let me give it a try locally.
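One way to see why smaller modules can win even on a single core: if LLVM's optimization time grows superlinearly in module size (an illustrative modeling assumption here, not a measurement of LLVM), splitting the same IR into more modules reduces total work even when the modules are processed serially. A toy sketch:

```python
# Toy model: assume optimizing a module of n units of IR costs n**1.3.
# The superlinear exponent is an illustrative assumption, not measured.
# Under it, splitting the same IR into k equal modules costs
# k * (n/k)**1.3, which is smaller than n**1.3 even with no parallelism.

def opt_cost(ir_size, exponent=1.3):
    return ir_size ** exponent

total_ir = 16_000  # arbitrary units of IR

one_module = opt_cost(total_ir)
sixteen_modules = 16 * opt_cost(total_ir / 16)

print(f"1 CGU:   {one_module:,.0f}")
print(f"16 CGUs: {sixteen_modules:,.0f}  (ratio {sixteen_modules / one_module:.2f})")
```

The model deliberately ignores the extra ThinLTO pass and the cross-module optimization that a single CGU enables, which is exactly the trade-off being measured here.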
I would personally agree with @Mark-Simulacrum that we're extremely strapped for time budget on CI right now, and the longest builders are the Windows release builders. We should be extremely careful about making them slow (aka losing parallelism) and we're also hoping to get 4-core machines at some point which may change the calculus in terms of whether 2 cores + pipelining gives us sufficient parallelism or not.
My local test for `./x.py -j2 dist` on Linux gave me ~40 minutes for 1 CGU and ~37 minutes for 16 CGUs, so the one-CGU case is indeed a bit slower (although it's not as extreme as in the past).
@michaelwoerister said this at the start:
how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.
From subsequent comments it seems like this point might be getting overlooked? We wouldn't do this for all CI builds, just those generating stable releases. How often are stable releases generated?
We build stable artifacts approximately once every 6 weeks. While I believe the CI platform we're currently on, Pipelines, does not have strict timeouts, I would rather avoid having to wait more than the existing 4+ hours for a full stable build. Plus, optimizations in this area seem likely to introduce regressions, right? I guess that might be rare, but I believe it is non-theoretical that changes to the codegen units used to build the compiler have caused bugs in the past; I could be wrong about this claim.
I grepped for past PRs and I have no idea what the current state of distribution builds is: it seems the last documented change was https://github.com/rust-lang/rust/issues/45444, which means `codegen-units=1` and `lto=no`? (Of course that seems a bit old, which is weird.) What is the current state?
@nnethercote to add to what @Mark-Simulacrum already mentioned I personally think we also derive a lot of value from stable/beta/nightly releases all being produced exactly the same way. That way we can exclude a class of bugs where stable releases are buggy due to how they're built but beta/nightly don't have the same bugs. (for example this would help prevent a showstopper bug on either beta or stable). There's also enough users of non-stable that producing quite-fast compilers on nightly and such is relatively important.
If we try to build a full release every night, however, that's where it gets pretty onerous to make release builds slower. That'd happen at least once a day (multiple times for stable/beta), and that runs the risk of being even slower than we currently are, which is already sort of unbearably slow :(
@ishitatsuyuki I believe the current state is that libstd is built with one CGU and all rustc crates are built with 16 CGUs and have ThinLTO enabled for each crate's set of CGUs.
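Expressed as a bootstrap config sketch (the option names, in particular `codegen-units-std`, are assumed from config.toml.example and may differ between versions):

```toml
# Sketch of the split described above, not an authoritative config.
[rust]
codegen-units = 16      # rustc crates: 16 CGUs, ThinLTO across each crate's CGUs
codegen-units-std = 1   # libstd: a single CGU
```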
I agree that we should release what we regularly test. Thanks for pointing that out.
Here's a possibly interesting thought: PGO speeds up Firefox quite a bit (5-10%). Maybe it would be possible to harness PGO for our LLVM builds? We rebuild LLVM only very infrequently and fall back on a cached version for the rest of the time. We just would need a way to fill the cache with a PGO'ed version of LLVM (which is kind of complicated I guess).
Anyway, a starting point would be to do a local test and see if there are actual performance improvements to be had.
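Such a local test could start from LLVM's own PGO hooks. The CMake options below (`LLVM_BUILD_INSTRUMENTED`, `LLVM_PROFDATA_FILE`) exist in LLVM's build system; the paths and the training workload are placeholders, and wiring the result into rustbuild's LLVM cache is the part that would need new infrastructure:

```sh
# Phase 1: configure and build an IR-instrumented LLVM.
cmake -G Ninja ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_BUILD_INSTRUMENTED=IR
ninja

# Phase 2: train on a representative workload, e.g. build a few large
# crates with a rustc linked against this instrumented LLVM, then merge
# the raw profiles it wrote.
llvm-profdata merge -output=llvm.profdata path/to/profiles/*.profraw

# Phase 3: reconfigure and rebuild LLVM using the merged profile.
cmake -G Ninja ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_PROFDATA_FILE=$PWD/llvm.profdata
ninja
```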
`-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld` generates a broken rustc:

```
rustc[2418] trap invalid opcode ip:7efd8ca7cef8 sp:7efd87acfa40 error:0 in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
```

Execution runs into a `ud2` instruction.
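As a debugging aid (my suggestion, not something from the thread): the trap line contains both the faulting `ip` and the base of the libstd mapping, so the offset of the bad instruction inside the `.so` is simple arithmetic:

```python
# Values copied from the trap message above:
#   ip:7efd8ca7cef8 ... in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
ip = 0x7EFD8CA7CEF8    # faulting instruction pointer
base = 0x7EFD8CA45000  # start of the libstd mapping

offset = ip - base
print(hex(offset))  # -> 0x37ef8
```

The code at that offset can then be inspected with e.g. `objdump -d --start-address=0x37ef8 libstd-71e59b47b634435d.so` to confirm the `ud2`.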