At the moment the compiler binaries that we release are not as fast and optimized as they could be. As of https://github.com/rust-lang/rust/commit/ff227c4a2d8a2fad5abf322f6f1391ae6779197f, they are built with multiple codegen units and ThinLTO again, which makes the compiler around 10% slower than when built with a single CGU per crate. We really should be able to do better here, e.g. by building with `-Ccodegen-units=1` for stable releases. @rust-lang/release @rust-lang/infra, how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.
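As a sketch, the bootstrap side of this could be a config.toml fragment along these lines (the option name is assumed from bootstrap's configuration; check config.toml.example before relying on it):

```toml
# Hypothetical fragment for a dedicated stable-release build.
# `codegen-units` under [rust] controls how rustc itself is compiled;
# 1 CGU trades a longer build for a faster resulting compiler.
[rust]
codegen-units = 1
```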
To set expectations: the 10% figure from perf is not "10% slower", it's "executes 10% more instructions". A change in instruction count is often an indicator that there could be a regression, but it does not translate into a 10% slowdown in literal wall time. For example, the wall-time measurements for that commit show the worst regression, percentage-wise, as 0.49s to 0.56s. Large benchmarks like servo-style-opt got at worst 3.8% slower in a clean build from scratch, going from 75 to 78 seconds.
I point this out because reducing the number of codegen units, PGO, and those sorts of optimizations aren't really silver bullets. They're incredibly expensive optimizations that buy a few seconds here and there, as opposed to major optimizations across the board.
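For concreteness, here is the plain arithmetic behind those percentages (the timings in the comment are rounded, so the computed values are approximate):

```python
# Percentage regressions implied by the figures quoted above.
# Instruction counts and wall times move by different amounts, which is
# why "10% more instructions" does not mean builds take 10% longer.

def pct_change(before, after):
    """Relative change from `before` to `after`, as a percentage."""
    return (after - before) / before * 100.0

# Worst wall-time regression quoted: 0.49s -> 0.56s, a large relative
# change but only 0.07s in absolute terms.
tiny = pct_change(0.49, 0.56)

# Large benchmark (servo-style-opt, clean opt build): 75s -> 78s.
large = pct_change(75.0, 78.0)

print(f"tiny benchmark:  {tiny:+.1f}%")
print(f"servo-style-opt: {large:+.1f}%")
```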
@alexcrichton thanks for clarifying.
@alexcrichton Yes, I know that this won't make the compiler massively faster. On the other hand, it's not uncommon that we spend weeks of developer time on getting a 5% compile time improvement. If there's the opportunity of making the compiler 10% faster by letting a build machine chew on it for a few hours every six weeks, I think we should take it.
That being said, I don't underestimate the complexity of our CI. I just don't want us to disregard the opportunity from the beginning. Maybe there is a simpler solution that would get us 90% of the way.
Moving to `opt-level=3` can speed up the compiler by up to 2%, but it's blocked on a Windows codegen bug. See also: #48204.
@andjo403's comments on gitter have given me the idea that we could also try to build LLVM with PGO. I realize of course that this would require lots of new infrastructure support and isn't something that can be implemented quickly.
Some updates here:
I opened a separate issue for symbol ordering: https://github.com/rust-lang/rust/issues/50655
`windows-gnu` remains the only Tier 1 platform still using GCC instead of Clang to build LLVM.
I decided to take a look at it and the results are:
- `ld`: `dbg!` macros and a few other things cause runtime failures (fixed in Binutils trunk recently).
- `lld` (downloaded from https://llvm.org): `lld` 7.x isn't fully compatible with libraries built by the GNU toolchain and requires rebuilding the sysroot with the LLVM toolchain.
- `lld` trunk: said to be compatible with GNU-based sysroots. I haven't tested it, but it won't be a problem for me to test if there is interest.

@alexcrichton & @nnethercote: Thanks to you we have pipelining now and our bootstrap time should be quite a bit shorter, right? (According to this: https://gistpreview.github.io/?74d799739504232991c49607d5ce748a) Can we switch the compiler back to `-Ccodegen-units=1`? That might be a 10% performance win right there!
We're unfortunately way too close to 4 hours and frequently going over I think today to be able to afford going back to codegen-units=1. Pipelining I think doesn't help us too much on CI since we only have 2 cores currently so we're not getting the advantage of -j28 like that graph shows :)
I am surprised that the simple `rustc_codegen_utils` takes 18s, while the way more complex `rustc_codegen_ssa` takes 24s in @michaelwoerister's timings.
since we only have 2 cores
:scream:
But as there are only 2 cores, are we sure that `codegen-units=1` is not faster?
My understanding is that LLVM is faster at optimizing smaller modules (not altogether non-obvious, I think, though certainly interesting). That means that splitting the same IR into more modules can produce faster builds, even with just one core.
That means that splitting the same IR into more modules can produce faster builds, even with just one core.
On the other hand we'd skip the entire ThinLTO step... let me give it a try locally.
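One way to see why smaller modules can win even on a single core: if LLVM's optimization time grows superlinearly in module size (an illustrative modeling assumption here, not a measurement of LLVM), splitting the same IR into more modules reduces total work even when the modules are processed serially. A toy sketch:

```python
# Toy model: assume optimizing a module of n units of IR costs n**1.3.
# The superlinear exponent is an illustrative assumption, not measured.
# Under it, splitting the same IR into k equal modules costs
# k * (n/k)**1.3, which is smaller than n**1.3 even with no parallelism.

def opt_cost(ir_size, exponent=1.3):
    return ir_size ** exponent

total_ir = 16_000  # arbitrary units of IR

one_module = opt_cost(total_ir)
sixteen_modules = 16 * opt_cost(total_ir / 16)

print(f"1 CGU:   {one_module:,.0f}")
print(f"16 CGUs: {sixteen_modules:,.0f}  (ratio {sixteen_modules / one_module:.2f})")
```

The model deliberately ignores the extra ThinLTO pass and the cross-module optimization that a single CGU enables, which is exactly the trade-off being measured here.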
I would personally agree with @Mark-Simulacrum that we're extremely strapped for time budget on CI right now, and the longest builders are the Windows release builders. We should be extremely careful about making them slow (aka losing parallelism) and we're also hoping to get 4-core machines at some point which may change the calculus in terms of whether 2 cores + pipelining gives us sufficient parallelism or not.
My local test for `./x.py -j2 dist` on Linux gave me ~40 minutes for 1 CGU and ~37 minutes for 16 CGUs, so the one-CGU case is indeed a bit slower (although it's not as extreme as in the past).
@michaelwoerister said this at the start:
how can we decouple builds of stable releases from the regular CI builds that are timing out so much lately? There should be a way of doing these builds without the severe time limits that we have in regular CI.
From subsequent comments it seems like this point might be getting overlooked? We wouldn't do this for all CI builds, just those generating stable releases. How often are stable releases generated?
We build stable artifacts approximately once every 6 weeks. While I believe the CI platform we're currently on, Pipelines, does not have strict timeouts, I would rather avoid having to wait more than the existing 4+ hours for a full stable build. Plus, optimizations in this area seem likely to introduce regressions, right? I guess that might be rare, but I believe it is non-theoretical that changes to the codegen units used to build the compiler have caused bugs in the past; I could be wrong about this claim.
I grepped for past PRs and I have no idea what the current state of distribution builds is: it seems the last documented change was https://github.com/rust-lang/rust/issues/45444, which means `codegen-units=1` and `lto=no`? (Of course that seems a bit old, which is weird.) What is the current state?
@nnethercote to add to what @Mark-Simulacrum already mentioned I personally think we also derive a lot of value from stable/beta/nightly releases all being produced exactly the same way. That way we can exclude a class of bugs where stable releases are buggy due to how they're built but beta/nightly don't have the same bugs. (for example this would help prevent a showstopper bug on either beta or stable). There's also enough users of non-stable that producing quite-fast compilers on nightly and such is relatively important.
If we try to build a full release every night, however, that's where it gets pretty onerous to make release builds slower. That'd happen at least once a day (multiple times for stable/beta), and that runs the risk of being even slower than we currently are, which is already sort of unbearably slow :(
@ishitatsuyuki I believe the current state is that libstd is built with one CGU and all rustc crates are built with 16 CGUs and have ThinLTO enabled for each crate's set of CGUs.
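Expressed as a bootstrap config sketch (the option names, in particular `codegen-units-std`, are assumed from config.toml.example and may differ between versions):

```toml
# Sketch of the split described above, not an authoritative config.
[rust]
codegen-units = 16      # rustc crates: 16 CGUs, ThinLTO across each crate's CGUs
codegen-units-std = 1   # libstd: a single CGU
```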
I agree that we should release what we regularly test. Thanks for pointing that out.
Here's a possibly interesting thought: PGO speeds up Firefox quite a bit (5-10%). Maybe it would be possible to harness PGO for our LLVM builds? We rebuild LLVM only very infrequently and fall back on a cached version for the rest of the time. We just would need a way to fill the cache with a PGO'ed version of LLVM (which is kind of complicated I guess).
Anyway, a starting point would be to do a local test and see if there are actual performance improvements to be had.
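Such a local test could start from LLVM's own PGO hooks. The CMake options below (`LLVM_BUILD_INSTRUMENTED`, `LLVM_PROFDATA_FILE`) exist in LLVM's build system; the paths and the training workload are placeholders, and wiring the result into rustbuild's LLVM cache is the part that would need new infrastructure:

```sh
# Phase 1: configure and build an IR-instrumented LLVM.
cmake -G Ninja ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_BUILD_INSTRUMENTED=IR
ninja

# Phase 2: train on a representative workload, e.g. build a few large
# crates with a rustc linked against this instrumented LLVM, then merge
# the raw profiles it wrote.
llvm-profdata merge -output=llvm.profdata path/to/profiles/*.profraw

# Phase 3: reconfigure and rebuild LLVM using the merged profile.
cmake -G Ninja ../llvm \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_PROFDATA_FILE=$PWD/llvm.profdata
ninja
```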
`-Clinker-plugin-lto -Clinker=clang -Clink-arg=-fuse-ld=lld` generates a broken rustc:

```
rustc[2418] trap invalid opcode ip:7efd8ca7cef8 sp:7efd87acfa40 error:0 in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
```

Execution runs into a `ud2` instruction.
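As a debugging aid (my suggestion, not something from the thread): the trap line contains both the faulting `ip` and the base of the libstd mapping, so the offset of the bad instruction inside the `.so` is simple arithmetic:

```python
# Values copied from the trap message above:
#   ip:7efd8ca7cef8 ... in libstd-71e59b47b634435d.so[7efd8ca45000+83000]
ip = 0x7EFD8CA7CEF8    # faulting instruction pointer
base = 0x7EFD8CA45000  # start of the libstd mapping

offset = ip - base
print(hex(offset))  # -> 0x37ef8
```

The code at that offset can then be inspected with e.g. `objdump -d --start-address=0x37ef8 libstd-71e59b47b634435d.so` to confirm the `ud2`.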