Rust: Tracking Issue for making incremental compilation the default for Release Builds

Created on 29 Jan 2019 · 16 comments · Source: rust-lang/rust

Since incremental compilation supports being used in conjunction with ThinLTO, the runtime performance of incrementally built artifacts is (presumably) roughly on par with non-incrementally built code. At the same time, building incrementally is often significantly faster (1.4-5x according to perf.rlo). As a consequence, it might be a good idea to make Cargo default to incremental compilation for release builds.

Possible caveats that need to be resolved:

  • [ ] The initial build is slightly slower with incremental compilation, usually around 10%. We need to decide if this is a worthwhile tradeoff. For debug and check builds everybody seems to be fine with this already.
  • [ ] Some crates, like style-servo, are always slower to compile with incr. comp., even if there is just a small change. In the case of style-servo that is 62 seconds versus 64-69 seconds on perf.rlo. It is unlikely that this would improve before we make incr. comp. the default. We need to decide if this is a justifiable price to pay for improvements in other projects.
  • [ ] Even if incremental compilation becomes the default, one can still always opt out of it via the CARGO_INCREMENTAL environment variable or a local Cargo config. However, this might not be common knowledge, just as it isn't common knowledge that one can improve runtime performance by forcing the compiler to use just one codegen unit.
  • [x] It still needs to be verified that runtime performance of compiled artifacts does not suffer too much from switching to incremental compilation (see below).
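The opt-out path mentioned in the caveats above could look like this (a sketch; the manifest key below is the profile-level equivalent of setting `CARGO_INCREMENTAL=0` in the environment):

```toml
# Cargo.toml — opt back out of incremental compilation for release
# builds, should it become the default.
[profile.release]
incremental = false
```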

Data on runtime performance of incrementally compiled release artifacts

Apart from anecdotal evidence that runtime performance is "roughly the same", there have been two attempts to measure this in a more reliable way:

  1. PR #56678 did an experiment where we compiled the compiler itself incrementally and then tested how the compiler's runtime performance was affected by this. The results are twofold:

    1. In general performance drops by 1-2% (compare results for clean builds)

    2. For two of the small test cases (helloworld, unify-linearly) performance drops by 30%. It is known that these test cases are very sensitive to LLVM making the right inlining decisions, which we already saw when switching from single-CGU to non-incremental ThinLTO. This is indicative that microbenchmarks may see performance drops unless the author of the benchmark takes care of marking bottleneck functions with #[inline].

  2. For a limited period of time we made incremental compilation the default in Cargo (https://github.com/rust-lang/cargo/pull/6564) in order to see how this affected measurements on lolbench.rs. It is not yet clear if the experiment succeeded and how much useful data it collected since we had to cut it short because of a regression (#57947). The initial data looks promising: only a handful of the ~600 benchmarks showed performance losses (see https://lolbench.rs/#nightly-2019-01-27). But we need further investigation on how reliable the results are. We might also want to re-run the experiment since the regression can easily be avoided.

One more experiment we should do is compiling Firefox because it is a large Rust codebase with an excellent benchmarking infrastructure (cc @nnethercote).

cc @rust-lang/core @rust-lang/cargo @rust-lang/compiler

A-incr-comp C-tracking-issue I-compiletime T-cargo T-compiler T-core WG-compiler-performance


All 16 comments

On Tue, Jan 29, 2019 at 02:36:31AM -0800, Michael Woerister wrote:

> Data on runtime performance of incrementally compiled release artifacts
>
> Apart from anecdotal evidence that runtime performance is "roughly the same" there have been two attempts to measure this in a more reliable way:
>
>   1. PR #56678 did an experiment where we compiled the compiler itself incrementally and then tested how the compiler's runtime performance was affected by this. The results are twofold:
>
>     1. In general performance drops by 1-2% (compare results for clean builds)
>
>     2. For two of the small test cases (helloworld, unify-linearly) performance drops by 30%. It is known that these test cases are very sensitive to LLVM making the right inlining decisions, which we already saw when switching from single-CGU to non-incremental ThinLTO. This is indicative that microbenchmarks may see performance drops unless the author of the benchmark takes care of marking bottleneck functions with #[inline].

I'm not especially worried about the increases in compile time, as they
seem worth the cost. However, these regressions in runtime performance
don't seem reasonable to me; I don't think we should change the default
to something that has any runtime performance cost.

> I don't think we should change the default to something that has any runtime performance cost.

I'm not sure. The current default already has a quite significant runtime performance cost because it's using ThinLTO instead of -Ccodegen-units=1.
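For comparison, the single-codegen-unit configuration mentioned here can be requested explicitly today; a sketch:

```toml
# Cargo.toml — trade compile time for runtime performance: a single
# codegen unit gives LLVM full inlining scope, at the cost of the
# parallelism that ThinLTO across multiple units provides.
[profile.release]
codegen-units = 1
```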

We've had a ton of discussions before about compile-time and runtime tradeoffs, see https://github.com/rust-lang/rust/issues/45320 and https://github.com/rust-lang/rust/issues/44941 for just a smattering. We are very intentionally not enabling the fastest compilation mode with cargo build --release by default today, and an issue like this is a continuation of that.

@alexcrichton To avoid ambiguity, what do you mean by "fastest compilation mode" here?

I certainly think we don't need to worry about compiling as fast as possible, but I don't think our default compile should pay a runtime performance penalty like this.

Ah, by that I mean producing the fastest code possible. Producing the fastest code by default for --release would mean things like LTO, enabling PGO, customizing the LLVM pass manager to just rerun itself either to a fixed point or until some amount of time elapses, etc.

So if release is a "best effort at being fast while still finishing the build sometime today", can we just add a _different_ profile for "really the fastest but it'll take a day to build"?

Yeah, I'm honestly thinking that it may be time for a profile between debug and release, such that there are these use cases:

  • Debug: The code is compiled such that you have the best experience trying to remove bugs.

  • "Development" / "Optimized": The code is incrementally compiled with some optimizations, such that it's suitable for fast development cycles and everyday programming.

  • Release: The code is heavily optimized, such that it can be published.

At the moment I'm seeing lots of people either sacrifice the debug profile for that "Development" use case (bumping optimization levels, but reducing the debuggability of the project) or sacrifice the release profile by reducing optimizations; both are kind of suboptimal.
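A middle profile along these lines could be expressed with Cargo's custom profiles (the feature discussed in the linked PRs); a sketch, with an illustrative profile name:

```toml
# Cargo.toml — a hypothetical middle-ground profile: incremental and
# moderately optimized, sitting between dev and release.
# (Requires custom-profile support in Cargo; the name is illustrative.)
[profile.dev-opt]
inherits = "dev"
opt-level = 2
incremental = true
```

With custom profiles available, this would be selected with something like `cargo build --profile dev-opt`.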

https://github.com/rust-lang/cargo/issues/2007
https://github.com/rust-lang/cargo/pull/5326#issuecomment-379890817
https://github.com/rust-lang/rfcs/pull/2282

This has come up many times, but for some reason it was never implemented. The discussions about it turned into talk about "workflows" and "profile overrides", although it's not very clear to me why:

I personally think (though I may be wrong) that the primary use-case for this is compiling specific dependencies with optimizations in debug mode, in which case it's unclear we even need custom profiles, and not just "being able to specify overrides for existing profiles".

Here is an update regarding the lolbench experiment: the first nightly with incremental release builds is nightly-2019-01-28 and the last one is nightly-2019-02-06. The following is a list of benchmarks that regressed by more than 1% in the range between these dates:

I've looked through all of them and didn't find any case that looked like an actual regression. They are just slightly flaky benchmarks that show spikes before, during, and after the experiment.

Unless something went wrong with the experimental setup, it seems that incremental ThinLTO produces code that is just as fast as the one produced by regular ThinLTO. At least for the ~600 benchmarks that lolbench runs.

(@anp lolbench is so awesome!! :heart:)

@lnicola I've updated my custom profiles implementation now, here: https://github.com/rust-lang/cargo/pull/6676 . Maybe it can be useful for this issue.

This is so cool 😍.

If any perf regressions do emerge, I'd like to briefly plug sending a quick PR to lolbench so they can be caught in the future.

@michaelwoerister I was wondering, are there more blockers to this that you know of?

Just the ones listed in the original post. Although, maybe also Windows performance. I worked on Windows 10 for a few days and got the impression that incr. comp. is very slow there. That should be verified though.

@michaelwoerister Could NTFS' infamous low performance on many small files be at fault here?
Any chance we can coalesce some of those files together?

I'd rather blame Windows Defender, which is enabled by default on Windows 10.
If that's the case, you will see it in Task Manager, maxing out a single core.

Re Defender, I've seen someone investigating parallel extraction of the installation packages to split the scanning across multiple cores, since it's synchronous.

Does the compiler use multiple threads when writing the files? Or, going the other way, would sticking them in an SQLite database help?
