There's been some talk about switching RLIBs to "MIR-only", that is, make RLIBs contain only the MIR representation of a program and not the LLVM IR and machine code as they do now. This issue will try to collect some advantages, disadvantages, and other concerns such an approach would entail:

Advantages:

- Duplicate monomorphizations of generic functions could be collapsed across the crate graph (today each crate translates its own copy, kept apart via `-C metadata`).
- `libstd` is compiled with `-C debuginfo=1`, which is good in general but as a side effect increases the size of Rust binaries, even if they are built without debuginfo (because the debuginfo from `libstd` gets statically linked into the binaries). This problem would not exist with MIR-only rlibs.
- `libstd` could be compiled with sanitizer support (see #38699), as @japaric points out.
- Everything (including `libstd`) can be compiled with `-C target-cpu=native`, potentially resulting in better code, as @japaric points out.

Disadvantages:

- Some people rely on `pub #[no_mangle]` items being exported from RLIBs and link against them directly. This would not be possible anymore, as @nagisa points out.
- MIR-only libs would not be platform independent. One could think that that should be the case but because of `cfg` switches, MIR is not platform independent either.
- The leaf crates (executables, staticlibs, dylibs, cdylibs) would take more time to compile. We have `-C codegen-units` already, which provides a means of reducing super-linear optimizations.

Please help collect more data on the viability of MIR-only RLIBs.
cc @rust-lang/core @rust-lang/compiler @rust-lang/tools @rkruppe
This is also potentially breaking people who are linking to rlibs expecting them to at least expose the `extern` `#[no_mangle]` functions like they do currently.
I did that at least once before, though the application where I did it was already very hacky for other reasons and I do not think the project is around anymore.
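For readers unfamiliar with that use case, here is a sketch of the kind of item involved (the function name is made up). With today's rlibs, the machine code and the unmangled symbol already sit in the archive's object files, which is what made linking against an rlib directly possible; with MIR-only rlibs the symbol would only materialize when the final binary or staticlib is translated:

```rust
// Hypothetical exported item. A foreign build system could resolve the
// literal symbol "my_ffi_entry" against the object files inside the
// rlib archive today; MIR-only rlibs would contain no such symbol.
#[no_mangle]
pub extern "C" fn my_ffi_entry(x: u32) -> u32 {
    // trivial body, just to keep the example self-contained
    x.wrapping_mul(2)
}

fn main() {
    // Called directly here for demonstration purposes.
    assert_eq!(my_ffi_entry(21), 42);
}
```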
Another advantage I see is that pure MIR RLIBs effectively let us "recompile" `std` with different codegen options without needing std-aware Cargo or Xargo. This is assuming that the `std` component that `rustup` installs will also be pure MIR.

Basically, `cargo rustc --release -- -C target-cpu=native` would optimize `std` for the host CPU "on the fly". Today, this requires using Xargo to recompile `std` (i.e. `RUSTFLAGS="-C target-cpu=native" xargo build`).
Another case where one uses Xargo to recompile `std` is producing an executable that aborts on `panic!`s without the overhead of landing pads (the `std` component that `rustup` installs contains landing pads because it's compiled with `-C panic=unwind`). With pure MIR RLIBs, after you set `panic = "abort"` in your `Cargo.toml`, `cargo build` will give you an executable that doesn't contain landing pads (everything would get compiled with `-C panic=abort`).
cc @brson ^ pure MIR RLIBs would eliminate the need for std-aware Cargo and Xargo for some scenarios.
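For reference, the Cargo side of that workflow is just a profile setting; a minimal sketch of the leaf crate's manifest (the package details are placeholders):

```toml
[package]
name = "my-app"      # placeholder crate name
version = "0.1.0"

# With pure MIR RLIBs, these settings alone would cause everything,
# std included, to be translated with -C panic=abort.
[profile.dev]
panic = "abort"

[profile.release]
panic = "abort"
```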
I expect the above will also make using sanitizers (cf rust-lang/rust#38699) straightforward. Using a sanitizer requires (re)compiling everything with an extra LLVM pass and linking to the sanitizer runtime, which is written in C/C++. With pure MIR RLIBs, using a sanitizer would become as simple as `cargo rustc -- -Z sanitizer=address`; that would compile everything, including `std`, with the extra LLVM pass and also link the runtime, which would be provided as an e.g. `librustc_asan.rlib` in the `std` component.
cc @alexcrichton ^ relevant to sanitizer support
I think we support "bundling" native libraries into RLIBs. We might still need to keep supporting this, even if we don't store machine code originating from Rust?
This would be required for the "easy sanitizers" scenario I'm describing above.
Or we could ship the sanitizers as "static libraries", i.e. `librustc_asan.a`.
Also, note that today one can build statically linked Rust programs using the MUSL targets without needing to have MUSL installed, because `libc.a` is embedded inside the std rlib (`libstd.rlib`) that ships with the `std` component.
@japaric that's a tricky advantage, as it prevents us from adding any MIR optimisations that depend on the codegen options set :) We already have one which acts upon `-Z no-landing-pads` (ergo `-C panic=abort`).
@nagisa @japaric isn't platform independence listed in the issue description as a non-advantage?
> MIR-only libs would not be platform independent. One could think that that should be the case but because of `cfg` switches, MIR is not platform independent either.
I'd add to the advantages that it would add more parallelism: the passes up to MIR take less time than the passes up to codegen, and in combination with `codegen-units` you can now compile the code more in parallel than before. E.g. right now when I bootstrap the compiler, the "whole world" waits for the rustc crate to compile in a single thread. With the change, we wait less, as only its MIR has to be available before we can continue. Afterwards, when doing the codegen for the binary, we can simply use codegen-units to get the maximum amount of parallelism the hardware gives us.
Could you elaborate on how it would prevent you from adding such optimizations? The way I see it is that the `std` component will probably continue to be compiled with `-C panic=unwind`, so if you then compile your app with `-C panic=abort`, LLVM won't be able to optimize as well (or as fast) as if you had recompiled `std` with `-C panic=abort`, because of the MIR optimizations you mention. However, we would still be better off than today, where the `std` component is shipped filled with landing pads. Or does LLVM always emit landing pads everywhere if the MIR "optimization" you mention is not present? (In that case, it no longer sounds like an optimization but more like a requirement.)
If you want the most optimized code possible then, yeah, you would have to use Xargo or std-aware Cargo to opt into MIR optimizations that depend on codegen options. While you are at it, you can also throw in `-Z mir-opt-level=3`, etc.
I agree that MIR optimisations don't really prevent you from having platform-agnostic MIR. As both their input and output is MIR, those optimisations could be run in the leaf crates, once the target and other info is known.
However, if earlier stages in the compiler depend on the target, which is the case with `cfg`, one would either have to refactor the entire compiler to understand `cfg`s in all later stages, or simulate compilation with all possible combinations of `cfg`s enabled/disabled (in the end, a `cfg` is an on/off question). The first approach would probably hugely bloat the code complexity of the compiler; the second would bloat runtime complexity exponentially in the number of distinct `cfg`s used.
So MIR will probably stay platform dependent for some time.
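A tiny illustration of the point (made-up function): the `cfg` is resolved during expansion, long before MIR is built, so any MIR stored in an rlib can only ever contain one of the two bodies:

```rust
// Only one of these two items survives cfg-stripping, so the MIR an
// rlib would store for `word_bytes` is already target-specific.
#[cfg(target_pointer_width = "64")]
fn word_bytes() -> usize { 8 }

#[cfg(not(target_pointer_width = "64"))]
fn word_bytes() -> usize { 4 }

fn main() {
    // On any given compilation target, exactly one definition exists.
    assert!(word_bytes() == 8 || word_bytes() == 4);
}
```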
@est31
> isn't platform independence listed in the issue description as a non-advantage?
I'm not sure what you are trying to get at? The `-C target-cpu=native` optimizations I'm referring to are about LLVM having access to the IR of all functions so it can apply autovectorization, CPU scheduling optimizations, etc. Whereas today `-C target-cpu=native` is not as good because `libstd.rlib` already contains machine code that was optimized with `-C target-cpu=generic`. All these optimizations are "within an architecture", e.g. x86, and happen after e.g. `cfg(target_arch)` has taken effect, so I'm not sure how "platform agnostic MIR" is related.
@japaric Removing landing pads from MIR is already somewhat of a problem, since you cannot add them back after the fact, so you already lose some of the so-called advantage by being unable to reverse that. Later on we might want to add something more invasive. For a completely hypothetical example, consider something resembling autovectorisation which, again, is not exactly reversible, and thus `-C target-feature=-stuff` would become a no-op as well. `-C debuginfo=2`? Stripped, to keep binaries smaller because of `-C debuginfo=0` before. `-C debug-assertions`? A no-op even without MIR optimisations, as debug assertions are essentially a `#[cfg]`.
So, what I'm trying to say is that specifying codegen options on leaf crates only would still not be equivalent (and would diverge more over time with extra hypothetical MIR opts) to specifying the codegen option(s) for every crate.

You could (as @est31 did just now) argue for storing unoptimised MIR instead, but that, in addition to increasing the size of intermediate rlibs, serializes MIR opts.
> isn't platform independence listed in the issue description as a non-advantage?
Codegen options aren't exactly related to platform independence in this context.
@michaelwoerister
I'm not sure if this can be listed as an advantage, but pure MIR RLIBs would have prevented #38824. The TL;DR is that LLVM raises assertions when you try to lower functions that take/return `i128` values to PTX code / MSP430 instructions, because of bugs in LLVM. With pure MIR RLIBs, I expect that if the leaf crate doesn't make use of `i128` at all, then those functions that use `i128` would never be fed into LLVM, thus the LLVM assertions wouldn't have been triggered. I suppose that would be some sort of "dead code elimination" pass at the MIR level. So, basically, less IR could be fed into LLVM with the right analysis.
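A minimal sketch of the shape of that idea (function names invented): with MIR for everything available, a reachability analysis starting from the leaf crate's roots would simply never visit the `i128`-using helper, so backends like PTX or MSP430 would never be asked to lower it:

```rust
// Stand-in for a library function that uses 128-bit integers
// internally; a leaf crate that never reaches it wouldn't need it
// lowered at all under MIR-level dead code elimination.
#[allow(dead_code)]
fn widening_mul(a: u64, b: u64) -> u128 {
    (a as u128) * (b as u128)
}

// The "leaf" code, which never touches 128-bit integers.
fn leaf_entry() -> u64 {
    40 + 2
}

fn main() {
    assert_eq!(leaf_entry(), 42);
}
```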
@japaric
> I'm not sure what you are trying to get at?

Ah, sorry, I've misread; you only talked about codegen options.
cc @solson @oli-obk (Miri developers)
Great points @nagisa, @japaric, and @est31! I've added all of them to the list.
I don't think it will affect const evaluation one way or another, but it would help us test Miri outside rustc to be able to easily build dependencies as MIR-only rlibs (with MIR for _all_ items, not just generic/inline/constant ones like in the existing metadata).
Largely, Miri is just like another backend in this context, so it is an instance of this previously mentioned advantage:
> There seems to be some indication that MIR-only RLIBs would help with making the Rust compiler more backend agnostic (see WASM-related issue #38804).
The problem of caching machine code would be solved in a generalized form by incremental compilation. One has to keep in mind though that incremental compilation will produce less performant code because it prevents many opportunities for inlining.
MSVC using `/LTCG:INCREMENTAL` is able to achieve LTO with incremental compilation, at very fine granularity, without sacrificing inlining. According to a blog post, the runtime performance cost of their incremental LTCG vs standard LTCG is less than half a percent, while providing massive gains in link time. So doing something equivalent in Rust is definitely a practical possibility, although it would require a significant amount of support from LLVM. Hopefully ThinLTO will be the magic bullet that provides the necessary support.
Regarding `#[no_mangle] pub` items from rlibs: While it's unfortunate to break anyone's use case, I think this is only a minor disadvantage. It is not documented that rlibs are ordinary archives with some special contents; in fact, this is an implementation detail. In addition, we've had breaking changes in compiler output (e.g., #29520, and at this very second #38876) for lesser reasons.
(I would have more sympathy if someone could give a good reason for using rlibs as archives that isn't already covered by `staticlib`, `cdylib`, and other existing tools.)
I'm very enthusiastic about this. I think separating type checking and code generation into two phases is smart, no matter the exact strategy for when the MIR finally gets translated. It gives us a lot of flexibility for coordinating the build. For example, we don't have to delay code generation until the final crate: Cargo itself could spawn parallel processes to do code generation for already-typechecked crates, while their downstreams continue type checking.
By collapsing duplicate monomorphizations, I'm hopeful that this will lead to significant improvements to the major disadvantages of monomorphization, the bloat and the compile time. We could end up in a position where we can say, "the generics model is like C++, but more efficient". That could be a major advantage.
One significant disadvantage with this model is link-time scalability. This will put massive memory pressure on the leaf crate builds, and that could bite us in the future as bigger projects are written in Rust.
LTO is a downside too, because of compile time. I'd expect we'd need a range of strategies for the actual codegen, to accomplish different goals in `-O0` vs `-O3`.
> The leaf crates (executables, staticlibs, dylibs, cdylibs) would take more time to compile

I'm very worried about this for Servo. There are currently 319 crates in the dependency graph, but after an initial build only a few of them are recompiled in the typical edit-build-test cycle. Even so, compile times are already pretty bad.

Do MIR-only rlibs mean doing code generation for the entire dependency graph every time? This sounds like an unacceptable explosion of compile times.
@SimonSapin I see no point in experimenting with this on Servo's scale without enabling incremental recompilation (with ThinLTO in the future, too).
Btw I hear @rkruppe is making good progress towards such a compilation mode.
We discussed this in the last @rust-lang/tools meeting, and the consensus was that this looks like a good idea in many ways, but we will not pursue it as long as it would mean a significant compile-time regression.
So then we'll be pursuing this as soon as Rust is able to fully take advantage of incremental compilation using ThinLTO?
Given the recent work to make the compiler incremental, I wonder if it will be possible to perform incremental builds at the level of individual functions, caching anything that hasn't changed. That could allow amazing feats, such as executables that are incrementally updated as the user compiles their source code.
Just jotting this down before I forget about it again: Currently, `static`s are always translated locally, never cross-crate. If we stick to this, rlibs would still generate some object files that contain only statics, no code. However, that invites a bunch of headaches. For example, if a static references a function (e.g. an interrupt vector table storing function pointers), we'd need to translate those too, or remember them somewhere and use them as roots for trans item collection in downstream crates.

So it would be cleaner to also delay translation of statics to the final binary/staticlib/cdylib. This requires non-trivial refactoring though, as a lot of the current code is written under the assumption that all statics to translate come from the current crate (e.g., `TransItem::Static` stores a `NodeId`, not a `DefId`).
It also means metadata needs a way to enumerate all the statics and other collector roots (monomorphic functions, and some more things in "eager" mode) from other crates. The information is all there, but there's no efficient/easy way to enumerate them.
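A small example of the interrupt-vector shape mentioned above (simplified to plain function pointers): the static is the only thing keeping the handlers alive, which is why statics must act as collection roots; otherwise the handlers' MIR would never be translated downstream.

```rust
// Nothing calls these handlers directly; they are only reachable
// through the static table, so a trans-item collector that doesn't
// treat statics as roots would never emit code for them.
fn on_timer() -> u32 { 1 }
fn on_uart() -> u32 { 2 }

static VECTOR_TABLE: [fn() -> u32; 2] = [on_timer, on_uart];

fn main() {
    assert_eq!(VECTOR_TABLE[0](), 1);
    assert_eq!(VECTOR_TABLE[1](), 2);
}
```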
We can do a stepwise migration towards MIR-only RLIBs:

- encode the MIR of all items in the rlib (currently we only do so for items that may be needed cross-crate, like generic/`#[inline]` functions and constants)
- run the MIR inliner across crates (not just for functions marked `#[inline]`)
- optimize the MIR before storing it (this exposes the issues with statics mentioned above, because we now can inline functions that contain statics and aren't marked `#[inline]`)
Can you elaborate on the parenthetical? I don't think MIR inlining on its own can have any effect on where and how statics are translated. Even when statics are lexically nested in a function, they're not part of the function's MIR. Statics are also trans-item-collect'd separately from MIR (as part of walking the HIR of the current crate), at least last time I checked.
> Can you elaborate on the parenthetical?
I have been getting undefined references to statics inside functions that were inlined into other crates when compiling libstd via Xargo with `-Z always-encode-mir -Z mir-opt-level=3`. I have done some digging, but didn't get to the root.
I believe I found yet another benefit to MIR-only rlibs: Currently, `cargo check` builds the entire dependency graph with `--emit metadata` only, to avoid running translation. The downside is that the output of `--emit metadata` (`*.rmeta` files) is sufficiently different from normal (rlib) outputs that you need to rebuild the whole dependency graph when it's time to `cargo build` (and conversely, if you have a full build and run `cargo check`, all rmeta files are generated from scratch). This duplicates compilation effort and metadata on disk.
With MIR-only rlibs, the only remaining differences between rlibs and rmeta files would be (1) that the rlib wraps the metadata in an archive file, and (2) that the archive includes bundled native libraries, if any. Creating the archive should have negligible cost, so we could probably get rid of rmeta files and make `--emit metadata` behave like `--emit link` for non-leaf crates. It would still need to avoid running codegen in leaf crates (so it's not quite an alias), but it would greatly reduce the aforementioned duplication.
(Going further, @nagisa (I think?) once suggested to me on IRC that metadata and machine code should be two separate files on disk. I found this appealing for other reasons, but to stay on-topic, such a split would make it possible to pick up a previously-generated rmeta file and generate all the machine code and so on from it, without recompiling the leaf crate from scratch. But that is mostly orthogonal to MIR-only rlibs, so whatever.)
@oli-obk re: the undefined references to statics: I don't have much time to investigate, but one culprit I can think of would be `internalize_symbols`. Specifically, if a (non-`pub`) `static` is apparently only used from within the CGU it's defined in, it is marked as internal, and LLVM will remove it if it's not accessed anywhere in that CGU.
Anyway, it would be great if you could file an issue for that (if there isn't one already) with a small test case. This is definitely a bug, but so far I don't believe it's an issue with statics getting translated locally.
Uuh, I'm scratching my head about what I'm missing here: how is this going to work with `-C linker=`? We're relying on the fact that the hash of the input file to the linker is the same on every invocation. If the objects get translated to native code before being passed to the linker, is the translation stable? Or will the target's system linker actually only ever see a single, already relocated and re-ordered object file?
This issue only affects which Rust code (or monomorphization of generic Rust code) gets translated into which LLVM compilation unit. It doesn't affect what happens afterwards with these LLVM modules, the resulting object files, etc. And while it's plausible that MIR-only rlibs would enable more innovation in the later stages of the backend, nothing along those lines has been proposed or even discussed as far as I remember.
Another (marginal) benefit, assuming `#[inline]` stops copying function bodies into multiple codegen units as discussed in the context of #44941: `#[inline]` becomes less necessary (it only adds `inlinehint` instead of enabling inlining at all in certain cases) and less complicated (easier to explain, easier to tell if it's useful).
I've put together a proof-of-concept implementation of this in https://github.com/rust-lang/rust/pull/48373. Although the implementation crashes for many crates, I was able to collect timings for a number of projects. The tables show the aggregate time spent for various tasks while compiling the whole crate graph. In many cases we do less work overall but due to worse parallelization, wall-clock time increases. I.e. everything seems to be bottlenecked on the MIR-to-LLVM translation in the leaf crates. To me this suggests that MIR-only RLIBs are blocked on the compiler internals being parallelized.
ripgrep - `cargo build`
| | regular | MIR-only | % |
|------------------------|----------|----------|----------|
| LLVM codegen passes | 33.90 | 32.52 | 95.9 % |
| LLVM function passes | 1.39 | 1.35 | 97.5 % |
| LLVM module passes | 2.18 | 1.95 | 89.8 % |
| MonoItem collection | 2.80 | 2.09 | 74.4 % |
| translation | 23.73 | 19.97 | 84.1 % |
| LLVM total | 37.46 | 35.83 | 95.6 % |
| BUILD total | 20.92 | 26.14 | 125.0 % |
encoding-rs - `cargo test --no-run`
| | regular | MIR-only | % |
|------------------------|----------|----------|----------|
| LLVM codegen passes | 13.11 | 7.28 | 55.6 % |
| LLVM function passes | 0.57 | 0.33 | 58.1 % |
| LLVM module passes | 0.90 | 0.44 | 48.7 % |
| MonoItem collection | 1.19 | 0.69 | 58.1 % |
| translation | 8.68 | 6.08 | 70.1 % |
| LLVM total | 14.59 | 8.06 | 55.2 % |
| BUILD total | 15.73 | 14.37 | 91.4 % |
webrender - `cargo build`
| | regular | MIR-only | % |
|------------------------|----------|----------|----------|
| LLVM codegen passes | 109.42 | 69.17 | 63.2 % |
| LLVM function passes | 4.55 | 3.10 | 68.2 % |
| LLVM module passes | 1.63 | 1.06 | 64.7 % |
| MonoItem collection | 10.70 | 5.64 | 52.7 % |
| translation | 102.95 | 58.70 | 57.0 % |
| LLVM total | 115.60 | 73.33 | 63.4 % |
| BUILD total | 72.30 | 68.64 | 94.9 % |
futures-rs - `cargo test --no-run`
| | regular | MIR-only | % |
|------------------------|----------|----------|----------|
| LLVM codegen passes | 41.19 | 48.67 | 118.1 % |
| LLVM function passes | 1.68 | 1.93 | 115.0 % |
| LLVM module passes | 0.21 | 0.22 | 107.3 % |
| MonoItem collection | 5.86 | 6.90 | 117.8 % |
| translation | 55.48 | 69.84 | 125.9 % |
| LLVM total | 43.08 | 50.82 | 118.0 % |
| BUILD total | 17.28 | 19.18 | 111.0 % |
tokio-webpush-simple - `cargo build`
| | regular | MIR-only | % |
|------------------------|----------|----------|----------|
| LLVM codegen passes | 33.98 | 22.55 | 66.4 % |
| LLVM function passes | 1.53 | 0.95 | 62.0 % |
| LLVM module passes | 0.30 | 0.20 | 66.8 % |
| MonoItem collection | 3.80 | 2.11 | 55.5 % |
| translation | 39.09 | 22.99 | 58.8 % |
| LLVM total | 35.81 | 23.70 | 66.2 % |
| BUILD total | 22.28 | 21.54 | 96.7 % |
Number of LLVM function definitions generated for whole crate graph
| | MIR-only | regular |
|----------------------|----------|-----------|
| ripgrep | 22683 | 34239 |
| encoding-rs test | 8393 | 15116 |
| webrender | 72238 | 114239 |
| futures-rs test | 57565 | 46935 |
| tokio-webpush-simple | 27346 | 44961 |
What if we did this, but for `libcore`...`libstd`, at stage1? It might be worth it, despite the huge number of tests, and should be a huge improvement when running just a few tests.

(prompted by @dwijnand's comments on Discord about their workflow of changing `librustc` and re-checking only one test - with incremental, most of the time is spent building `libcore`...`libstd`)
EDIT: here's some data, since I wanted to replicate what @dwijnand was seeing (after a full `./x.py check src/libstd`). Timings after `touch src/librustc/lib.rs` (using `stage1/bin/rustc` - until #53673 reaches beta):
- until #53673 reaches beta)Building stage0 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 1m 01s
Building stage0 codegen artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu, llvm)
Finished release [optimized] target(s) in 48.75s
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 7m 39s
Building stage1 test artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
Finished release [optimized] target(s) in 30.12s
Most of the time is spent building libstd, which should be improved once #53673 ends up in beta (perhaps at the cost of the `rustc` build time?), so the performance impact of using MIR-only rlibs might become less significant - we'll have to wait and see, I suppose.
Now that Cargo passes `--embed-bitcode=no`, is there anything left to do for this?
It turns out I was confused - this issue is about never going through LLVM at all, while `--embed-bitcode=no` merely stops embedding LLVM bitcode alongside the object code generated by LLVM.
https://rust-lang.zulipchat.com/#narrow/stream/122651-general/topic/How.20to.20learn.20more.20about.20crate.20metadata.3F/near/216169138