Rust: Implement cross-language ThinLTO

Created on 11 Apr 2018 · 50 Comments · Source: rust-lang/rust

What is cross-language LTO?

Rust uses LLVM as its code generation backend, as do the Clang C/C++ compiler and the compilers of many other languages. As a consequence, all of those LLVM-based compilers can produce artifacts that can take part in a common Link-Time-Optimization step, irrespective of the source language. Thus, in this context, cross-language LTO means that we enable the Rust compiler to produce static libraries that can make use of the LLVM-LTO-based linker plugins that exist for newer versions of ld and gold, and in lld.

Why is cross-language LTO a good thing?

In order for Rust to interoperate with code written in other languages, calls have to go through a C interface. This interface poses a boundary for inter-procedural optimizations like inlining. At the same time inter-procedural optimizations are very important for performance. Cross-language LTO makes this boundary transparent to LLVM, effectively allowing for C/C++ code to be inlined into Rust code and vice versa.

How can it be implemented?

There are several options. The basic requirement is that we emit LLVM bitcode into our object files in a format that the LLVM linker plugin can handle. There are two formats that fulfill this requirement:

  1. We emit .o files that actually aren't object files but plain LLVM bitcode files.
  2. We add the LLVM bitcode of an object file into a special .llvmbc section of the object file.

Given these requirements there are a few ways of implementing the feature:

  1. Always emit bitcode into object files instead of storing them as separate files in RLIBs

    • Pros

      • simple for users, it just works

      • this is something that some platforms, like iOS, want to have anyway

    • Cons

      • staticlibs would contain bitcode, even though it might not be needed

      • the Rust compiler would have to be modified to read bitcode out of a section instead of a separate object file

      • we could not store bitcode compressed anymore

  2. Just stabilize -Z embed-bitcode and require users to do the rest

    • Pros

      • simple to implement

    • Cons

      • harder to use (needs user intervention)

      • unclear how to integrate with Cargo

      • RLIBs generated this way will contain bitcode twice

  3. Add a flag that makes rustc emit bitcode files instead of object files

    • Pros

      • no redundant codegen

      • no redundant machine code

    • Cons

      • harder to use (needs user intervention)

      • produces libraries that are incompatible with regular compilation, which is weird

      • even stranger special-casing in the Rust compiler backend

      • unclear how to integrate with Cargo

  4. Add a -C cross-language-lto flag that (1) makes the compiler embed bitcode into RLIBs and static libraries, and (2) makes the compiler invoke the linker with the LLVMgold.so plugin, if applicable.

    • Pros

      • would make cross-language LTO available for binaries built with rustc

      • rustc can skip the redundant ThinLTO step for binaries and dylibs

      • RLIBs and staticlibs would be bigger but it's on an opt-in basis

    • Cons

      • since LTO is deferred to the linker, it would not be integrated with the Make jobserver

      • harder to use (needs user intervention)

      • unclear how to integrate with Cargo

I think I would opt for option (1) since it's the most straightforward to use. EDIT: Added option (4) which I also like.

cc @rust-lang/compiler @alexcrichton
(@rust-lang/wg-codegen might also be interested in this)

A-LLVM C-feature-request C-tracking-issue T-compiler WG-codegen

All 50 comments

I think I personally like option 4 best, although I might spin it a little differently as -C lto=cross-language or something like that. In that case rustc would do nothing for LTO other than ensuring it's ready to LTO inside the linker, and then the linker could do all the work.

With -C lto=cross-language, would rustc take care of invoking the linker correctly (e.g. when compiling an executable)? And if so, how would rustc know whether to invoke the linker for thin or for full LTO?

@michaelwoerister yeah I think we could try to pass all the right options by default. I'm not actually sure how you configure full/thin at the linker layer?

One neat thing we could do, though, is that if you're on MSVC, for example, we could switch to lld by default or something like that

> I'm not actually sure how you configure full/thin at the linker layer?

By passing -plugin-opt=thinlto to the linker, I think.

If we did it as -C lto=cross-language, we'd need another way of selecting thin vs full. Or have -C lto=thin-cross-language, which I find aesthetically displeasing :)

I'd love if we could shift all of LTO into the linker completely. But that would mean that we essentially can't use the MSVC linker anymore. And the Make jobserver story would regress too.

-Clto=thin,cross?

@michaelwoerister ah good point, in that case having a separate -C cross-language-lto seems fine by me (or @nagisa's idea)

@alexcrichton, do you know how to enable building the LLVMgold.so linker plugin? When I build LLVM from SVN, it gets built automatically but for rustc's LLVM version that doesn't seem to be the case.

Ah I've never built it myself so I'm not sure :(

A little status update:

  • #50000 implements option (4), but does not make rustc invoke the linker with the plugin yet.
  • Combining C code compiled with clang 5 and Rust code compiled with LLVM 6 does not seem to work very well. It complains that some functions are not marked as dso_local, which is something that the ThinLTO "promotion" pass seems to do for newer LLVM versions.
  • I was able to make clang inline Rust code via both fat and thin LTO when using clang 6 and nightly Rust. I had to use the Gold linker because the ld version on Ubuntu 16.04 does not support plugins yet. It might also work with newer versions of ld (see http://llvm.org/docs/GoldPlugin.html).
  • The Gold linker complains that the target triples of Rust and C don't match. Clang uses x86_64-pc-linux-gnu while Rust uses x86_64-unknown-linux-gnu. The mismatch should be harmless (the linker still performs LTO), but it would be nice to get rid of the warning. Would anybody in @rust-lang/infra or @rust-lang/compiler know how to make either compiler use a different target triple?
  • What I got working so far is compiling a Rust staticlib (with -Cpanic=abort) and making clang inline Rust functions into a C program:
rustc --crate-type=staticlib lib.rs -Zcross-lang-lto -O -Ccodegen-units=1 -o libsome_rust_lib.a -Cpanic=abort
clang-6.0 -c -o main.o -O2 -flto=thin main.c
clang-6.0 -O2 -o main.exe -fuse-ld=gold -flto=thin -Wl,-plugin-opt=save-temps -Wl,-start-group -L. -lsome_rust_lib main.o -Wl,-end-group
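
The thread does not show the contents of lib.rs. As a rough illustration only, a staticlib crate for the commands above might export a single small C-ABI function like the one below. The name `rust_add_one` is hypothetical, and the attribute spelling `#[unsafe(no_mangle)]` assumes a current toolchain (the 2018 nightly used in the thread would have written `#[no_mangle]`).

```rust
// lib.rs: hypothetical contents, not taken from the thread.
use std::os::raw::c_int;

/// A function small enough that the linker's LTO step has a chance to inline
/// it into the C caller. `wrapping_add` avoids an overflow panic path, which
/// fits the -Cpanic=abort setting used in the commands above.
#[unsafe(no_mangle)]
pub extern "C" fn rust_add_one(x: c_int) -> c_int {
    x.wrapping_add(1)
}
```

On the C side, main.c would declare the function as `int rust_add_one(int x);` and call it like any other C function.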

The next steps are:

  • Make it work the other way round: inline C code into Rust.
  • Experiment with the Firefox codebase, see how far we already get there.

TODO: Make rustc not run ThinLTO passes if -Zcross-lang-lto is given, since the linker will do them later anyway.

I now got it working the other way round too: inlining C code into a Rust executable. It turned out that the linker plugin was actually properly importing functions from the C module. But then it would refuse to inline the C functions. The missing piece was the -plugin-opt=mcpu=x86-64 flag that the linker plugin seems to need to actually perform the inlining.
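
To make the flag handling concrete, here is a minimal sketch of how a build tool could assemble the plugin options mentioned so far in this thread (`thinlto` and `mcpu=`). The function name and signature are hypothetical, this is not how rustc does it, and the `-Wl,` prefix assumes the link goes through a compiler driver such as clang, as in the commands above.

```rust
/// Builds linker-plugin arguments for an LTO link.
/// Hypothetical helper: only the two plugin options named in this thread are emitted.
fn plugin_link_args(thin: bool, target_cpu: &str) -> Vec<String> {
    let mut args = Vec::new();
    if thin {
        // Select ThinLTO instead of full ("fat") LTO in the linker plugin.
        args.push("-Wl,-plugin-opt=thinlto".to_string());
    }
    // Without an explicit CPU the plugin refused to inline across languages.
    args.push(format!("-Wl,-plugin-opt=mcpu={}", target_cpu));
    args
}
```

For example, `plugin_link_args(true, "x86-64")` yields the two arguments `-Wl,-plugin-opt=thinlto` and `-Wl,-plugin-opt=mcpu=x86-64`.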

So here is something interesting: I managed to build the Rust compiler and its LLVM with this and I'm seeing compile times reduced by 15-20% for release builds. Most of the reduction seems to come from LLVM just being faster. Translation is only marginally faster. Although I just realized that this isn't a fair comparison since the non-LTO version is compiled with GCC while the LTO version is compiled with CLANG...

OK, I did another comparison, this time building LLVM with CLANG instead of GCC and the CLANG version is quite a bit faster. It seems that at least 50% of the speedup observed before is just from building with CLANG 6.0 instead of GCC 5.4.

@michaelwoerister holy cow! Sounds like we should definitely be building with Clang!

Do you have time to work on that or should I try to get that landed?

@alexcrichton You certainly have more experience with the docker images and sccache. Just switching to Clang shouldn't be too hard. Also enabling ThinLTO is a bit more involved because of the linking step.

Ok cool. @michaelwoerister what was the benchmark you were using? (to verify the claims as well)

FWIW for Linux at least we're using a pretty ancient gcc, 4.7, and newer versions may actually have better optimizations as well

I was using the regex (current master) and style-servo (from rustc-perf) as benchmarks. Newer GCC versions will probably generate faster code than 4.7 but with Clang we have the future option of also using ThinLTO, so I think that's the better choice in any case.

Agreed! We mainly gotta figure out how to convince clang to work well with our custom libc builds we have all over the place. I'll work on switching to Clang 6 for everything in the near future.

> I managed to build the Rust compiler and its LLVM with this

How did you deal with rdylibs here?

> It seems that at least 50% of the speedup observed

What were the absolute numbers of the speedup observed, though?

> How did you deal with rdylibs here?

I didn't have to do anything special for them.

cc @rust-lang/wg-codegen, btw

> What were the absolute numbers of the speedup observed, though?

Hm, I'm getting a bit different timings than yesterday, but they are even better :)
EDIT: Scratch that, my test compiler needs special settings. Updating in a sec...
EDIT2: Now the numbers should be correct (and are more in line with what I measured yesterday)

| | GCC 5.4 | CLANG 6.0 |
|-----------------|--------:|----------:|
| regex (release) | 12.85s | 10.5s |
| stylo (release) | 78.8s | 70.6s |
| regex (debug) | 6.25s | 5.9s |
| stylo (debug) | 68.5s | 64.3s |

This is LLVM compiled with Clang and Gold. The measurement is of building the whole crate graph each time.

@michaelwoerister so I've got a WIP at https://github.com/rust-lang/rust/pull/50200 which I'm trying to get a try build of so we can get an official perf run. I've locally though managed to build LLVM with Clang 6 on OSX, Windows, and Linux (on Linux using that docker build in the PR). Unfortunately though I'm only seeing modest to in-the-noise wins rather than the drastic 10% improvements you're seeing. The best I've found so far is locally on my Windows machine a full stylo build drops from 113 to 109s.

I'm curious though if I'm not doing quite the right optimizations or something like that? All I've done is switch the compiler and tweaked no other settings in all circumstances. On Linux, for example, I wasn't using Gold (which I don't think affects runtime performance much?), and I've been using ./x.py build and then directly using the resulting artifacts.

I did another test run, this time just changing CC and CXX in config.toml (i.e. not using Gold) and I get pretty much the same timings. I also ran with -j1 to make things more consistent:

| | GCC 5.4 | CLANG 6.0 |
|-----------------|--------:|----------:|
| regex (release) | 40.42s | 35.30s |
| stylo (release) | 304.50s | 270.83s |

So again about 10% improvement. Maybe my GCC version is especially bad?
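
As a quick check of the "about 10%" figure, the relative reduction can be computed directly from the -j1 table above. The helper name below is made up for this illustration.

```rust
/// Relative reduction in build time, in percent, when going from
/// `before` seconds to `after` seconds.
fn percent_faster(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}
```

With the numbers from the table, `percent_faster(40.42, 35.30)` comes out at roughly 12.7 for regex and `percent_faster(304.50, 270.83)` at roughly 11.1 for stylo.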

One thing to note is that I compiled both the GCC and the CLANG versions with codegen-units=1 but that shouldn't make a difference for the LLVM part of the compiler. Anyway, I'll try with codegen-units=8 and report back.

When compiling rustc with codegen-units=4 timings for -j1 look about the same but the gap indeed closes a little for -j=20:

| | GCC 5.4 | CLANG 6.0 |
|-----------------|--------:|----------:|
| regex (release) | 12.75s | 11.40s |
| stylo (release) | 84.2s | 74.19s |
| regex (release -j1) | 40.47s | 35.45s |
| stylo (release -j1) | 307.99s | 275.37s |

@michaelwoerister ok thanks for checking! I'll keep pushing on https://github.com/rust-lang/rust/pull/50200 so we can get an official perf run

I compiled rustc with cross-lang LTO again and could verify that functions from RustWrapper.cpp get inlined into librustc_trans-llvm.so :tada:

This seems to result in a measurable but not spectacular speedup for my two test crates:

| | GCC 5.4 | CLANG 6.0 | CLANG 6.0 + ThinLTO |
|-----------------|--------:|----------:|----------:|
| regex (release) | 12.75s | 11.40s | 10.30s |
| stylo (release) | 84.2s | 74.19s | 71.89s |
| regex (release -j1) | 40.47s | 35.45s | 33.11s |
| stylo (release -j1) | 307.99s | 275.37s | 252.31s |

I did some experimentation with cross-lang ThinLTO in Firefox. These are my findings:

  • On Linux, building FF with C++ ThinLTO fails at a very late stage. So I don't get a working browser but I get libxul.so which links together all Rust and C code, as far as I can tell.
  • When building with Rust & C++ ThinLTO enabled and using the LLD linker, we don't get any cross language inlining. LLD does not seem to treat the Rust object files as "bitcode enabled" (i.e. it does not generate temporary bitcode files showing the various optimization stages when being invoked with -plugin-opt=save-temps).
  • When building with the Gold linker that comes with FF (by specifying ac_add_options --enable-gold in .mozconfig), the linker crashes while linking libxul.so but it gets far enough to generate most save-temps bitcode files. In these files I could observe

    • that the C++ function Gecko_CSSValue_GetKeyword was successfully inlined into a Rust module, and

    • that the Rust function Servo_Property_IsAnimatable was successfully inlined into the calling C++ module.

So the outstanding problems here are to

  • either find out why Gold crashes
  • or find out why LLD does not see the LLVM bitcode in Rust object files.

@alexcrichton and @luser, what are the chances that we can turn on ThinLTO for building LLVM? Is ThinLTO compatible with sccache?

Can we use it together with incremental thinlto cache? https://clang.llvm.org/docs/ThinLTO.html#incremental

@michaelwoerister I believe we'd initially have to just take a hit to compile times. We build LLVM as a static library and link it directly into one of the dynamic libraries that we create. In that sense sccache has no opportunity to cache the ThinLTO output.

ThinLTO would probably happen when we compile the librustc_codegen_llvm crate. That's where we actually perform a fully linked artifact which internally links LLVM. At the start that link step would simply take quite a bit longer if we run ThinLTO (but we could measure this).

I do think we're positioned to turn on ThinLTO with LLVM for tier 1 platforms as soon as we're ready, AFAIK it's mostly rustbuild changes. All tier 1 platforms are using Clang 6 right now to compile LLVM

> ThinLTO would probably happen when we compile the librustc_codegen_llvm crate. That's where we actually perform a fully linked artifact which internally links LLVM.

Yes, that matches with my observations.

> At the start that link step would simply take quite a bit longer if we run ThinLTO (but we could measure this).

I'll see if I can put together a PR to get some numbers. Ideally we'd want to have cross-lang LTO to speed up librustc_codegen_llvm. However, that would have the additional complication that the Clang version we use should roughly match our LLVM version.

Another question: At least on Windows and Linux we should use LLD. Is that available on CI?

Windows has LLD available through the clang 6 download but I believe that for Linux we'll have to compile LLD from source

@rust-lang/core, I'd like to get this feature (cross-language LTO) stabilized. Do we need an RFC for that? Or is a tracking issue sufficient?

I'm not sure which exact approach we'd be stabilizing -- there are 4 described in this issue -- could you explain?

The feature that would be stabilized is cross-language LTO, meaning we'd provide facilities in the Rust compiler to have inlining and other optimizations across language boundaries, performed via linker-based LLVM-LTO plugins. Concretely a -C cross-lang-lto flag would be provided that takes the following options:

  • off - The default.
  • on - The object files generated by the compiler contain LLVM IR for use by linker plugins. If a linker is invoked by rustc (i.e. for executables and dylibs) then it will be invoked with the correct plugin arguments (opt-level, thin- or fat-LTO, target-cpu, etc). If the generated crate type does not require linking (i.e. rlibs and staticlibs) then object files contained in the output archives are ready for consumption by appropriate linkers.
  • <path to linker-plugin> - Same as on but with an explicit linker plugin specified. This will be passed on to the linker invoked by rustc. If the generated crate type does not require linking, this option is the same as on.
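
The three values listed above map naturally onto a small enum. The following is only a sketch of that mapping, not rustc's actual option parsing: the type and function names are hypothetical, and it assumes that the bare flag (no value) means the same as on, as in the usage examples in this comment.

```rust
use std::path::PathBuf;

/// Possible settings of the proposed -C cross-lang-lto flag (hypothetical type).
#[derive(Debug, PartialEq)]
enum CrossLangLto {
    /// Flag absent or explicitly "off": regular object files.
    Off,
    /// Emit LLVM IR in object files and let the linker find its plugin.
    On,
    /// Like `On`, but pass this explicit plugin path to the linker.
    LinkerPlugin(PathBuf),
}

/// Interprets the value given to the flag. `None` means the flag was passed
/// without a value; a caller would use `CrossLangLto::Off` if the flag is absent.
fn parse_cross_lang_lto(value: Option<&str>) -> CrossLangLto {
    match value {
        None | Some("on") => CrossLangLto::On,
        Some("off") => CrossLangLto::Off,
        // Anything else is treated as the path to a linker plugin.
        Some(path) => CrossLangLto::LinkerPlugin(PathBuf::from(path)),
    }
}
```

One consequence of this shape is that a plugin file literally named on or off could not be specified without a directory prefix, which seems like an acceptable trade-off for a sketch.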

Note that we don't want to guarantee a particular format for the object files generated and especially not that crates compiled with -Ccross-lang-lto can be used as dependencies when compiling without -Ccross-lang-lto. Making crates/object compatible with normal linkers comes at an additional cost in compile-time and file size and, when used in conjunction with ThinLTO, the resulting machine code would be less optimized. In practice, we'll do what Clang does, which is generate obj files that are really LLVM bitcode files. But that's an implementation detail.

Some examples of using this:

# C dependency in Rust
#=====================

# compile your C code and put it into a static archive
clang -c my_c_code.c -flto=thin -O2 -o my_c_code.o
llvm-ar rv libmy_c_code.a my_c_code.o

# Use rustc to compile your mixed Rust/C program, letting rustc take care of invoking the linker

# If clang/lld is not your default linker
rustc -Ccross-lang-lto -Clinker=clang -Clink-arg=-fuse-ld=lld -L. -O my_rust_code.rs 

# If clang/lld *is* your default linker
rustc -Ccross-lang-lto -L. -O my_rust_code.rs 

# If you want to use the Gold linker with a specific plugin
rustc -Ccross-lang-lto=<path to LLVMgold.so> -Clink-arg=-fuse-ld=gold -L. -O my_rust_code.rs

# Rust dependency in C
#=====================

# Compile your C code prepared for (Thin-) LTO
clang -c my_c_code.c -flto=thin -O2 -o my_c_code.o

# Compile your Rust code prepared for (Thin-) LTO into a staticlib
rustc -Ccross-lang-lto -O --crate-type=staticlib my_rust_code.rs 

# Use clang/lld to link everything, including the LTO step
clang -fuse-ld=lld -flto=thin -O2 -L. -lmy_rust_code my_c_code.o

Thanks! I think we don't need an RFC for this -- it seems like a feature addition that is quite limited. However, we should go through the usual FCP on a tracking issue (e.g., here) and cc stakeholders (not sure who, specifically, though). I think the description you give already does this, but I'd also like to be careful to not stabilize anything LLVM specific or generally dependent on a non "standard" linker feature. But it looks like ld/gold/lld support this in some fashion so I'm happy with this!

Well, it is LLVM specific, which is an interesting point. The linker plugin mechanism is pretty universal for modern Unix linkers, but the compilers and linker plugins in question all have to be LLVM-based. We might want to take this into account somehow. In my opinion, it's not a problem to stabilize something that is only available with the LLVM backend (like we already have with --emit llvm-ir) but the name of the flag should maybe reflect that.

If it's LLVM specific then renaming it seems good -- I understood it as something that is currently LLVM specific, but in theory the underlying format could be used by others, e.g. cranelift, though perhaps with a different IR format. I think it's reasonable to rename the flag to -C llvm-cross-lang-lto or something along those lines -- seems low-cost for users and makes the LLVM dependency explicit for us.

AIUI, cross language LTO is not particularly "cross language" specific. So why should that appear in the flag name?

> AIUI, cross language LTO is not particularly "cross language" specific. So why should that appear in the flag name?

It's the only form of LTO that allows for crossing language boundaries. But I agree, it's not specific to multi-language scenarios. Maybe something like -C llvm-linker-plugin-LTO?

Obviously this is inferior to having the Rust compiler provide a solution, but I played around with what was available so far on the nightly branch and wasn't very satisfied with what I found, so I thought I'd share the workaround I came up with in case it's of interest to anyone else out there who is trying to link C/C++ and Rust code together with clang. It's a simple script that takes a .rlib file and emits a .a file containing the LLVM bitcode of the library in a format suitable for passing to clang -flto.

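#!/bin/bash
# usage: rust-lto-munge <file.rlib>
# Unpacks the .rlib in a temporary directory and repacks its bitcode as a .a archive.
dir="$(mktemp -d)"
trap 'rm -rf "$dir"' INT TERM EXIT
archive="$(realpath -m "$1")"
cd "$dir" || exit 1
ar x "$archive"
# Drop the machine-code objects; only the compressed bitcode members are needed.
rm ./*.o
for file in *.bc.z; do
  # u32 at offset 15 (host byte order): length of the identifier that follows.
  len=$(od -An -t u4 -j 15 -N4 "$file")
  # u64 right after the identifier: length of the compressed bitcode.
  blen=$(od -An -t u8 -j $((len+19)) -N8 "$file")
  tail -c+$((len+28)) "$file" | head -c $((blen)) > "$file.bc.gz"
  # Prepend a minimal gzip header so that gzip can inflate the raw deflate stream.
  printf "\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00" | cat - "$file.bc.gz" | gzip -dc > "${file%.bc.z}.o"
done
rm ./*.bc.z
rm ./*.gz
llvm-ar-6.0 rs "${archive%.rlib}.a" ./*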

This code is released under the UIUC license: https://en.wikipedia.org/wiki/University_of_Illinois/NCSA_Open_Source_License
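
The offsets used by the script imply a layout for each .bc.z member: a 15-byte header, a u32 identifier length, the identifier itself, a u64 payload length, and then the compressed bitcode. The sketch below locates that payload in Rust without decompressing it. It is based only on what the script reads, assumes little-endian integers (the script relies on the host byte order of od), and uses a made-up function name.

```rust
use std::convert::{TryFrom, TryInto};

/// Returns the compressed bitcode payload of a .bc.z member, or `None` if the
/// buffer is too short for the layout the script above assumes.
fn compressed_payload(data: &[u8]) -> Option<&[u8]> {
    // The script skips 15 bytes before reading the identifier length.
    const HEADER_LEN: usize = 15;

    let id_len_bytes: [u8; 4] = data.get(HEADER_LEN..HEADER_LEN + 4)?.try_into().ok()?;
    let id_len = u32::from_le_bytes(id_len_bytes) as usize;

    // The payload length sits right after the identifier (offset len+19 in the script).
    let len_off = (HEADER_LEN + 4).checked_add(id_len)?;
    let len_end = len_off.checked_add(8)?;
    let len_bytes: [u8; 8] = data.get(len_off..len_end)?.try_into().ok()?;
    let payload_len = usize::try_from(u64::from_le_bytes(len_bytes)).ok()?;

    // The payload starts at offset len+27, i.e. `tail -c+$((len+28))` in the script.
    data.get(len_end..len_end.checked_add(payload_len)?)
}
```

Turning the returned slice into plain bitcode still requires a deflate decompressor, which the standard library does not provide; the script delegates that step to gzip.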

@dwightguth, can you elaborate on what you found lacking with what nightly provides at the moment?

Well, I just couldn't get it to work with Cargo. No matter what I tried, it either didn't pass the -Z flag to rustc, or it crashed because it couldn't link the binaries that my library depended on. Ultimately I think I needed some dependencies to be built with cross-lang-lto and others without, but since there's no way to specify rust flags on a per crate level, I was stuck.

EDIT: also, rust nightly uses llvm 8 and since it has not in fact been released yet, I didn't want to upgrade.

I'm going to raise another issue here since I ran into it trying to get Rust and C++ LTO working together, and I want you to be aware of it because it might impact this. When I try to extract the bitcode from the standard library (e.g. crate std), the extracted bitcode apparently does not verify with the system LLVM. This occurs even if the LLVM version used by rustc and the LLVM version used to verify the bitcode seem to match, and it also occurs on the latest stable rustc. However, if I use the version of lld present in the Rust distribution, it works, suggesting that the problem has to do with the patches that Rust added to LLVM. Is it possible that rustc's LLVM has been patched in ways that change the structure of LLVM bitcode? If so, it seems unlikely that you will be able to take full advantage of LTO when compiling multiple languages together unless you distribute a compatible version of LLVM yourself.

Yeah, these issues are probably both related to Rust's and Clang's LLVM version not being compatible. Unless you use the same version for both, things will likely fail. At some point we thought that LLD would be able to handle older versions of bitcode so that everything would be fine as long as your LLD is at least as new as the LLVM of the two compilers -- but that doesn't always seem to be true either. As a consequence one has to be rather careful to use the right compiler versions (which can often only be obtained by building from source). In practice this will probably mean that this optimization will only be usable in niche cases.

This actually happens when both Rust and Clang are supposedly exactly LLVM 6.0. I suspect the issue has to do with rust's non-upstreamed llvm patches, but I don't know more than that.

Is there an example project somewhere showing how to build a C library using cargo such that it can be inlined into Rust? I've been trying to get this enabled in the jemalloc-sys and sleef-sys crates without any luck.

This has been stabilized in #58057. Closing.
