Since ThinLTO became default in Rust my bootloader went from 17 KiB to about 42 KiB in size, which puts it over the 32 KiB limit. To get my bootloader to fit in this limit I've been able to strip out all of core::fmt by not having any parameters to my panic_fmt routine and this allows LTO to effectively delete all construction of the core::fmt parameters to panic_fmt and thus save huge amounts of code.
Since ThinLTO became default it seems that certain functions using volatile memory or inline assembly struggle to be optimized fully, however this is all I've found that replicates the issue, perhaps more common use of Rust could also trigger it.
For the tiny test case (included at the end of this issue) historically (from at least 2016 to November 2017) produced a .text section of ~100 bytes, and a total virtual program size of ~500 bytes. Since ThinLTO became default this tiny program grows to ~2400 bytes of .text and ~3200 bytes overall.
While this 2 KiB increase seems small, keep in mind it is for an application that is trivial and only has one thing using core::fmt (the bounds check panic on the array index). When a program grows and more panic come about these changes to be a much larger increase, in my case over 20 KiB.
While I'm writing an RFC to address this issue more permanently by proposing an optional (developer opt-in) panic routine which does not rely on core::fmt (probably just a &'static str for the filename and u32 for the line for bare bones debugging where nothing more can be afforded), I wanted to open this issue just in case it's a more fundamental optimization issue that could be fixed.
The core::intrinsics::abort() is critical in the panic_fmt for the code size increase to occur. Other things that work here as well are statements like asm!("nop") or really any assembly statement. Further, assembly statements that are included in functions that are called from panic_fmt also cause the optimization problem to occur. This makes workarounds impossible in my case as my panic dumps to a serial port which ultimately requires inline assembly to access. So for my bootloader I have resorted to using an old version of Rust.
What I find particularly strange is that while I've defined my panic_fmt to not take any parameters this doesn't seem to be a big enough hint to Rust/LLVM that they are not used and thus all construction of these parameters can be DCE'd.
For a size listing of this test program using all nightly builds of Rust in the past 1.5 years see: https://gist.github.com/gamozolabs/d72c05e2b200f02397f1495cf33c2674
We can see from this that the size increase occured between rustc 1.24.0-nightly (f9b0897c5 2017-12-02) and rustc 1.24.0-nightly (1956d5535 2017-12-03) which happens to be when ThinLTO went default.
I've tried a few different tests using various panic_fmt declarations, with parameters, without, as extern, as pub, etc, and they seem to all be affected by this issue.
Here's an example basic piece of code to be built with nightly-x86_64-pc-windows-msvc using args -C lto -C panic=abort -O:
Here is the disassembly difference between this same code but one with ThinLTO and one with full LTO: https://gist.github.com/gamozolabs/41b2059a55d5e0b60d7251dd8a31c055
#![no_std]
#![feature(lang_items, start, core_intrinsics)]
#[lang = "panic_fmt"]
#[no_mangle]
pub extern fn rust_begin_panic(_msg: core::fmt::Arguments,
_file: &'static str,
_line: u32,
_column: u32) -> ! {
unsafe { core::intrinsics::abort() }
}
// These functions are used by the compiler, but not
// for a bare-bones hello world. These are normally
// provided by libstd.
#[lang = "eh_personality"]
#[no_mangle]
pub extern fn rust_eh_personality() {
}
// This function may be needed based on the compilation target.
#[lang = "eh_unwind_resume"]
#[no_mangle]
pub extern fn rust_eh_unwind_resume() {
}
#[start]
#[no_mangle]
#[allow(non_snake_case)]
pub fn mainCRTStartup(argc: isize, _argv: *const *const u8) -> isize
{
let foo = [1, 2, 3, 4];
foo[argc as usize]
}
Further testing shows that building with -Z thinlto=off reverts to using old full LTO which on recent builds does behave as expected (deletes construction of core::fmt args to panic_fmt). This is now my current workaround and I'm no longer using an old version of Rust for my bootloader. My concern is in the future this may be unsupported, but I have no idea.
~However I still think it's worth looking into. I dumped the LLVM IR --emit llvm-ir and with thinlto=on and thinlto=off it produces identical IR. This leads me to believe that the issue lies more in LLVM than Rust, however I'm not familiar enough with the internals of Rust to conclude this. Further is this something we could optimize at MIR level as a Rust-specific optimization pass?~
Edit: Disregard above statement --emit llvm-ir seems to disable ThinLTO so I cannot use it to accurately determine if the issue is with Rust or LLVM.
-B
Thanks for the report @gamozolabs!
This sounds like it does indeed get blamed to ThinLTO! The previous form of LTO threw everything into one giant module which allows LLVM to see as much as possible and optimize as aggressively as possible. Typically this isn't necessary for applications, but as you're seeing there's still some that require the old form of LTO!
I've tagged this as a regression from stable to beta for now to ensure we can deal with this before the next release. There's a few things going on here of note.
-C lto, has switched to using ThinLTO by default rather than the previous "throw everything in a module" (what I'll call fat LTO) by default. This I believe is the source of your regression if you were previously compiling with LTO.-Z thinlto=no, but as the name implies it's not a stable option. Additionally there's no Carog.toml option for specifying ThinLTO on or off.So given that I think there's two primary questions to consider when evaluating this:
-C lto isn't passed (just amongst the codegen units we're producing) however though.-Z thinlto and add corresponding options to Cargo.toml?Curious what others from @rust-lang/compiler might think.
We could make -C lto take an additional parameter, so something like -C lto=auto (current behavior, default), -C lto=fat (old, "fat" LTO), -C lto=thin (current behavior again). We could document auto as not having a stable meaning, and expose these same options in Cargo's lto option.
I don't feel like the distinction should matter in general, but rather this is more of a LLVM limitation.
OTOH, we already have LLVM-specific options in -C.
Aside: It may be possible to improve ThinLTO here. My theory is that panic_fmt is not internalized because there are multiple modules, and thus deadargelim doesn't fire. Other interprocedural optimizations have similar problem and in those cases adding data to the module summary and teaching those optimizations about that has helped. For deadargelim, you probably can't eliminate the arguments (because you can't update all the callers unilaterally), but you could store the fact that an argument is unused in the module summary and callers who are aware of this can pass undef instead.
I think we should provide a way to get full LTO on stable/beta as it still provides better runtime performance overall.
@Mark-Simulacrum your suggestion sounds great to me! We could do what we did with debuginfo and accept lto = 'fat' in Cargo.toml as well as lto = true.
I like @Mark-Simulacrum's suggestion. I guess an open question is whether the default should be LTO or not -- but I think we can probably change the default to LTO if we think that's better. I'd be curious to hear from @gamozolabs if they agree.
Ok I've opened https://github.com/rust-lang/rust/pull/47521 to implement @Mark-Simulacrum's suggestion, and if we decide to merge I can follow up with a Cargo.toml option as well soon after merging.
I'm sort of ambivalent about which form of LTO should be the default when specifying only -C lto (either fat or thin LTO), although I'd slightly prefer ThinLTO as it should have all the same the-artifact's-runtime-is-faster benefits with a huge benefit on compile times. The downside of ThinLTO, as seen here though, is that it doesn't have the same code-size benefits.
I'm fine with defaulting to ThinLTO as long as full LTO is available on stable.
triage: P-high
This should get into beta, but there is a pending PR #47521
I think this is closed now with https://github.com/rust-lang/rust/pull/47548 merged
Most helpful comment
I think we should provide a way to get full LTO on stable/beta as it still provides better runtime performance overall.