I made a mistake in some code which essentially made everything panic unconditionally. Which is fine. The problem is that in exactly one instance of the same kind of code, rustc actually failed to remove the non-panicking branch. Note this /might/ be a duplicate of #48253. At least, similarly to #48253, it appears to be a regression in 1.24.0, according to godbolt.
A copy/pastable-to-godbolt and reduced version of the mistaken code is:
```rust
#[allow(non_camel_case_types)]
pub enum c_void {
    // Two dummy variants so the #[repr] attribute can be used.
    #[doc(hidden)]
    __variant1,
    #[doc(hidden)]
    __variant2,
}

pub extern "C" fn malloc(size: usize) -> *mut c_void {
    FUNCS.malloc.unwrap()(size)
}

pub extern "C" fn calloc(nmemb: usize, size: usize) -> *mut c_void {
    FUNCS.calloc.unwrap()(nmemb, size)
}

pub extern "C" fn realloc(ptr: *mut c_void, size: usize) -> *mut c_void {
    FUNCS.realloc.unwrap()(ptr, size)
}

pub struct MallocTable {
    malloc: Option<extern "C" fn(usize) -> *mut c_void>,
    calloc: Option<extern "C" fn(usize, usize) -> *mut c_void>,
    realloc: Option<extern "C" fn(*mut c_void, usize) -> *mut c_void>,
}

pub static FUNCS: MallocTable = MallocTable {
    malloc: None,
    calloc: None,
    realloc: None,
};
```
This generates the following assembly (skipping the data fields):
```asm
example::malloc:
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [rip + example::FUNCS@GOTPCREL]
        mov     rax, qword ptr [rax]
        test    rax, rax
        je      .LBB0_1
        call    rax
        pop     rbp
        ret
.LBB0_1:
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2

example::calloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2

example::realloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2
```
Note how calloc and realloc are properly reduced to a direct panic, because FUNCS is immutable and initialized with all-None function pointers, so unwrap() always panics; the same doesn't happen for malloc. Interestingly, removing realloc makes malloc optimize the same way as the others.
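The property the optimizer is relying on can be seen in miniature. This is a hypothetical sketch, not the original code: with an immutable static holding None, the unwrap() can never succeed, so the branch that calls the function pointer is dead and only the panic path should remain.

```rust
// Hypothetical reduced sketch: F is immutable and None, so the unwrap below
// can never succeed, and the compiler is entitled to reduce call_f to an
// unconditional panic.
static F: Option<fn() -> i32> = None;

fn call_f() -> i32 {
    F.unwrap()() // always panics
}

fn main() {
    // Demonstrate that the call always takes the panic path.
    let result = std::panic::catch_unwind(call_f);
    assert!(result.is_err());
}
```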
The code generated by 1.23.0 was:
```asm
example::malloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2

example::calloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2

example::realloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2
```
triage: P-medium
Seems curious and worthy of fixing, but not a P-high problem. @glandium do you believe this optimization (or lack of it) is affecting other things?
> @glandium do you believe this optimization (or lack of it) is affecting other things?
I think it could be the same underlying regression as #48253 but it's just a gut feeling, not backed by evidence. I guess I could take a stab at finding a regression range for both.
Okay, so I confirmed both this and #48253 have the same regression range, which is the last days before 1.23 made it to beta. 1.23.0.beta.2 exhibits both regressions, and stable-1.23 doesn't.
Going through the list of changes between the beginning of December, when beta.2 was released, and the 1.23.0 tag in the git repository points to a very likely cause for the regressions having been fixed in 1.23: 4979bcc8bd424df27f6c99af6907896a17c5be0e, which reverted the layout refactor (f50fd075c2555d8511ccee8a7fe7aee3f2c45e14) that had apparently landed right before 1.23 branched to beta.
So this would be a regression from #45225, and #48253 is a dupe.
So, to see whether this had an impact on actual code, I tried with some work-in-progress code to replace https://dxr.mozilla.org/mozilla-central/source/memory/replace/logalloc/ which happened to be much slower when compiled with 1.24 compared to 1.23 (670ms vs 550ms to process 1.3M lines).
So I tried the nightly before and after #45225, and... they both were fast. Which is good news of a sort: despite it definitely affecting codegen as demonstrated, it didn't seem to have a real impact, at least not on my code.
However, tracking it down, I identified the regression came from #46623, which is related. That one seems to have made std::fmt::Write functions slower, as well as things related to Option and Result, and probably more. I'm not sure whether I should open a separate issue.
@glandium wow, nice detective work!
@eddyb any clue on why #45225 or #46623 might have this effect?
My initial guess is that LLVM is very finicky and sensitive to our exact enum representation.
However, in the examples, this doesn't look like a dead code elimination bug, but rather failing to fold a load from a constant, which seems even more suspicious?
Can we get LLVM IR for the relevant example & version combinations?
Assembly doesn't tell much of a story, while the LLVM IR probably has an obviously foldable load.
LLVM IR for both the testcase here and the one in #48253, with nightly-2017-11-20:
https://gist.github.com/glandium/a73d38ae357154b194ec5eb9a4354730
with nightly-2017-11-21:
https://gist.github.com/glandium/7f0212770208d3a4105a7e275a10df2d
Those testcases don't seem to have been affected by #46623, so I'll try to reduce something from my code that got 20% slower.
At least, the LLVM IR is clear: it's not LLVM's doing.
Here's a reduced example that is not affected by #45225 but is by #46623:
```rust
#[inline(never)]
fn parse_hex<'a>(input: &'a [u8]) -> Option<usize> {
    let mut s = input;
    let mut result: Option<usize> = None;
    loop {
        if let Some((&c, remainder)) = s.split_first() {
            let d = match c {
                d @ b'0'...b'9' => d - b'0',
                d @ b'a'...b'f' => d + 10 - b'a',
                d @ b'A'...b'F' => d + 10 - b'A',
                _ => break,
            };
            result = Some(match result {
                Some(r) => r.checked_mul(16usize)
                    .and_then(|r| r.checked_add(d as usize))?,
                None => d as usize,
            });
            s = remainder;
        } else {
            break;
        }
    }
    result
}

fn main() {
    for _ in 1..10000000 {
        match parse_hex("85af342b1".as_bytes()) {
            Some(_) => continue,
            _ => break,
        }
    }
}
```
150ms with nightly-2017-12-15 vs 260ms with nightly-2017-12-16. Note that the inline(never) and the stupid loop in main are there to ensure the compiler doesn't eliminate dead code, and to reduce the IR size. Using a loop that generates random or sequential numbers to parse, with no inline(never) shows a perf difference too.
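As a sanity check on what the testcase computes, parse_hex should agree with the standard library's usize::from_str_radix on valid hex input. Here is a minimal standalone comparison; the copy of parse_hex is taken verbatim from the testcase, with the deprecated `...` ranges updated to the modern `..=` spelling so it compiles on current compilers.

```rust
// parse_hex copied from the reduced testcase (ranges modernized to `..=`).
// It parses leading hex digits and stops at the first non-hex byte.
fn parse_hex(input: &[u8]) -> Option<usize> {
    let mut s = input;
    let mut result: Option<usize> = None;
    loop {
        if let Some((&c, remainder)) = s.split_first() {
            let d = match c {
                d @ b'0'..=b'9' => d - b'0',
                d @ b'a'..=b'f' => d + 10 - b'a',
                d @ b'A'..=b'F' => d + 10 - b'A',
                _ => break,
            };
            result = Some(match result {
                Some(r) => r.checked_mul(16)
                    .and_then(|r| r.checked_add(d as usize))?,
                None => d as usize,
            });
            s = remainder;
        } else {
            break;
        }
    }
    result
}

fn main() {
    // Should match the standard library on fully valid input.
    let parsed = parse_hex(b"85af342b1");
    let expected = usize::from_str_radix("85af342b1", 16).ok();
    assert_eq!(parsed, expected);
    // A non-hex byte ends the parse rather than failing it.
    assert_eq!(parse_hex(b"ffz"), Some(0xff));
    // No leading hex digit at all yields None.
    assert_eq!(parse_hex(b"z"), None);
}
```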
LLVM IR with nightly-2017-12-15:
https://gist.github.com/glandium/69008d47f7b2e3ecec06a082cd3b2dc3
with nightly-2017-12-16:
https://gist.github.com/glandium/b7bd384139b6a58fb68c03bc420a566f
And just in case, the LLVM IR with rustc built from the commit just before the merge of #46623, which is only trivially different from that of nightly-2017-12-15:
https://gist.github.com/glandium/45b90fe55e546edec1b5a9a452aba0aa
AFAICT, nightly-2017-12-16 was built from the merge of #46623, not a subsequent commit.
> LLVM IR for both the testcase here and the one ...
Thank you! My suspicion was correct:
```llvm
@_ZN4test5FUNCS17h4349eb3b1be595e8E = local_unnamed_addr constant { [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] } { [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef }, align 8
; ...
%0 = load i8*, i8** bitcast ({ [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] }* @_ZN4test5FUNCS17h4349eb3b1be595e8E to i8**), align 8
%1 = icmp eq i8* %0, null
```
LLVM should constant-fold these two and know that %0 is i8* null and %1 is i1 true.
I don't really know how it could fail such a trivial task (then again, their representation of constant data is a much poorer fit for this, than, say, miri's). cc @rkruppe
After a quick skim of the constant folding helper functions I can't see what causes this, but I have minimized the issue: the problem appears to be the [0 x i8] fields and/or the non-trivial initializer. Either dropping the [0 x i8] before the field being loaded or using zeroinitializer makes -constprop effective.
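A hand-written sketch of the shape just described (not the actual minimized testcase): a constant aggregate with a zero-sized [0 x i8] field before the pointer being loaded, and an explicit rather than zeroinitializer initializer. The names here are made up for illustration.

```llvm
; Hypothetical reduction: -constprop failed to fold the load from @g as
; written; dropping the [0 x i8] field, or writing the initializer as
; zeroinitializer, made the fold succeed.
@g = constant { [0 x i8], {}* } { [0 x i8] undef, {}* null }, align 8

define i1 @check() {
  %p = load i8*, i8** bitcast ({ [0 x i8], {}* }* @g to i8**), align 8
  %r = icmp eq i8* %p, null ; should constant-fold to true
  ret i1 %r
}
```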
I've submitted https://reviews.llvm.org/D55169 to fix this issue.
Patch landed upstream.
The patch has landed in Rust's LLVM.
The original issue here is fixed, but LLVM is still doing something slightly weird here with the four consecutive ud2 instructions. IR for reference: https://rust.godbolt.org/z/5vRhI1
This is long since resolved, and the redundant ud2s are no longer present either.