I made a mistake in some code which essentially made everything panic unconditionally. Which is fine. The problem is that in exactly one instance of the same kind of code, rustc actually failed to remove the non-panicking branch. Note this /might/ be a duplicate of #48253. At least, similarly to #48253, it appears to be a regression in 1.24.0, according to godbolt.
A copy/pastable-to-godbolt and reduced version of the mistaken code is:
```rust
#[allow(non_camel_case_types)]
pub enum c_void {
    // Two dummy variants so the #[repr] attribute can be used.
    #[doc(hidden)]
    __variant1,
    #[doc(hidden)]
    __variant2,
}

pub extern "C" fn malloc(size: usize) -> *mut c_void {
    FUNCS.malloc.unwrap()(size)
}

pub extern "C" fn calloc(nmemb: usize, size: usize) -> *mut c_void {
    FUNCS.calloc.unwrap()(nmemb, size)
}

pub extern "C" fn realloc(ptr: *mut c_void, size: usize) -> *mut c_void {
    FUNCS.realloc.unwrap()(ptr, size)
}

pub struct MallocTable {
    malloc: Option<extern "C" fn(usize) -> *mut c_void>,
    calloc: Option<extern "C" fn(usize, usize) -> *mut c_void>,
    realloc: Option<extern "C" fn(*mut c_void, usize) -> *mut c_void>,
}

pub static FUNCS: MallocTable = MallocTable {
    malloc: None,
    calloc: None,
    realloc: None,
};
```
This generates the following assembly (skipping the data fields):
```asm
example::malloc:
        push    rbp
        mov     rbp, rsp
        mov     rax, qword ptr [rip + example::FUNCS@GOTPCREL]
        mov     rax, qword ptr [rax]
        test    rax, rax
        je      .LBB0_1
        call    rax
        pop     rbp
        ret
.LBB0_1:
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2

example::calloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2

example::realloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + .Lref.2]
        call    core::panicking::panic@PLT
        ud2
        ud2
        ud2
```
Note how calloc and realloc are properly reduced to a direct panic, because FUNCS is immutable and initialized with all-None function pointers, so unwrap() always panics; the same doesn't happen for malloc. Interestingly, removing realloc makes malloc optimize the same way as the others.
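The property the optimizer is relying on can be seen in miniature. This is a hypothetical sketch, not the original code: with an immutable static holding None, the unwrap() can never succeed, so the branch that calls the function pointer is dead and only the panic path should remain.

```rust
// Hypothetical reduced sketch: F is immutable and None, so the unwrap below
// can never succeed, and the compiler is entitled to reduce call_f to an
// unconditional panic.
static F: Option<fn() -> i32> = None;

fn call_f() -> i32 {
    F.unwrap()() // always panics
}

fn main() {
    // Demonstrate that the call always takes the panic path.
    let result = std::panic::catch_unwind(call_f);
    assert!(result.is_err());
}
```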
The code generated by 1.23.0 was:
```asm
example::malloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2

example::calloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2

example::realloc:
        push    rbp
        mov     rbp, rsp
        lea     rdi, [rip + ref.2]
        call    core::panicking::panic@PLT
        ud2
```
triage: P-medium
Seems curious and worthy of fixing, but not a P-high problem. @glandium do you believe this optimization (or lack of it) is affecting other things?
> @glandium do you believe this optimization (or lack of it) is affecting other things?
I think it could be the same underlying regression as #48253 but it's just a gut feeling, not backed by evidence. I guess I could take a stab at finding a regression range for both.
Okay, so I confirmed both this and #48253 have the same regression range, which is the last days before 1.23 made it to beta. 1.23.0.beta.2 exhibits both regressions, and stable-1.23 doesn't.
Going through the list of changes between the beginning of December, when beta.2 was released, and the 1.23.0 tag in the git repository points to a very likely cause for the regressions having been fixed in 1.23: 4979bcc8bd424df27f6c99af6907896a17c5be0e, which reverted the layout refactor (f50fd075c2555d8511ccee8a7fe7aee3f2c45e14) that had apparently landed right before 1.23 branched to beta.
So this would be a regression from #45225, and #48253 is a dupe.
So, to see whether this had an impact on actual code, I tried with some work-in-progress code to replace https://dxr.mozilla.org/mozilla-central/source/memory/replace/logalloc/ which happened to be much slower when compiled with 1.24 compared to 1.23 (670ms vs 550ms to process 1.3M lines).
So I tried the nightly before and after #45225, and... they both were fast. Which is good news of a sort: despite it definitely affecting codegen as demonstrated, it didn't seem to have a real impact, at least not on my code.
However, tracking it down, I identified the regression came from #46623, which is related. That one seems to have made std::fmt::Write functions slower, as well as things related to Option and Result, and probably more. I'm not sure whether I should open a separate issue.
@glandium wow, nice detective work!
@eddyb any clue on why #45225 or #46623 might have this effect?
My initial guess is that LLVM is very finicky and sensitive to our exact enum representation.
However, in the examples, this doesn't look like a dead code elimination bug, but rather failing to fold a load from a constant, which seems even more suspicious?
Can we get LLVM IR for the relevant example & version combinations?
Assembly doesn't tell much of a story, while the LLVM IR probably has an obviously foldable load.
LLVM IR for both the testcase here and the one in #48253, with nightly-2017-11-20:
https://gist.github.com/glandium/a73d38ae357154b194ec5eb9a4354730
with nightly-2017-11-21:
https://gist.github.com/glandium/7f0212770208d3a4105a7e275a10df2d
Those testcases don't seem to have been affected by #46623, so I'll try to reduce something from my code that got 20% slower.
At least, the LLVM IR is clear: it's not LLVM's doing.
Here's a reduced example that is not affected by #45225 but is by #46623:
```rust
#[inline(never)]
fn parse_hex<'a>(input: &'a [u8]) -> Option<usize> {
    let mut s = input;
    let mut result: Option<usize> = None;
    loop {
        if let Some((&c, remainder)) = s.split_first() {
            let d = match c {
                d @ b'0'...b'9' => d - b'0',
                d @ b'a'...b'f' => d + 10 - b'a',
                d @ b'A'...b'F' => d + 10 - b'A',
                _ => break,
            };
            result = Some(match result {
                Some(r) => r.checked_mul(16usize)
                    .and_then(|r| r.checked_add(d as usize))?,
                None => d as usize,
            });
            s = remainder;
        } else {
            break;
        }
    }
    result
}

fn main() {
    for _ in 1..10000000 {
        match parse_hex("85af342b1".as_bytes()) {
            Some(_) => continue,
            _ => break,
        }
    }
}
```
150ms with nightly-2017-12-15 vs 260ms with nightly-2017-12-16. Note that the inline(never) and the stupid loop in main are there to ensure the compiler doesn't eliminate dead code, and to reduce the IR size. Using a loop that generates random or sequential numbers to parse, with no inline(never) shows a perf difference too.
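As a sanity check on what the testcase computes, parse_hex should agree with the standard library's usize::from_str_radix on valid hex input. Here is a minimal standalone comparison; the copy of parse_hex is taken verbatim from the testcase, with the deprecated `...` ranges updated to the modern `..=` spelling so it compiles on current compilers.

```rust
// parse_hex copied from the reduced testcase (ranges modernized to `..=`).
// It parses leading hex digits and stops at the first non-hex byte.
fn parse_hex(input: &[u8]) -> Option<usize> {
    let mut s = input;
    let mut result: Option<usize> = None;
    loop {
        if let Some((&c, remainder)) = s.split_first() {
            let d = match c {
                d @ b'0'..=b'9' => d - b'0',
                d @ b'a'..=b'f' => d + 10 - b'a',
                d @ b'A'..=b'F' => d + 10 - b'A',
                _ => break,
            };
            result = Some(match result {
                Some(r) => r.checked_mul(16)
                    .and_then(|r| r.checked_add(d as usize))?,
                None => d as usize,
            });
            s = remainder;
        } else {
            break;
        }
    }
    result
}

fn main() {
    // Should match the standard library on fully valid input.
    let parsed = parse_hex(b"85af342b1");
    let expected = usize::from_str_radix("85af342b1", 16).ok();
    assert_eq!(parsed, expected);
    // A non-hex byte ends the parse rather than failing it.
    assert_eq!(parse_hex(b"ffz"), Some(0xff));
    // No leading hex digit at all yields None.
    assert_eq!(parse_hex(b"z"), None);
}
```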
LLVM IR with nightly-2017-12-15:
https://gist.github.com/glandium/69008d47f7b2e3ecec06a082cd3b2dc3
with nightly-2017-12-16:
https://gist.github.com/glandium/b7bd384139b6a58fb68c03bc420a566f
And just in case, the LLVM IR with rustc built from the commit just before the merge of #46623, which is only trivially different from that of nightly-2017-12-15:
https://gist.github.com/glandium/45b90fe55e546edec1b5a9a452aba0aa
AFAICT, nightly-2017-12-16 was built from the merge of #46623, not a subsequent commit.
> LLVM IR for both the testcase here and the one ...
Thank you! My suspicion was correct:
```llvm
@_ZN4test5FUNCS17h4349eb3b1be595e8E = local_unnamed_addr constant { [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] } { [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef, {}* null, [0 x i8] undef }, align 8
; ...
%0 = load i8*, i8** bitcast ({ [0 x i8], {}*, [0 x i8], {}*, [0 x i8], {}*, [0 x i8] }* @_ZN4test5FUNCS17h4349eb3b1be595e8E to i8**), align 8
%1 = icmp eq i8* %0, null
```
LLVM should constant-fold these two and know that %0 is i8* null and %1 is i1 true.
I don't really know how it could fail such a trivial task (then again, their representation of constant data is a much poorer fit for this, than, say, miri's). cc @rkruppe
After a quick skim of the constant folding helper functions I can't see what causes this, but I have minimized the issue: the problem appears to be the [0 x i8] fields and/or the non-trivial initializer. Either dropping the [0 x i8] before the field being loaded or using zeroinitializer makes -constprop effective.
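A hand-written sketch of the shape just described (not the actual minimized testcase): a constant aggregate with a zero-sized [0 x i8] field before the pointer being loaded, and an explicit rather than zeroinitializer initializer. The names here are made up for illustration.

```llvm
; Hypothetical reduction: -constprop failed to fold the load from @g as
; written; dropping the [0 x i8] field, or writing the initializer as
; zeroinitializer, made the fold succeed.
@g = constant { [0 x i8], {}* } { [0 x i8] undef, {}* null }, align 8

define i1 @check() {
  %p = load i8*, i8** bitcast ({ [0 x i8], {}* }* @g to i8**), align 8
  %r = icmp eq i8* %p, null ; should constant-fold to true
  ret i1 %r
}
```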
I've submitted https://reviews.llvm.org/D55169 to fix this issue.
Patch landed upstream.
The patch has landed in Rust's LLVM.
The original issue here is fixed, but LLVM is still doing something slightly weird here with the four consecutive ud2 instructions. IR for reference: https://rust.godbolt.org/z/5vRhI1
This is long since resolved, and the redundant ud2s are no longer present either.