Rust: LLVM failed to optimize out the empty loop `for _ in 0 .. 100 {}`

Created on 6 Apr 2017  路  10Comments  路  Source: rust-lang/rust

Test case:

fn main() {
    for _ in 0 .. 100 {}
}

Run with:

$ rustc -O -C remark=all --emit asm --emit llvm-ir 1.rs

Expected: The loop is elided
Actual result: The loop still remains, in both LLVM-IR and ASM output.

The loop is elided when the upper limit is 99. -C opt-level=2 and 3 make no difference.


Compiler output, showing LLVM failed to vectorize the loop because it could not determine number of loop iterations:

warning: -C remark will not show source locations without --debuginfo

note: optimization analysis for inline at [unknown]: _ZN38_$LT$i32$u20$as$u20$core..ops..Add$GT$3add17ha3e8e1701e046ad4E can be inlined into _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE with cost=-15000 (threshold=487)

note: optimization remark for inline at [unknown]: _ZN38_$LT$i32$u20$as$u20$core..ops..Add$GT$3add17ha3e8e1701e046ad4E inlined into _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE

note: optimization analysis for inline at [unknown]: _ZN4core3cmp5impls55_$LT$impl$u20$core..cmp..PartialOrd$u20$for$u20$i32$GT$2lt17he1f17f7a171928f2E can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-14995 (threshold=487)

note: optimization remark for inline at [unknown]: _ZN4core3cmp5impls55_$LT$impl$u20$core..cmp..PartialOrd$u20$for$u20$i32$GT$2lt17he1f17f7a171928f2E inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E

note: optimization analysis for inline at [unknown]: _ZN4core3mem4swap17h6aaf5756849aed58E can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-15000 (threshold=487)

note: optimization remark for inline at [unknown]: _ZN4core3mem4swap17h6aaf5756849aed58E inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E

note: optimization analysis for inline at [unknown]: _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-14995 (threshold=487)

note: optimization remark for inline at [unknown]: _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E

note: optimization analysis for inline at [unknown]: _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E can be inlined into _ZN2_14main17h6f9533fc3f344a87E with cost=-14980 (threshold=325)

note: optimization remark for inline at [unknown]: _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E inlined into _ZN2_14main17h6f9533fc3f344a87E

note: optimization remark for tailcallelim at [unknown]: marked this call a tail call candidate

note: optimization analysis for loop-vectorize at [unknown]: loop not vectorized: could not determine number of loop iterations

note: optimization missed for loop-vectorize at [unknown]: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info

Versions:

$ rustc -Vv
rustc 1.18.0-nightly (2564711e8 2017-04-04)
binary: rustc
commit-hash: 2564711e803f62e04bebf10408cc1c11297c0caf
commit-date: 2017-04-04
host: x86_64-apple-darwin
release: 1.18.0-nightly
LLVM version: 3.9

(See discussion on http://stackoverflow.com/a/43234445/224671 for details)

A-LLVM C-bug

Most helpful comment

Interestingly if the type of the range is at least 64-bit on x86_64, the loop is also completely elided.

Loop-vectorization failed if the type is i8, u8, i16, u16, i32, u32. That is, the failure happens only when the integer size is smaller than the native word size.

// Completely fine:
fn main() {
    for _ in 0usize .. 100_000_000usize {}
}

All 10 comments

Interestingly if the type of the range is at least 64-bit on x86_64, the loop is also completely elided.

Loop-vectorization failed if the type is i8, u8, i16, u16, i32, u32. That is, the failure happens only when the integer size is smaller than the native word size.

// Completely fine:
fn main() {
    for _ in 0usize .. 100_000_000usize {}
}

Probably the number 100 comes from the default value of the brute force parameter. Seems it can only recognize the loop iteration count if the type is >= the native word size, and falls back to brute force if its not.

This is fine as well:

#![feature(i128_type)]

fn main() {
    for i in 0..100_000_000_000_000_000u128 {}
}

Been playing around with this - this seems pretty bad and severely degrades my confidence in the ability of the compiler to optimize complex code! I tried desugaring the for loop to get an idea as to what's going on -- the following replicates the problem:

struct Range {
    start: u32,
    end: u32,
}

impl Iterator for Range {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.start < self.end {
            let n = self.start;
            self.start = self.start + 1;
            Some(n)
        } else {
            // self.start = self.start + 1 // uncomment me for speedup
            None
        }
    }
}

fn main() {
    let mut range = Range { start: 0, end: 100 };
    loop {
        match range.next() {
            Some(_) => {},
            None => break,
        }
    }
}

Changing the unused return type of next to () or usize causes the loop to be removed; as does uncommenting the indicated line. Returning a constant value of 0 from the next function does not remove the loop.

The following c code is optimized in a very similar way by LLVM via clang - so does look like an LLVM bug. Not that we can't work around it. For reference - this code optimizes down to nothing in gcc 6.3.0.

struct range {
    int start;
    int end;
};

struct option {
    int ret;
    int tag;
};

struct option next(struct range *range) {
    if (range->start < range->end) {
        int n = range->start;
        range->start++;
        struct option option = {n, 1};
        return option;
    } else {
        struct option option = {0, 0};
        return option;
    }
}

int main() {
    struct range range = {0, 100};
    for(;;) {
        struct option option = next(&range);
        if (option.tag) {
        } else {
            break;
        }
    }
    return 0;
}

Minimal example -

fn main() {
    let mut start = 0;
    let end = 10000;
    loop {
        let tag;
        if start < end {
            start += 1;
            tag = true;
        } else {
            tag = false;
        }
        if !tag { break; }
    }
}
int main() {
    int start = 0;
    int end = 10000;
    int tag = 0;
    do {
        tag = 0;
        if (start < end) start++, tag = 1;
    } while (tag);
    return 0;
}

How about changing the range to start at 1

fn main() {
    for _ in 1 .. 100 {}
}

Below is the result where the loop is removed.

$ rustc main_start_1.rs -C opt-level=2 -o builds/main_start_1

$ objdump -x builds/main_start_1 | grep main
 builds/main_start_1:     file format mach-o-x86-64
 builds/main_start_1
 00000001000008b0 l       0e SECT   01 0000 [.text] __ZN12main_start_14main17hb1ec1b2167cb471eE
 00000001000008c0 g       0f SECT   01 0000 [.text] _main

$ objdump -D --start-address=0x1000008b0 builds/main_start_1  | awk '{print $0} $3~/retq?/{exit}'

 builds/main_start_1:     file format mach-o-x86-64


 Disassembly of section .text:

 00000001000008b0 <__ZN12main_start_14main17hb1ec1b2167cb471eE>:
    1000008b0:   55                      push   %rbp
    1000008b1:   48 89 e5                mov    %rsp,%rbp
    1000008b4:   5d                      pop    %rbp
    1000008b5:   c3                      retq

$ rustc -vV
 rustc 1.18.0-nightly (bbdaad0dc 2017-04-14)
 binary: rustc
 commit-hash: bbdaad0dc8dc64e036ccee817f90a91876b32a9d
 commit-date: 2017-04-14
 host: x86_64-apple-darwin
 release: 1.18.0-nightly
 LLVM version: 3.9

I wonder why the results differ if started at 1 than 0.

@nateozem that's probably because then the loop count is below 100 (below scalar-evolution-max-iterations), and then the optimisation works.

@djzin

Interestingly, in your example either changing the loop condition to != instead of <, or setting self.start to self.end or u32::max_value() optimizes out the loop when end is < than 147 rather than 100.

One possible workaround to this issue could be if for loops started using the new for_each trait, which could be specialized for ranges of primitive numbers to avoid the option value entirely.

@djzin i've filled https://bugs.llvm.org/show_bug.cgi?id=34538 and giving you credit for the code and the mwe, you might want to subscribe to the issue

Fixed in #47828

Was this page helpful?
0 / 5 - 0 ratings