Test case:
fn main() {
for _ in 0 .. 100 {}
}
Run with:
$ rustc -O -C remark=all --emit asm --emit llvm-ir 1.rs
Expected: The loop is elided
Actual result: The loop still remains, in both LLVM-IR and ASM output.
The loop is elided when the upper limit is 99. -C opt-level=2 and 3 make no difference.
Compiler output, showing LLVM failed to vectorize the loop because it could not determine number of loop iterations:
warning: -C remark will not show source locations without --debuginfo
note: optimization analysis for inline at [unknown]: _ZN38_$LT$i32$u20$as$u20$core..ops..Add$GT$3add17ha3e8e1701e046ad4E can be inlined into _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE with cost=-15000 (threshold=487)
note: optimization remark for inline at [unknown]: _ZN38_$LT$i32$u20$as$u20$core..ops..Add$GT$3add17ha3e8e1701e046ad4E inlined into _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE
note: optimization analysis for inline at [unknown]: _ZN4core3cmp5impls55_$LT$impl$u20$core..cmp..PartialOrd$u20$for$u20$i32$GT$2lt17he1f17f7a171928f2E can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-14995 (threshold=487)
note: optimization remark for inline at [unknown]: _ZN4core3cmp5impls55_$LT$impl$u20$core..cmp..PartialOrd$u20$for$u20$i32$GT$2lt17he1f17f7a171928f2E inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E
note: optimization analysis for inline at [unknown]: _ZN4core3mem4swap17h6aaf5756849aed58E can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-15000 (threshold=487)
note: optimization remark for inline at [unknown]: _ZN4core3mem4swap17h6aaf5756849aed58E inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E
note: optimization analysis for inline at [unknown]: _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE can be inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E with cost=-14995 (threshold=487)
note: optimization remark for inline at [unknown]: _ZN47_$LT$i32$u20$as$u20$core..iter..range..Step$GT$7add_one17hea1d4412e959396dE inlined into _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E
note: optimization analysis for inline at [unknown]: _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E can be inlined into _ZN2_14main17h6f9533fc3f344a87E with cost=-14980 (threshold=325)
note: optimization remark for inline at [unknown]: _ZN4core4iter5range86_$LT$impl$u20$core..iter..iterator..Iterator$u20$for$u20$core..ops..Range$LT$A$GT$$GT$4next17h59cb9f32d4a0d028E inlined into _ZN2_14main17h6f9533fc3f344a87E
note: optimization remark for tailcallelim at [unknown]: marked this call a tail call candidate
note: optimization analysis for loop-vectorize at [unknown]: loop not vectorized: could not determine number of loop iterations
note: optimization missed for loop-vectorize at [unknown]: loop not vectorized: use -Rpass-analysis=loop-vectorize for more info
Versions:
$ rustc -Vv
rustc 1.18.0-nightly (2564711e8 2017-04-04)
binary: rustc
commit-hash: 2564711e803f62e04bebf10408cc1c11297c0caf
commit-date: 2017-04-04
host: x86_64-apple-darwin
release: 1.18.0-nightly
LLVM version: 3.9
(See discussion on http://stackoverflow.com/a/43234445/224671 for details)
Interestingly if the type of the range is at least 64-bit on x86_64, the loop is also completely elided.
Loop-vectorization failed if the type is i8, u8, i16, u16, i32, u32. That is, the failure happens only when the integer size is smaller than the native word size.
// Completely fine:
fn main() {
for _ in 0usize .. 100_000_000usize {}
}
Probably the number 100 comes from the default value of the brute force parameter. Seems it can only recognize the loop iteration count if the type is >= the native word size, and falls back to brute force if its not.
This is fine as well:
#![feature(i128_type)]
fn main() {
for i in 0..100_000_000_000_000_000u128 {}
}
Been playing around with this - this seems pretty bad and severely degrades my confidence in the ability of the compiler to optimize complex code! I tried desugaring the for loop to get an idea as to what's going on -- the following replicates the problem:
struct Range {
start: u32,
end: u32,
}
impl Iterator for Range {
type Item = u32;
fn next(&mut self) -> Option<u32> {
if self.start < self.end {
let n = self.start;
self.start = self.start + 1;
Some(n)
} else {
// self.start = self.start + 1 // uncomment me for speedup
None
}
}
}
fn main() {
let mut range = Range { start: 0, end: 100 };
loop {
match range.next() {
Some(_) => {},
None => break,
}
}
}
Changing the unused return type of next to () or usize causes the loop to be removed; as does uncommenting the indicated line. Returning a constant value of 0 from the next function does not remove the loop.
The following c code is optimized in a very similar way by LLVM via clang - so does look like an LLVM bug. Not that we can't work around it. For reference - this code optimizes down to nothing in gcc 6.3.0.
struct range {
int start;
int end;
};
struct option {
int ret;
int tag;
};
struct option next(struct range *range) {
if (range->start < range->end) {
int n = range->start;
range->start++;
struct option option = {n, 1};
return option;
} else {
struct option option = {0, 0};
return option;
}
}
int main() {
struct range range = {0, 100};
for(;;) {
struct option option = next(&range);
if (option.tag) {
} else {
break;
}
}
return 0;
}
Minimal example -
fn main() {
let mut start = 0;
let end = 10000;
loop {
let tag;
if start < end {
start += 1;
tag = true;
} else {
tag = false;
}
if !tag { break; }
}
}
int main() {
int start = 0;
int end = 10000;
int tag = 0;
do {
tag = 0;
if (start < end) start++, tag = 1;
} while (tag);
return 0;
}
How about changing the range to start at 1
fn main() {
for _ in 1 .. 100 {}
}
Below is the result where the loop is removed.
$ rustc main_start_1.rs -C opt-level=2 -o builds/main_start_1
$ objdump -x builds/main_start_1 | grep main
builds/main_start_1: file format mach-o-x86-64
builds/main_start_1
00000001000008b0 l 0e SECT 01 0000 [.text] __ZN12main_start_14main17hb1ec1b2167cb471eE
00000001000008c0 g 0f SECT 01 0000 [.text] _main
$ objdump -D --start-address=0x1000008b0 builds/main_start_1 | awk '{print $0} $3~/retq?/{exit}'
builds/main_start_1: file format mach-o-x86-64
Disassembly of section .text:
00000001000008b0 <__ZN12main_start_14main17hb1ec1b2167cb471eE>:
1000008b0: 55 push %rbp
1000008b1: 48 89 e5 mov %rsp,%rbp
1000008b4: 5d pop %rbp
1000008b5: c3 retq
$ rustc -vV
rustc 1.18.0-nightly (bbdaad0dc 2017-04-14)
binary: rustc
commit-hash: bbdaad0dc8dc64e036ccee817f90a91876b32a9d
commit-date: 2017-04-14
host: x86_64-apple-darwin
release: 1.18.0-nightly
LLVM version: 3.9
I wonder why the results differ if started at 1 than 0.
@nateozem that's probably because then the loop count is below 100 (below scalar-evolution-max-iterations), and then the optimisation works.
@djzin
Interestingly, in your example either changing the loop condition to != instead of <, or setting self.start to self.end or u32::max_value() optimizes out the loop when end is < than 147 rather than 100.
One possible workaround to this issue could be if for loops started using the new for_each trait, which could be specialized for ranges of primitive numbers to avoid the option value entirely.
@djzin i've filled https://bugs.llvm.org/show_bug.cgi?id=34538 and giving you credit for the code and the mwe, you might want to subscribe to the issue
Fixed in #47828
Most helpful comment
Interestingly if the type of the range is at least 64-bit on x86_64, the loop is also completely elided.
Loop-vectorization failed if the type is
i8,u8,i16,u16,i32,u32. That is, the failure happens only when the integer size is smaller than the native word size.