Rust: LTO triggers apparent miscompilation on Windows 10 x64

Created on 19 Jul 2020 · 23Comments · Source: rust-lang/rust

I tried this code:

Cargo.toml:

[package]
name = "bmp-minimal"
version = "0.1.0"
authors = ["Malloc Voidstar <[email protected]>"]
edition = "2018"

[dependencies]

[profile.release]
lto = true
codegen-units = 1

main.rs:

use std::io::Write;

fn main() {
    let mut out = Vec::with_capacity(51200);
    for val in 0u8..=255 {
        out.write_all(&[val, val, val, 0]).unwrap();
    }

    for _ in (0..1).rev() {
        for _ in 0..1 {
            out.write_all(&[0]).unwrap();
        }
    }
    println!("{} bytes written", out.len());
}

(extremely minimized form of grayscale BMP encoding from image crate)

Command: cargo run --release

I expected to see this happen: 1025 bytes written

Instead, this happened:

memory allocation of 26843545600 bytes failederror: process didn't exit successfully: `target\release\bmp-minimal.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

Most helpful comment

A bit smaller,

use std::io::Write;

fn main() {
    let mut out = Vec::new();
    for val in 0u8..=255 {
        out.write_all(&[val, val]).unwrap();
    }

    out.write_all(&[0]).unwrap();
}

[profile.release]
lto = true
codegen-units = 1

connorskees on 20 Jul 2020

👍2

All 23 comments

@rustbot ping llvm

LeSeulArtichaut on 19 Jul 2020

Hey LLVM ICE-breakers! This bug has been identified as a good
"LLVM ICE-breaking candidate". In case it's useful, here are some
[instructions] for tackling these sorts of bugs. Maybe take a look?
Thanks! <3

cc @camelid @comex @cuviper @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj

rustbot on 19 Jul 2020

It appears to work fine on ARM Linux in my limited testing.

@rustbot ping windows

LeSeulArtichaut on 19 Jul 2020

Hey Windows Group! This bug has been identified as a good "Windows candidate".
In case it's useful, here are some [instructions] for tackling these sorts of
bugs. Maybe take a look?
Thanks! <3

cc @arlosi @danielframpton @gdr-at-ms @kennykerr @luqmana @lzybkr @retep998 @rylev @sivadeilra

rustbot on 19 Jul 2020

Works fine on Windows-gnu:

$ cargo run --release
   Compiling bmp-minimal v0.1.0 (C:\Users\mateusz\AppData\Local\Temp\tmp)
    Finished release [optimized] target(s) in 3.71s
     Running `target\release\bmp-minimal.exe`
1025 bytes written

With MSVC it endlessly allocates memory.

@rustbot modify labels: -O-windows +O-windows-msvc

mati865 on 19 Jul 2020

👍2

A bit smaller,

use std::io::Write;

fn main() {
    let mut out = Vec::new();
    for val in 0u8..=255 {
        out.write_all(&[val, val]).unwrap();
    }

    out.write_all(&[0]).unwrap();
}

[profile.release]
lto = true
codegen-units = 1

connorskees on 20 Jul 2020

👍2

Another datapoint: adding -C llvm-args=-cvp-dont-add-nowrap-flags makes it work.

That flag changed from default true to false about a year ago which kinda lines up with the repro chart above (minus 1.43.1). There was an LLVM bug about that causing some miscompilations last I checked.

cc @nikic who I believe worked on that upstream.

luqmana on 20 Jul 2020

@luqmana Nowrap handling in LFTR post-inc canonicalization was fixed when the CVP flag was enabled, so this is most likely a new issue. Can you share how the optimized IR looks like on Windows? In particular it would be interesting to see whether it already infinitely loops there, or whether this goes wrong in the backend (e.g. in LSR).

nikic on 20 Jul 2020

> $Env:RUSTFLAGS="-C llvm-args=-cvp-dont-add-nowrap-flags"
> cargo run --release
   Compiling bmp-minimal v0.1.0 (C:\X\bmp-minimal)
    Finished release [optimized] target(s) in 2.39s
     Running `target\release\bmp-minimal.exe`
memory allocation of 13421772800 bytes failederror: process didn't exit successfully: `target\release\bmp-minimal.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)
> rustc --version
rustc 1.45.0 (5c1f21c3b 2020-07-13)

Still fails on my end, unless I'm doing something wrong.

AlyoshaVasilieva on 20 Jul 2020

Assigning P-critical as discussed as part of the Prioritization Working Group procedure and removing I-prioritize.

JohnTitor on 21 Jul 2020

Reproduced on LLVM master and simplified: Run opt -loop-reduce -S on this:

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

declare void @dummy(i8 zeroext)

define void @test() {
  br label %1

1:
  %2 = phi i8 [ 0, %0 ], [ %4, %1 ]
  call void @dummy(i8 %2)

  %3 = icmp eq i8 %2, -1
  %4 = add nuw i8 %2, 1
  br i1 %3, label %5, label %1

5:
  ret void
}

The icmp gets changed to comparing the output of the 8-bit addition to 0, even though it will be poison if the input was 255 (because the add is marked as nuw).

comex on 21 Jul 2020

👀1 👍1

Heh, so this is indeed the same as the LFTR bug, just for LSR. Surprised it took this long to surface.

nikic on 21 Jul 2020

Thanks @comex, that's a lot simpler than the repro I ended up getting down to and it also demonstrates it's not windows specific.

luqmana on 21 Jul 2020

@AlyoshaVasilieva commented:

> $Env:RUSTFLAGS="-C llvm-args=-cvp-dont-add-nowrap-flags"
> cargo run --release
   Compiling bmp-minimal v0.1.0 (C:\X\bmp-minimal)
    Finished release [optimized] target(s) in 2.39s
     Running `target\release\bmp-minimal.exe`
memory allocation of 13421772800 bytes failederror: process didn't exit successfully: `target\release\bmp-minimal.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)
> rustc --version
rustc 1.45.0 (5c1f21c3b 2020-07-13)

Still fails on my end, unless I'm doing something wrong.

I'm no windows expert, but you might be doing something wrong...

Or at least, here is my local transcript:

c:\Users\pfxbo\issue_74498>set RUSTFLAGS=
set RUSTFLAGS=

c:\Users\pfxbo\issue_74498>echo %RUSTFLAGS%
echo %RUSTFLAGS%
%RUSTFLAGS%

c:\Users\pfxbo\issue_74498>cargo run --release
cargo run --release
   Compiling issue_74498 v0.1.0 (C:\Users\pfxbo\issue_74498)
    Finished release [optimized] target(s) in 2.25s
     Running `target\release\issue_74498.exe`
memory allocation of 53687091200 bytes failederror: process didn't exit successfully: `target\release\issue_74498.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

c:\Users\pfxbo\issue_74498>set RUSTFLAGS=-Cllvm-args=-cvp-dont-add-nowrap-flags
set RUSTFLAGS=-Cllvm-args=-cvp-dont-add-nowrap-flags

c:\Users\pfxbo\issue_74498>echo %RUSTFLAGS%
echo %RUSTFLAGS%
-Cllvm-args=-cvp-dont-add-nowrap-flags

c:\Users\pfxbo\issue_74498>cargo run --release
cargo run --release
   Compiling issue_74498 v0.1.0 (C:\Users\pfxbo\issue_74498)
    Finished release [optimized] target(s) in 2.18s
     Running `target\release\issue_74498.exe`
1025 bytes written

c:\Users\pfxbo\issue_74498>cargo --version
cargo --version
cargo 1.44.1 (88ba85757 2020-06-11)

c:\Users\pfxbo\issue_74498>rustc --version
rustc --version
rustc 1.44.1 (c7087fe00 2020-06-17)

c:\Users\pfxbo\issue_74498>

Having said that, I did see some evidence in some other runs that maybe setting this flag doesn't resolve the issue in all cases. I haven't gotten to the bottom of those observations yet; I just wanted to note that I had observed some sensitivity to the flag.

pnkfelix on 24 Jul 2020

@pnkfelix I believe the arg is being set correctly, because:

> $Env:RUSTFLAGS="-C llvm-args=-cvp-dont-add-nowrap-flags-GARBAGE-GARBAGE"
> cargo run --release
error: failed to run `rustc` to learn about target-specific information

Caused by:
  process didn't exit successfully: `rustc - --crate-name ___ --print=file-names -C llvm-args=-cvp-dont-add-nowrap-flags-GARBAGE-GARBAGE --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=cfg` (exit code: 1)
--- stderr
rustc: Unknown command line argument '-cvp-dont-add-nowrap-flags-GARBAGE-GARBAGE'.  Try: 'rustc --help'
rustc: Did you mean '--cvp-dont-add-nowrap-flags'?

> $Env:RUSTFLAGS="-C llvm-args=-cvp-dont-add-nowrap-flags"
> cargo run --release
    Finished release [optimized] target(s) in 0.07s
     Running `target\release\bmp-minimal.exe`
memory allocation of 13421772800 bytes failederror: process didn't exit successfully: `target\release\bmp-minimal.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

If RUSTFLAGS wasn't being set properly, then the garbage arg wouldn't cause an issue.

Above is using Powershell; I also tried using cmd using your commands:

>set RUSTFLAGS=-Cllvm-args=-cvp-dont-add-nowrap-flags
>echo %RUSTFLAGS%
-Cllvm-args=-cvp-dont-add-nowrap-flags
>cargo run --release
   Compiling bmp-minimal v0.1.0 (C:\X\bmp-minimal)
    Finished release [optimized] target(s) in 4.46s
     Running `target\release\bmp-minimal.exe`
memory allocation of 13421772800 bytes failederror: process didn't exit successfully: `target\release\bmp-minimal.exe` (exit code: 0xc0000409, STATUS_STACK_BUFFER_OVERRUN)

and received the same failure.

I see you are using Rust 1.44.1 in your example. I tried using that arg on Rust 1.44.1 and it does indeed cause a successful run. On 1.45, however, it fails with or without the arg. (of course, there may be some other thing causing the run to succeed/fail; this is way outside my knowledge base)

AlyoshaVasilieva on 24 Jul 2020

Here's the reduced IR of the minimal rust code here: https://github.com/rust-lang/rust/issues/74498#issuecomment-660722030, a bit simplified and with some renaming

target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc19.26.28806"

define i32 @main() unnamed_addr {
entry:
  %proc_heap = tail call i8* @GetProcessHeap()
  %vec_ptr0 = tail call i8* @HeapAlloc(i8* %proc_heap, i32 0, i64 8)
  br label %loop_hdr
loop_hdr:                                         ; preds = %loop_write, %entry
  %vec_ptr1 = phi i8* [ %vec_ptr0, %entry ], [ %vec_ptr2, %loop_write ]
  %vec_cap1 = phi i64 [ 8, %entry ], [ %new_vec_cap1, %loop_write ]
  %vec_len1 = phi i64 [ 0, %entry ], [ %new_vec_len1, %loop_write ]
  %val = phi i8 [ 0, %entry ], [ %1, %loop_write ]
  %0 = icmp eq i8 %val, 255
  %1 = add nuw i8 %val, 1
  %2 = sub i64 %vec_cap1, %vec_len1
  %3 = icmp ult i64 %2, 2
  br i1 %3, label %check_realloc, label %pre_loop_write
check_realloc:                   ; preds = %loop_hdr
  %len_plus_two_checked = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %vec_len1, i64 2)
  %checked_res = extractvalue { i64, i1 } %len_plus_two_checked, 1
  br i1 %checked_res, label %exit, label %realloc
realloc:                         ; preds = %check_realloc
  %len_plus_two = extractvalue { i64, i1 } %len_plus_two_checked, 0
  %cap_times_two = shl i64 %vec_cap1, 1
  %4 = icmp ugt i64 %cap_times_two, %len_plus_two
  %new_cap = select i1 %4, i64 %cap_times_two, i64 %len_plus_two
  %new_vec_ptr = tail call i8* @HeapReAlloc(i8* %proc_heap, i32 0, i8* %vec_ptr1, i64 %new_cap)
  br label %loop_write
pre_loop_write:                                             ; preds = %loop_hdr
  %new_vec_len2 = add i64 %vec_len1, 2
  br label %loop_write
loop_write:                                       ; preds = %realloc, %pre_loop_write
  %new_vec_len1 = phi i64 [ %new_vec_len2, %pre_loop_write ], [ %len_plus_two, %realloc ]
  %vec_ptr2 = phi i8* [ %vec_ptr1, %pre_loop_write ], [ %new_vec_ptr, %realloc ]
  %new_vec_cap1 = phi i64 [ %vec_cap1, %pre_loop_write ], [ %new_cap, %realloc ]
  %vec_store_slot1 = getelementptr inbounds i8, i8* %vec_ptr2, i64 %vec_len1
  store i8 %val, i8* %vec_store_slot1, align 1
  %vec_store_slot2 = getelementptr inbounds i8, i8* %vec_store_slot1, i64 1
  store i8 %val, i8* %vec_store_slot2, align 1
  br i1 %0, label %exit, label %loop_hdr
exit:                                             ; preds = %loop_write, %check_realloc
  %exit_code = phi i32 [ 0, %loop_write ], [ 1, %check_realloc ]
  ret i32 %exit_code
}
declare { i64, i1 } @llvm.uadd.with.overflow.i64(i64, i64)
declare i8* @GetProcessHeap() unnamed_addr
declare i8* @HeapAlloc(i8*, i32, i64) unnamed_addr
declare i8* @HeapReAlloc(i8*, i32, i8*, i64) unnamed_addr

Running this through clang -O (I tried v 10.0.0), the resulting binary will continue to allocate memory in an infinite loop.

Running it through opt -loop-reduce will return a slightly different IR:

loop_hdr:
[...]
  %0 = add nuw i8 %val, 1
[...]
loop_write:
[...]
  %4 = icmp eq i8 %0, 0
  br i1 %4, label %exit, label %loop_hdr

This is the same issue @comex mentioned above. LLVM should be clearing the no wrap flag on the add there after transforming to a post-increment check but it doesn't. Then later it just gets rid of the exit condition altogether since it believes that the add will never wrap, it can't possibly be equal to 0.

But that doesn't explain why the flag doesn't seem to work post 1.44, looking at the IR (with -cvp-dont-add-nowrap-flags=true):

1.44.1:

%_5.0.i.i.i = add i8 %iter.sroa.0.060, 1

1.45+:

%7 = add nuw i8 %iter.sroa.0.063, 1

So in 1.45+ we're not relying on LLVM's CVP to determine that the add has no unsigned wrap but some other mechanism. But after looking through print-after-all for a bit I'm pretty sure this new line in Range::next is the reason:
https://github.com/rust-lang/rust/blob/98efae87607aabd7c30b879befe61bf9c29eb978/library/core/src/iter/range.rs#L507

That boils down to an unchecked_add which in LLVM becomes LLVMBuildNUWAdd in this case.

Since that code was introduced in https://github.com/rust-lang/rust/pull/69659 that lines up with the behaviour we see in 1.45+.

Either way, still an LLVM bug that's just triggered more now.

luqmana on 25 Jul 2020

thank you for that excellent analysis @luqmana

pnkfelix on 27 Jul 2020

Filed https://bugs.llvm.org/show_bug.cgi?id=46943 with LLVM upstream.

pnkfelix on 1 Aug 2020

The only remaining question on our end is if we should, for the short term, change that range.rs code to not used an unchecked_add, in order to (hopefully) bypass the misoptimization here.

(That, or try to figure out a patch to LLVM to fix the bug on their end.)

pnkfelix on 1 Aug 2020

To be clear: my thinking is that we could change the range.rs and/or Step code to use wrapping_add rather than unchecked_add.

That way, we shouldn't generate any overflow checking code, so I wouldn't expect any significant performance hit.

At the same time, I'm not really a big fan of putting in such narrowly targetted hacks to work around LLVM bugs, especially ones that have been confirmed to be bugs on LLVM's end.

pnkfelix on 4 Aug 2020

At the same time, I'm not really a big fan of putting in such narrowly targetted hacks to work around LLVM bugs, especially ones that have been confirmed to be bugs on LLVM's end.

(having said that, this bug does seem to serve as evidence that our code is probably executing an overflowing addition via unchecked_add. Even though that "merely" generates poison under this particular codegen path (that should be unused and thus won't cause UB), strictly speaking, it is a violation of the contract written for Rust's unchecked_add:

Returns the result of an unchecked addition, resulting in undefined behavior when x + y > T::MAX or x + y < T::MIN.

I'm going to go in and confirm that this add may indeed be overflowing, and if it is, I probably will push for changing to wrapping_add here instead.

Update: okay, its hard to see how this could possibly be causing an overflowing addition under Rust's machine model:

        if self.start < self.end {
            // SAFETY: just checked precondition
            let n = unsafe { Step::forward_unchecked(self.start.clone(), 1) };
            ...
        }

pnkfelix on 4 Aug 2020

Downgrading priority to P-high as discussed on the compiler team meeting.

spastorino on 20 Aug 2020

I have a patch that fixes (what I believe to be) the underlying LLVM bug; see https://bugs.llvm.org/show_bug.cgi?id=46943 ; I'm working on making a LLVM unit test and then I'll put up a patch for the llvm-devs to review.

pnkfelix on 12 Nov 2020

👍1

Was this page helpful?

0 / 5 - 0 ratings