A user reported this to Criterion.rs - https://github.com/japaric/criterion.rs/issues/133
The short version is this - when the user's test crate's benchmarks are run on 1.24.0 or newer, Criterion.rs calculates some values incorrectly even though the code looks correct to me. They claim to have reproduced this behavior on Arch Linux, Windows and Raspbian. I have only been able to confirm it on Windows. Each of us has confirmed this behavior on multiple machines.
I was initially reluctant to call this a miscompilation, but when I started investigating it, any change I made caused the bug to stop happening - inserting println! calls, or even pushing values into a vector and printing them later, caused the bug to disappear. Eventually I tried simply adding a call to test::black_box, which should have no observable effect on the code except to inhibit certain compiler optimizations. That also caused the bug to stop occurring. It may still be a bug in my code, but if so I can't find it.
I've tried to create a minimal test case, but was unsuccessful. This bug is quite fragile.
Steps to reproduce:
Optionally check out criterion locally (this isn't necessary but will save some compilation time). Using 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark -- --verbose.
Expected output:
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2469702500 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (132784700 iterations)
Actual output:
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2518899600 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (2514900000 iterations)
Then check out the rustc_fix branch of https://github.com/japaric/criterion.rs and modify the Cargo.toml file of criterion-test.rs to use that instead. This patch merely enables the test feature/crate and inserts one call to test::black_box in routine.rs:warm_up. Verify that the expected behavior is restored.

Manual bisection log using rustup:
good rustc 1.24.0-nightly (edbd7d232 2017-12-20)
bad rustc 1.25.0-nightly (73ac5d6a8 2018-01-11)
bad rustc 1.24.0-nightly (9389e23a8 2017-12-31)
good rustc 1.24.0-nightly (c284f8807 2017-12-24)
bad rustc 1.24.0-nightly (1abeb436d 2017-12-27) (nightly-2017-12-28 for rustup)
good rustc 1.24.0-nightly (cddc4a62d 2017-12-26) (nightly-2017-12-27 for rustup)
So the change is inside the commit range cddc4a62d...1abeb436d. I only see bugfixes and ICE fixes in that range...
Old state:
$ cargo +nightly-2017-12-27 bench --bench my_benchmark -- --verbose
Finished release [optimized] target(s) in 0.0 secs
Running target/release/deps/my_benchmark-c6ffda0acddb4b02
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (1316272400 iterations)
Benchmarking fib 1: Analyzing
fib 1 time: [767.18 ps 778.06 ps 789.83 ps]
change: [-3.1129% -1.4570% +0.2070%] (p = 0.09 > 0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
2 (2.00%) high mild
15 (15.00%) high severe
slope [767.18 ps 789.83 ps] R^2 [0.7407196 0.7387507]
mean [764.74 ps 782.08 ps] std. dev. [29.638 ps 57.337 ps]
median [752.23 ps 756.69 ps] med. abs. dev. [3.0748 ps 9.0096 ps]
Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (224447250 iterations)
Benchmarking fib 2: Analyzing
fib 2 time: [4.3392 ns 4.3744 ns 4.4208 ns]
change: [-0.6421% +1.1830% +2.7979%] (p = 0.19 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
slope [4.3392 ns 4.4208 ns] R^2 [0.8011943 0.7965177]
mean [4.3847 ns 4.4583 ns] std. dev. [130.47 ps 247.30 ps]
median [4.3365 ns 4.4123 ns] med. abs. dev. [111.97 ps 189.57 ps]
New state:
$ cargo +nightly-2017-12-28 bench --bench my_benchmark -- --verbose
Finished release [optimized] target(s) in 0.0 secs
Running target/release/deps/my_benchmark-08eeef1a988add44
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (3763674100 iterations)
Benchmarking fib 1: Analyzing
fib 1 time: [266.56 ps 267.97 ps 269.51 ps]
change: [-65.621% -65.152% -64.693%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
slope [266.56 ps 269.51 ps] R^2 [0.9507674 0.9500646]
mean [267.64 ps 271.29 ps] std. dev. [5.0293 ps 13.253 ps]
median [267.92 ps 269.51 ps] med. abs. dev. [2.5710 ps 5.5603 ps]
Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (3787974700 iterations)
Benchmarking fib 2: Analyzing
fib 2 time: [4.9535 ns 4.9735 ns 4.9952 ns]
change: [+11.531% +12.592% +13.578%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high mild
slope [4.9535 ns 4.9952 ns] R^2 [0.9572156 0.9568266]
mean [4.9577 ns 4.9946 ns] std. dev. [76.681 ps 110.54 ps]
median [4.9522 ns 4.9900 ns] med. abs. dev. [55.560 ps 100.25 ps]
As @kennytm rightfully pointed out, this lies within the 90 day limit. So we can in fact compare single commits!
Further bisection gives me:
bad 71ed31f24
good e8098c5c4
good 3fd27b271
So the change is caused by 3fd27b2718438ceec6243869cde3edde7e5a904e...71ed31f24054863b807f68f244c6828ce6d2e516 , PR #47007 by @eddyb . cc @rust-lang/compiler
Usually I only see a performance impact from some changes. Is there unsafe code involved? My fix could've let LLVM optimize some cases it never did before, which might break unsafe code.
There is unsafe code elsewhere in Criterion.rs, yes. Not in this part of the code though.
So it seems like there are only two possible ways forward:
We're going to call this P-high for now and revisit next week. I think the most plausible way to get to the bottom of this bug though is to compare the generated code with and without the workaround and try to see what optimization is being inhibited.
triage: P-high
I looked at the unsafe code in Criterion, briefly, and none of it seems capable of anything like this, so I'm assuming LLVM bug or unfixed Rust bug that affects optimizations.
I don't even think the alignment changed in #47007, only some LLVM optimizations unblocked.
I took another shot at minimizing this, taking a different approach. Rather than try to construct a minimal case from scratch, I instead made a branch of Criterion.rs and started deleting everything I could delete without masking the bug or breaking the test case. I can't guarantee that this is a minimal case, but it's much closer to it than I was able to provide before. It also compiles much faster than previously, which should make it easier to iterate.
Some things I discovered while doing so:
Moving the warm-up function from routine.rs to lib.rs causes the bug to stop occurring.
Updated steps to reproduce:
Check out the rustc_minimal branch of https://github.com/japaric/criterion.rs/. Using 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark in criterion-test.rs.
Expected output:
fib 1: Calculated iterations: 2422959700
fib 3: Calculated iterations: 127573100
Actual output:
fib 1: Calculated iterations: 2520631750
fib 3: Calculated iterations: 2521313500
I hope this helps. If you need any further help tracking this down, let me know.
Had a look at this again. TLDR: This doesn't seem like a miscompilation.
Looking at @mbillingr's example code:
fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(1)));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(3)));
}
We can see benchmarks being performed on a fibonacci-calculating function. With this code I see the following behaviour:
before:
fib 1: Calculated iterations: 1248410500
fib 2: Calculated iterations: 176871200
after:
fib 1: Calculated iterations: 3885823500
fib 2: Calculated iterations: 3739878500
here, "before" means nightly-2017-12-27 and "after" means nightly-2017-12-28 from rustup.
Now let's change the two functions a bit by evaluating the fibonacci expressions by hand:
fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| 1));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| 3));
}
We get:
before:
fib 1: Calculated iterations: 1282801000
fib 2: Calculated iterations: 1242027300
after:
fib 1: Calculated iterations: 3879278700
fib 2: Calculated iterations: 3869022150
Let's change it again to use black_box:
#![feature(test)]
extern crate test;
use test::black_box;

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(black_box(1))));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(black_box(3))));
}
before:
fib 1: Calculated iterations: 473144600
fib 2: Calculated iterations: 206302600
after:
fib 1: Calculated iterations: 647743300
fib 2: Calculated iterations: 222977700
So for me it seems that @eddyb 's change enabled an optimization that allowed LLVM to evaluate the fibonacci(INTEGER_LITERAL) expression at compile time. Basically just LLVM doing its job.
Forgive me if I've missed something, but I don't think that explains everything. If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization? For that matter, it doesn't explain why moving the warmup function into the lib module without changing it also breaks the optimization.
Edited to add:
There's another important aspect that remains unexplained. Criterion.rs uses this iteration calculation to estimate how many times the benchmarked function should be called in order to take up a pre-configured measurement time. The original bug report that led to my raising this issue was that the incorrect iteration count was causing Criterion.rs to take much longer (minutes when it should be seconds) than expected in the measurement phase because it was iterating the benchmark many more times than it should have. If, as you suggest, the change is simply because an optimization made the fibonacci function faster and the iteration count is correct, then why is that iteration count resulting in an observed measurement time taking so much longer than it should be? That could only happen if the fibonacci function was optimized more during warmup than it is during measurement.
If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization?
Optimizers often have weird behaviour. They rely on heuristics to decide whether to optimize something or not.
That could only happen if the fibonacci function was optimized more during warmup than it is during measurement.
Very much possible. LLVM could first inline and then optimize it to a constant.
I'm starting to think you might be right. I had discounted the possibility that the benchmarked function might get inlined all the way into warmup and measurement because I thought Rust couldn't inline functions across crates without #[inline], but when I add #[inline(never)] to the Bencher iteration methods that also seems to produce the expected results.
I thought Rust couldn't inline functions across crates without #[inline]

Depends, are there any type parameters? If so, you can't compile without instantiating cross-crate.
Ah, of course. You're correct, I'd forgotten about that. I think you're right, which suggests that the solution to this is to add a few carefully-chosen #[inline(never)] annotations. I've already tried that, and it does seem to work. Thanks for the help, and sorry for the false alarm!