A user reported this to Criterion.rs - https://github.com/japaric/criterion.rs/issues/133
The short version is this - when the user's test crate's benchmarks are run on 1.24.0 or newer, Criterion.rs calculates some values incorrectly even though the code looks correct to me. They claim to have reproduced this behavior on Arch Linux, Windows and Raspbian. I have only been able to confirm it on Windows. Each of us has confirmed this behavior on multiple machines.
I was initially reluctant to call this a miscompilation, but when I started investigating it, any change I made caused the bug to stop happening - inserting println! calls, or even pushing values into a vector and printing them later, caused the bug to disappear. Eventually I tried simply adding a call to test::black_box, which should have no observable effect on the code except to inhibit certain compiler optimizations. That also caused the bug to stop occurring. It may still be a bug in my code, but if so I can't find it.
I've tried to create a minimal test case, but was unsuccessful. This bug is quite fragile.
Steps to reproduce:
Optionally check out criterion locally (this isn't necessary but will save some compilation time). Using 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark -- --verbose.
Expected output:
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2469702500 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (132784700 iterations)
Actual output:
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (2518899600 iterations)
...
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (2514900000 iterations)
Then check out the rustc_fix branch of https://github.com/japaric/criterion.rs and modify the Cargo.toml file of criterion-test.rs to use that instead. This patch merely enables the test feature/crate and inserts one call to test::black_box in routine.rs:warm_up. Verify that the expected behavior is restored.

Manual bisection log using rustup:
good rustc 1.24.0-nightly (edbd7d232 2017-12-20)
bad rustc 1.25.0-nightly (73ac5d6a8 2018-01-11)
bad rustc 1.24.0-nightly (9389e23a8 2017-12-31)
good rustc 1.24.0-nightly (c284f8807 2017-12-24)
bad rustc 1.24.0-nightly (1abeb436d 2017-12-27) (nightly-2017-12-28 for rustup)
good rustc 1.24.0-nightly (cddc4a62d 2017-12-26) (nightly-2017-12-27 for rustup)
So the change is inside the commit range cddc4a62d...1abeb436d. I only see bugfixes and ICE fixes in that range...
Old state:
$ cargo +nightly-2017-12-27 bench --bench my_benchmark -- --verbose
Finished release [optimized] target(s) in 0.0 secs
Running target/release/deps/my_benchmark-c6ffda0acddb4b02
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (1316272400 iterations)
Benchmarking fib 1: Analyzing
fib 1 time: [767.18 ps 778.06 ps 789.83 ps]
change: [-3.1129% -1.4570% +0.2070%] (p = 0.09 > 0.05)
No change in performance detected.
Found 17 outliers among 100 measurements (17.00%)
2 (2.00%) high mild
15 (15.00%) high severe
slope [767.18 ps 789.83 ps] R^2 [0.7407196 0.7387507]
mean [764.74 ps 782.08 ps] std. dev. [29.638 ps 57.337 ps]
median [752.23 ps 756.69 ps] med. abs. dev. [3.0748 ps 9.0096 ps]
Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (224447250 iterations)
Benchmarking fib 2: Analyzing
fib 2 time: [4.3392 ns 4.3744 ns 4.4208 ns]
change: [-0.6421% +1.1830% +2.7979%] (p = 0.19 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe
slope [4.3392 ns 4.4208 ns] R^2 [0.8011943 0.7965177]
mean [4.3847 ns 4.4583 ns] std. dev. [130.47 ps 247.30 ps]
median [4.3365 ns 4.4123 ns] med. abs. dev. [111.97 ps 189.57 ps]
New state:
$ cargo +nightly-2017-12-28 bench --bench my_benchmark -- --verbose
Finished release [optimized] target(s) in 0.0 secs
Running target/release/deps/my_benchmark-08eeef1a988add44
Benchmarking fib 1
Benchmarking fib 1: Warming up for 3.0000 s
Benchmarking fib 1: Collecting 100 samples in estimated 1.0000 s (3763674100 iterations)
Benchmarking fib 1: Analyzing
fib 1 time: [266.56 ps 267.97 ps 269.51 ps]
change: [-65.621% -65.152% -64.693%] (p = 0.00 < 0.05)
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
2 (2.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
slope [266.56 ps 269.51 ps] R^2 [0.9507674 0.9500646]
mean [267.64 ps 271.29 ps] std. dev. [5.0293 ps 13.253 ps]
median [267.92 ps 269.51 ps] med. abs. dev. [2.5710 ps 5.5603 ps]
Benchmarking fib 2
Benchmarking fib 2: Warming up for 3.0000 s
Benchmarking fib 2: Collecting 100 samples in estimated 1.0000 s (3787974700 iterations)
Benchmarking fib 2: Analyzing
fib 2 time: [4.9535 ns 4.9735 ns 4.9952 ns]
change: [+11.531% +12.592% +13.578%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high mild
slope [4.9535 ns 4.9952 ns] R^2 [0.9572156 0.9568266]
mean [4.9577 ns 4.9946 ns] std. dev. [76.681 ps 110.54 ps]
median [4.9522 ns 4.9900 ns] med. abs. dev. [55.560 ps 100.25 ps]
As @kennytm rightfully pointed out, this lies within the 90 day limit. So we can in fact compare single commits!
Further bisection gives me:
bad 71ed31f24
good e8098c5c4
good 3fd27b271
So the change is caused by 3fd27b2718438ceec6243869cde3edde7e5a904e...71ed31f24054863b807f68f244c6828ce6d2e516 , PR #47007 by @eddyb . cc @rust-lang/compiler
Usually I only see a performance impact from some changes. Is there unsafe code involved? My fix could've let LLVM optimize some cases it never did before, which might break unsafe code.
There is unsafe code elsewhere in Criterion.rs, yes. Not in this part of the code though.
So it seems like there are only two possible ways forward:
We're going to call this P-high for now and revisit next week. I think the most plausible way to get to the bottom of this bug though is to compare the generated code with and without the workaround and try to see what optimization is being inhibited.
triage: P-high
I looked at the unsafe code in Criterion, briefly, and none of it seems capable of anything like this, so I'm assuming LLVM bug or unfixed Rust bug that affects optimizations.
I don't even think the alignment changed in #47007, only some LLVM optimizations unblocked.
I took another shot at minimizing this, taking a different approach. Rather than try to construct a minimal case from scratch, I instead made a branch of Criterion.rs and started deleting everything I could delete without masking the bug or breaking the test case. I can't guarantee that this is a minimal case, but it's much closer to it than I was able to provide before. It also compiles much faster than previously, which should make it easier to iterate.
Some things I discovered while doing so:
Moving the warm-up function from routine.rs to lib.rs causes the bug to stop occurring.
Updated steps to reproduce:
Check out the rustc_minimal branch of https://github.com/japaric/criterion.rs/. Using 1.23.0-x86_64-pc-windows-msvc, run cargo bench --bench my_benchmark in criterion-test.rs.
Expected output:
fib 1: Calculated iterations: 2422959700
fib 3: Calculated iterations: 127573100
Actual output:
fib 1: Calculated iterations: 2520631750
fib 3: Calculated iterations: 2521313500
I hope this helps. If you need any further help tracking this down, let me know.
Had a look at this again. TLDR: This doesn't seem like a miscompilation.
Looking at @mbillingr's example code:
fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(1)));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(3)));
}
We can see benchmarks being performed on a fibonacci-calculating function. With this code I see the following behaviour:
before:
fib 1: Calculated iterations: 1248410500
fib 2: Calculated iterations: 176871200
after:
fib 1: Calculated iterations: 3885823500
fib 2: Calculated iterations: 3739878500
here, "before" means nightly-2017-12-27 and "after" means nightly-2017-12-28 from rustup.
Now let's change the two functions a bit by evaluating the fibonacci expressions by hand:
fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| 1));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| 3));
}
We get:
before:
fib 1: Calculated iterations: 1282801000
fib 2: Calculated iterations: 1242027300
after:
fib 1: Calculated iterations: 3879278700
fib 2: Calculated iterations: 3869022150
Let's change it again to use black_box:
#![feature(test)]
extern crate test;
use test::black_box;

fn criterion_benchmark1(c: &mut Criterion) {
    c.bench_function("fib 1", |b| b.iter(|| fibonacci(black_box(1))));
}

fn criterion_benchmark2(c: &mut Criterion) {
    c.bench_function("fib 2", |b| b.iter(|| fibonacci(black_box(3))));
}
before:
fib 1: Calculated iterations: 473144600
fib 2: Calculated iterations: 206302600
after:
fib 1: Calculated iterations: 647743300
fib 2: Calculated iterations: 222977700
So for me it seems that @eddyb 's change enabled an optimization that allowed LLVM to evaluate the fibonacci(INTEGER_LITERAL) expression at compile time. Basically just LLVM doing its job.
Forgive me if I've missed something, but I don't think that explains everything. If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization? For that matter, it doesn't explain why moving the warmup function into the lib module without changing it also breaks the optimization.
Edited to add:
There's another important aspect that remains unexplained. Criterion.rs uses this iteration calculation to estimate how many times the benchmarked function should be called in order to take up a pre-configured measurement time. The original bug report that led to my raising this issue was that the incorrect iteration count was causing Criterion.rs to take much longer (minutes when it should be seconds) than expected in the measurement phase because it was iterating the benchmark many more times than it should have. If, as you suggest, the change is simply because an optimization made the fibonacci function faster and the iteration count is correct, then why is that iteration count resulting in an observed measurement time taking so much longer than it should be? That could only happen if the fibonacci function was optimized more during warmup than it is during measurement.
If it is just LLVM correctly optimizing the code, then why does black-boxing an apparently-unrelated value in the warmup routine break the optimization?
Optimizers often have weird behaviour. They rely on heuristics to decide whether to optimize something or not.
That could only happen if the fibonacci function was optimized more during warmup than it is during measurement.
Very much possible. LLVM could first inline and then optimize it to a constant.
I'm starting to think you might be right. I had discounted the possibility that the benchmarked function might get inlined all the way into warmup and measurement because I thought Rust couldn't inline functions across crates without #[inline], but when I add #[inline(never)] to the Bencher iteration methods that also seems to produce the expected results.
I thought Rust couldn't inline functions across crates without #[inline]

Depends, are there any type parameters? If so, you can't compile without instantiating cross-crate.
Ah, of course. You're correct, I'd forgotten about that. I think you're right, which suggests that the solution to this is to add a few carefully-chosen #[inline(never)] annotations. I've already tried that, and it does seem to work. Thanks for the help, and sorry for the false alarm!