I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.
$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0
$ cargo bench --bench set_sum
Compiling [...]
Finished release [optimized] target(s) in 5.51 secs
Running target/release/deps/set_sum-833cf161cf760670
running 4 tests
test rayon_set_sum_parallel ... bench: 2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial ... bench: 7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel ... bench: 10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial ... bench: 7,733,258 ns/iter (+/- 134,850)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
Compiling [...]
Finished release [optimized] target(s) in 6.48 secs
Running target/release/deps/set_sum-833cf161cf760670
running 4 tests
test rayon_set_sum_parallel ... bench: 1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial ... bench: 6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel ... bench: 8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial ... bench: 6,144,775 ns/iter (+/- 75,377)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
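For context, the workload being measured is just a sum over a set's elements. A minimal serial sketch of that shape (hypothetical; `set_sum` here is a stand-in, not the actual rayon-hash benchmark source):

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the serial half of the benchmark:
// iterate the whole set and sum every element.
fn set_sum(set: &HashSet<u64>) -> u64 {
    set.iter().sum()
}

fn main() {
    let set: HashSet<u64> = (0..1_000).collect();
    // Sum of 0..1000 is 999 * 1000 / 2 = 499500.
    println!("{}", set_sum(&set));
}
```

The parallel variants do the same reduction through rayon's `ParallelIterator::sum`, which is where `bridge_unindexed_producer_consumer` enters the profile.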
rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from multiple codegen units.
From profiling and disassembly, this seems to mostly be a lost inlining opportunity. In the slower build, the profile is split 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, 28% Sum::sum, and the hot loop in the first looks like:
16.72 │126d0:   cmpq $0x0,(%r12,%rbx,8)
 6.73 │126d5: ↓ jne 126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
16.65 │126d7:   inc %rbx
 0.00 │126da:   cmp %rbp,%rbx
 7.20 │126dd: ↑ jb 126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
 0.05 │126df: ↓ jmp 1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
26.93 │126e1:   mov 0x0(%r13,%rbx,4),%eax
 4.26 │126e6:   movq $0x1,0x38(%rsp)
 2.27 │126ef:   mov %rax,0x40(%rsp)
 1.88 │126f4:   mov %r14,%rdi
 4.58 │126f7: → callq 15b90 <<u64 as core::iter::traits::Sum>::sum>
 0.68 │126fc:   movq $0x1,0x38(%rsp)
 2.58 │12705:   mov %r15,0x40(%rsp)
 0.62 │1270a:   movq $0x1,0x48(%rsp)
 0.31 │12713:   mov %rax,0x50(%rsp)
 0.49 │12718:   movb $0x0,0x58(%rsp)
 2.50 │1271d:   xor %esi,%esi
 0.41 │1271f:   mov %r14,%rdi
 0.85 │12722: → callq 153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
 1.30 │12727:   mov %rax,%r15
 2.16 │1272a: ↑ jmp 126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>
With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:
 2.28 │1426d:   test %rbx,%rbx
      │14270: ↓ je 14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
 5.40 │14272:   mov (%rbx),%ebx
 2.75 │14274:   add %rbx,%rax
 1.47 │14277:   lea (%rdx,%rsi,4),%rbx
 0.21 │1427b:   nopl 0x0(%rax,%rax,1)
29.56 │14280:   cmp %rdi,%rsi
 0.04 │14283: ↓ jae 14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
 2.87 │14285:   add $0x4,%rbx
20.22 │14289:   cmpq $0x0,(%rcx,%rsi,8)
 1.48 │1428e:   lea 0x1(%rsi),%rsi
 8.00 │14292: ↑ je 14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
25.25 │14294: ↑ jmp 1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>
@cuviper I wonder if we can add any #[inline] hints to help resolve this?
@nikomatsakis you mean in libstd? Neither Chain::fold nor u64::sum have #[inline] attributes today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.
But I hope we don't have to generally recommend #[inline] all over the place...
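For reference, the kind of hint under discussion looks like this (a sketch against a hypothetical generic function, not the actual `core::iter` source):

```rust
// Hypothetical stand-in for a generic libstd function like Chain::fold
// or u64::sum. Generic functions are instantiated in the downstream
// crate, but the instantiation may still land in a separate codegen
// unit; #[inline] keeps the body available for cross-CGU inlining.
#[inline]
fn fold_sum<I: Iterator<Item = u64>>(iter: I) -> u64 {
    iter.fold(0, |acc, x| acc + x)
}

fn main() {
    // 1 + 2 + ... + 10 = 55
    println!("{}", fold_sum(1..=10u64));
}
```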
I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.
@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.
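For completeness, ThinLTO as mentioned above can be requested per profile; on current Cargo the spelling is roughly (a sketch, applied to the release profile):

```toml
# Cargo.toml: ask for ThinLTO, which can re-enable cross-CGU inlining
# without giving up parallel codegen entirely.
[profile.release]
lto = "thin"
```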
FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:
$ rustc -Vv
rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
binary: rustc
commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
commit-date: 2018-02-13
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 6.0
$ cargo bench --bench set_sum
Updating registry `https://github.com/rust-lang/crates.io-index`
Compiling [...]
Finished release [optimized] target(s) in 5.37 secs
Running target/release/deps/set_sum-d85cde9ece817246
running 4 tests
test rayon_set_sum_parallel ... bench: 1,420,138 ns/iter (+/- 34,709)
test rayon_set_sum_serial ... bench: 7,718,603 ns/iter (+/- 141,127)
test std_set_sum_parallel ... bench: 8,886,208 ns/iter (+/- 137,102)
test std_set_sum_serial ... bench: 7,845,670 ns/iter (+/- 106,685)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
Compiling [...]
Finished release [optimized] target(s) in 6.40 secs
Running target/release/deps/set_sum-d85cde9ece817246
running 4 tests
test rayon_set_sum_parallel ... bench: 1,089,784 ns/iter (+/- 167,357)
test rayon_set_sum_serial ... bench: 6,196,760 ns/iter (+/- 335,661)
test std_set_sum_parallel ... bench: 8,497,259 ns/iter (+/- 128,929)
test std_set_sum_serial ... bench: 6,151,935 ns/iter (+/- 76,954)
test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out
The profile of rayon_set_sum_parallel is now 49.5% bridge_unindexed_producer_consumer, 48% Sum::sum, with this hot loop:
19.74 │14fa0:   cmpq $0x0,(%r12,%rbx,8)
 6.54 │14fa5: ↓ jne 14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
21.06 │14fa7:   add $0x1,%rbx
      │14fab:   cmp %rbp,%rbx
 4.04 │14fae: ↑ jb 14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
 0.04 │14fb0: ↓ jmp 14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
28.36 │14fb2:   mov 0x0(%r13,%rbx,4),%eax
 4.34 │14fb7:   movq $0x1,0x58(%rsp)
 5.45 │14fc0:   mov %rax,0x60(%rsp)
 0.84 │14fc5:   mov %r14,%rdi
 3.80 │14fc8: → callq 16380 <<u64 as core::iter::traits::Sum>::sum>
 2.52 │14fcd:   add %rax,%r15
 2.07 │14fd0:   add $0x1,%rbx
      │14fd4:   cmp %rbp,%rbx
 0.26 │14fd7: ↑ jb 14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
versus codegen-units=1:
 0.63 │14092:   lea (%rdi,%rcx,4),%rbp
 0.01 │14096:   nopw %cs:0x0(%rax,%rax,1)
25.16 │140a0:   cmpq $0x0,(%rsi,%rcx,8)
 7.68 │140a5: ↓ jne 140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
28.34 │140a7:   add $0x1,%rcx
 0.21 │140ab:   add $0x4,%rbp
      │140af:   cmp %rdx,%rcx
 2.63 │140b2: ↑ jb 140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
 0.04 │140b4: ↓ jmp 140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
24.54 │140b6:   test %rbp,%rbp
 0.00 │140b9: ↓ je 140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
 0.32 │140bb:   add $0x1,%rcx
 5.10 │140bf:   mov 0x0(%rbp),%ebp
 3.31 │140c2:   add %rbp,%rax
      │140c5:   cmp %rdx,%rcx
 1.40 │140c8: ↑ jb 14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>
@rust-lang/cargo, maybe cargo bench should default to one codegen-unit?
If cargo bench uses one CGU but cargo build --release uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.
Yeah, I was thinking that too after I wrote that. --release really sounds like it should be the profile configured for best possible performance. We need an additional opt-dev profile that has "good enough" performance.
@michaelwoerister today Cargo uses separate profiles for cargo build --release and cargo bench, but by default they use the same options. So it is already possible to manually override the number of codegen units for benchmarks specifically.
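Concretely, that override would look something like this in Cargo.toml (a sketch; the bench profile applies to `cargo bench` only):

```toml
# Cargo.toml: force a single codegen unit for benchmarks,
# while leaving `cargo build --release` at its defaults.
[profile.bench]
codegen-units = 1
```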
However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also https://github.com/rust-lang/rfcs/pull/2282 which suggests adding user-defined profiles.
The question of whether or not to have an opt-dev profile and so forth has been around for some time. My current opinion is that we can probably keep it to two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and a few other minor optimizations. So the default might be something more like -O1 or -Og.
In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but I'm not benchmarking" -- that might be better served by customizing the "default" build settings for your project?
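A "mildly optimized" dev profile along those lines could be expressed today as (a sketch of the idea, not a settled default):

```toml
# Cargo.toml: mild optimization for day-to-day builds,
# keeping debuginfo intact, roughly in the spirit of -O1/-Og.
[profile.dev]
opt-level = 1
debug = true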
Still, it seems like there is ultimately maybe a need for three profiles:
Probably though this conversation should be happening somewhere else. =)
> Probably though this conversation should be happening somewhere else
Yeah, this is closely connected to the work on revamping profiles in Cargo.
I just benchmarked as part of updating to rayon 1.0, and it's back to a 2x slowdown, with Chain::fold appearing in the profile again. I guess the prior inlining improvement was just luck in whether those functions happened to be grouped into the same codegen unit.