Rust: 2x benchmark loss in rayon-hash from multiple codegen-units

Created on 22 Jan 2018 · 12 Comments · Source: rust-lang/rust

I'm seeing a huge slowdown in rayon-hash benchmarks, resolved with -Ccodegen-units=1.

$ rustc -Vv
rustc 1.25.0-nightly (97520ccb1 2018-01-21)
binary: rustc
commit-hash: 97520ccb101609af63f29919bb0a39115269c89e
commit-date: 2018-01-21
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 4.0

$ cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 5.51 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   2,295,348 ns/iter (+/- 152,025)
test rayon_set_sum_serial   ... bench:   7,730,830 ns/iter (+/- 171,552)
test std_set_sum_parallel   ... bench:  10,038,209 ns/iter (+/- 188,189)
test std_set_sum_serial     ... bench:   7,733,258 ns/iter (+/- 134,850)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.48 secs
     Running target/release/deps/set_sum-833cf161cf760670

running 4 tests
test rayon_set_sum_parallel ... bench:   1,092,732 ns/iter (+/- 105,979)
test rayon_set_sum_serial   ... bench:   6,152,751 ns/iter (+/- 83,103)
test std_set_sum_parallel   ... bench:   8,957,618 ns/iter (+/- 132,791)
test std_set_sum_serial     ... bench:   6,144,775 ns/iter (+/- 75,377)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

rayon_set_sum_parallel is the showcase for this crate, and it suffers the most from multiple codegen units.

From profiling and disassembly, this mostly seems to be a lost inlining opportunity. In the slower build, the profile splits roughly 35% bridge_unindexed_producer_consumer, 34% Iterator::fold, and 28% Sum::sum, and the hot loop in the first looks like:

 16.72 │126d0:   cmpq   $0x0,(%r12,%rbx,8)
  6.73 │126d5: ↓ jne    126e1 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x201>
 16.65 │126d7:   inc    %rbx
  0.00 │126da:   cmp    %rbp,%rbx
  7.20 │126dd: ↑ jb     126d0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f0>
  0.05 │126df: ↓ jmp    1272f <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x24f>
 26.93 │126e1:   mov    0x0(%r13,%rbx,4),%eax
  4.26 │126e6:   movq   $0x1,0x38(%rsp)
  2.27 │126ef:   mov    %rax,0x40(%rsp)
  1.88 │126f4:   mov    %r14,%rdi
  4.58 │126f7: → callq  15b90 <<u64 as core::iter::traits::Sum>::sum>
  0.68 │126fc:   movq   $0x1,0x38(%rsp)
  2.58 │12705:   mov    %r15,0x40(%rsp)
  0.62 │1270a:   movq   $0x1,0x48(%rsp)
  0.31 │12713:   mov    %rax,0x50(%rsp)
  0.49 │12718:   movb   $0x0,0x58(%rsp)
  2.50 │1271d:   xor    %esi,%esi
  0.41 │1271f:   mov    %r14,%rdi
  0.85 │12722: → callq  153f0 <<core::iter::Chain<A, B> as core::iter::iterator::Iterator>::fold>
  1.30 │12727:   mov    %rax,%r15
  2.16 │1272a: ↑ jmp    126d7 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1f7>

With CGU=1, 96% of the profile is in bridge_unindexed_producer_consumer, with this hot loop:

  2.28 │1426d:   test   %rbx,%rbx
       │14270: ↓ je     14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  5.40 │14272:   mov    (%rbx),%ebx
  2.75 │14274:   add    %rbx,%rax
  1.47 │14277:   lea    (%rdx,%rsi,4),%rbx
  0.21 │1427b:   nopl   0x0(%rax,%rax,1)
 29.56 │14280:   cmp    %rdi,%rsi
  0.04 │14283: ↓ jae    14296 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x146>
  2.87 │14285:   add    $0x4,%rbx
 20.22 │14289:   cmpq   $0x0,(%rcx,%rsi,8)
  1.48 │1428e:   lea    0x1(%rsi),%rsi
  8.00 │14292: ↑ je     14280 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x130>
 25.25 │14294: ↑ jmp    1426d <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x11d>

A-codegen C-enhancement I-slow T-compiler

All 12 comments

@cuviper I wonder if we can add any #[inline] hints to help resolve this?

@nikomatsakis you mean in libstd? Neither Chain::fold nor u64::sum have #[inline] attributes today, but usually that's not needed for generic functions. I guess being generic means they didn't (couldn't) get pre-compiled in libstd, but I suppose they must have gotten separate codegen units in the end.

But I hope we don't have to generally recommend #[inline] all over the place...

I'm not entirely sure this isn't expected; 2x is a bit much though. We might need to wait on the heuristics that @michaelwoerister has planned.

@cuviper I'm not sure that being generic affects their need for an inlining hint. It is true that those functions will be instantiated in the downstream crate, but I think they are still isolated into their own codegen unit. (ThinLTO is of course supposed to help here.) I'm not sure what's the best fix though.

FWIW, the gap has closed some (perhaps from LLVM 6?), with parallel codegen now "only" 40% slower:

$ rustc -Vv
rustc 1.25.0-nightly (4d2d3fc5d 2018-02-13)
binary: rustc
commit-hash: 4d2d3fc5dadf894a8ad709a5860a549f2c0b1032
commit-date: 2018-02-13
host: x86_64-unknown-linux-gnu
release: 1.25.0-nightly
LLVM version: 6.0

$ cargo bench --bench set_sum
    Updating registry `https://github.com/rust-lang/crates.io-index`
   Compiling [...]
    Finished release [optimized] target(s) in 5.37 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,420,138 ns/iter (+/- 34,709)
test rayon_set_sum_serial   ... bench:   7,718,603 ns/iter (+/- 141,127)
test std_set_sum_parallel   ... bench:   8,886,208 ns/iter (+/- 137,102)
test std_set_sum_serial     ... bench:   7,845,670 ns/iter (+/- 106,685)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

$ RUSTFLAGS=-Ccodegen-units=1 cargo bench --bench set_sum
   Compiling [...]
    Finished release [optimized] target(s) in 6.40 secs
     Running target/release/deps/set_sum-d85cde9ece817246

running 4 tests
test rayon_set_sum_parallel ... bench:   1,089,784 ns/iter (+/- 167,357)
test rayon_set_sum_serial   ... bench:   6,196,760 ns/iter (+/- 335,661)
test std_set_sum_parallel   ... bench:   8,497,259 ns/iter (+/- 128,929)
test std_set_sum_serial     ... bench:   6,151,935 ns/iter (+/- 76,954)

test result: ok. 0 passed; 0 failed; 0 ignored; 4 measured; 0 filtered out

The profile of rayon_set_sum_parallel is now 49.5% bridge_unindexed_producer_consumer, 48% Sum::sum, with this hot loop:

 19.74 │14fa0:   cmpq   $0x0,(%r12,%rbx,8)
  6.54 │14fa5: ↓ jne    14fb2 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1a2>
 21.06 │14fa7:   add    $0x1,%rbx
       │14fab:   cmp    %rbp,%rbx
  4.04 │14fae: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>
  0.04 │14fb0: ↓ jmp    14fde <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x1ce>
 28.36 │14fb2:   mov    0x0(%r13,%rbx,4),%eax
  4.34 │14fb7:   movq   $0x1,0x58(%rsp)
  5.45 │14fc0:   mov    %rax,0x60(%rsp)
  0.84 │14fc5:   mov    %r14,%rdi
  3.80 │14fc8: → callq  16380 <<u64 as core::iter::traits::Sum>::sum>
  2.52 │14fcd:   add    %rax,%r15
  2.07 │14fd0:   add    $0x1,%rbx
       │14fd4:   cmp    %rbp,%rbx
  0.26 │14fd7: ↑ jb     14fa0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x190>

versus codegen-units=1:

  0.63 │14092:   lea    (%rdi,%rcx,4),%rbp
  0.01 │14096:   nopw   %cs:0x0(%rax,%rax,1)
 25.16 │140a0:   cmpq   $0x0,(%rsi,%rcx,8)
  7.68 │140a5: ↓ jne    140b6 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x176>
 28.34 │140a7:   add    $0x1,%rcx
  0.21 │140ab:   add    $0x4,%rbp
       │140af:   cmp    %rdx,%rcx
  2.63 │140b2: ↑ jb     140a0 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x160>
  0.04 │140b4: ↓ jmp    140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
 24.54 │140b6:   test   %rbp,%rbp
  0.00 │140b9: ↓ je     140ce <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x18e>
  0.32 │140bb:   add    $0x1,%rcx
  5.10 │140bf:   mov    0x0(%rbp),%ebp
  3.31 │140c2:   add    %rbp,%rax
       │140c5:   cmp    %rdx,%rcx
  1.40 │140c8: ↑ jb     14092 <rayon::iter::plumbing::bridge_unindexed_producer_consumer+0x152>

@rust-lang/cargo, maybe cargo bench should default to one codegen-unit?

If cargo bench uses one CGU but cargo build --release uses multiple CGUs, benchmark results are less representative of the code that actually runs at the end of the day.

Yeah, I was thinking that too after I wrote that. --release really sounds like it should be the profile configured for best possible performance. We need an additional opt-dev profile that has "good enough" performance.

@michaelwoerister today Cargo uses separate profiles for cargo build --release and cargo bench, but, by default, they use the same options. So it is possible to override number of cgus for benchmarks specifically manually.
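For example, a hypothetical Cargo.toml fragment pinning benchmarks (but not release builds) to one codegen unit might look like:

```toml
# Override only the bench profile; `cargo build --release` keeps its defaults.
[profile.bench]
codegen-units = 1
```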

However, long-term there is a desire to phase out all special profiles except release and dev (at least in the narrow sense of profiles, as "flags for the compiler"). See also https://github.com/rust-lang/rfcs/pull/2282 which suggests adding user-defined profiles.

The question of whether or not to have an opt-dev profile and so forth has been around for some time. My current opinion is that we can probably keep it two options, but I would do this by altering the "debug" profile: instead of doing no optimization, it should do some minimal amount of optimization. It will take some time to determine precisely what that means, but I strongly suspect that we can get "decently performing code" by doing some mild inlining and few other minor optimizations. So the default might be something more like -O1 or -Og.

In that case, we might consider some other profiles (e.g., no optimization at all, or mild optimization), but I sort of suspect that the use cases are thinner. In my experience, the only reason to want -O0 was for debuginfo, and let's premise that our default profile keeps debuginfo intact. Something like opt-dev I guess corresponds to "I don't care about debuginfo but I'm not benchmarking" -- that might be better served by customizing the "default" build settings for your project?

Still, it seems like there is ultimately maybe a need for three profiles:

  • super-debug: no opts, I want perfect debuginfo
  • debug (default): do some optimization. By default, preserves debuginfo, but maybe you can "tune it up" if you'd rather have faster code.
  • release: pull out all the stops.
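As a sketch of that middle profile (hypothetical settings, not a concrete proposal), the default might look something like:

```toml
# "debug (default)": mild optimization, debuginfo preserved.
[profile.dev]
opt-level = 1   # roughly -O1/-Og: some inlining and minor optimizations
debug = true
```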

Probably though this conversation should be happening somewhere else. =)

Probably though this conversation should be happening somewhere else

Yeah, this is closely connected to the work on revamping profiles in Cargo.

I just benchmarked as part of updating to rayon-1.0, and it's back to a 2x slowdown, with Chain::fold appearing in the profile again. I guess the prior inlining improvement was just luck, depending on whether the functions happened to be grouped into the same codegen unit.
