Rust: 32 codegen units may not always be better at -O0

Created on 30 Sep 2017 · 21 comments · Source: rust-lang/rust

Continuing discussion from https://github.com/rust-lang/rust/pull/44853#issuecomment-332248240

Labels: T-compiler, regression-from-stable-to-nightly

All 21 comments

@mersinvald unfortunately I can't read the perf files you mentioned locally, and the perf diff output isn't as illuminating as I hoped it would be.

I wonder, could you try investigating if one crate in particular takes much longer? Maybe take some of the crates that take longest to compile (sorting the rlibs by size is typically a rough measurement for this) and see how long they take?

I'm hoping that this is one crate hitting an accidentally pathological case, but it may also just be a case where everything slows down with 32 codegen units (even though we don't intend that).

One other possibility may be the linker (e.g. ld)? In theory, though, that should show up as the final crate taking much longer to finish with 32 cgus than with 2.
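For concreteness, timings like the ones reported below can be collected with a small harness along these lines (a minimal sketch, assuming `cargo` is on `PATH` and the crate under test is the current directory; the commenters' actual scripts may have differed):

```rust
use std::process::Command;
use std::time::Instant;

fn main() {
    for &cgus in &[1, 2, 4, 8, 16, 32] {
        // Clean so every run measures a full build of the whole crate graph.
        let ok = Command::new("cargo").arg("clean").status().expect("failed to run cargo clean");
        assert!(ok.success());

        let start = Instant::now();
        let status = Command::new("cargo")
            .arg("build")
            // -C codegen-units=N is the knob being benchmarked in this thread.
            .env("RUSTFLAGS", format!("-C codegen-units={}", cgus))
            .status()
            .expect("failed to run cargo build");
        assert!(status.success());

        println!("{:>2} cgus: {:.2} s", cgus, start.elapsed().as_secs_f64());
    }
}
```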

I tested compiling regex with different numbers of codegen units on my 2-core Pentium E2200.
Here are the results:

| codegen-units | Time (s) |
|-|-|
| 1 | 33.4 |
| 2 | 28.76 |
| 4 | 28.90 |
| 8 | 29.57 |
| 16 | 29.97 |
| 32 | 31.0 |

It gets a decent win when the number of codegen units equals the number of cores, but gets increasingly slower with more than that.

Here are the results for rustfmt. These are a bit wilder: it's actually faster at 4 units than at 2, but gets increasingly slower after that, to the point where 32 units is actually slower than 1 unit.

| codegen-units | Time (s) |
|-|-|
| 1 | 177.31 |
| 2 | 172.14 |
| 4 | 167.91 |
| 8 | 170.75 |
| 16 | 175.80 |
| 32 | 179.49 |

Testing with winapi 0.3 on a Ryzen 7 1700X.

| codegen-units | Time (s) |
|-|-|
| 1 | 15.17 |
| 2 | 14.63 |
| 4 | 15.18 |
| 8 | 15.44 |
| 16 | 14.68 |
| 32 | 14.89 |
| 64 | 15.22 |
| 128 | 15.22 |
| 256 | 15.92 |
| 512 | 17.34 |
| 1024 | 20.24 |
| 2048 | 27.12 |
| 4096 | 40.74 |
| 8192 | 69.23 |

There is a pretty noticeable change in compile times on perf.rust-lang.org with the commit that changed the default number of codegen units as well.

I'm going to tag this as a regression until we can figure out what the spike is.

Ok, looking more into the data from perf.rust-lang.org: we're measuring instructions executed and a "cpu-clock" statistic. Instruction counts certainly don't have much bearing on wall time (though they serve as great indicators of when to look for a regression), and apparently cpu-clock is not the same as wall time either. I believe the cpu-clock stat is the total time summed across all threads on all cpus.

I'm trying to rationalize the numbers we're seeing and it's not quite making sense. The regressions shown on perf.rust-lang.org imply to me that if I pin the process to one CPU we'll see that much increase in compile time, but I see virtually no increase in compile time locally. In fact, if I pin cargo/rustc to one CPU it typically gets faster with more cgus presumably because the objects are a little quicker to create overall.

Basically I'm having a lot of difficulty reproducing any slowdowns here. Assistance investigating why things are slowing down as opposed to just "things are slowing down" would be much appreciated.

I believe these are all the reported regressions from this change

Some more investigation

I found locally I was able to reproduce some slowdowns with the syntex_syntax crate:

| cgus | 1 | 2 | 4 | 8 | 16 | 32 |
|----------|----|--|--|--|----|---|
| wall time | 32.9 | 24.9 | 22.8 | 21.4 | 22.0 | 21.8 |
| cpu-clock | 33064 | 35218 | 40584 | 44581 | 48542 | 51788 |
| 1 core wall time | 34.3 | 35.2 | 36.4 | 38.1 | 40.3 | 41.7 |

Here "wall time" is the wall time it took to compile the syntex_syntax crate on an 8 core cpu. The "cpu-clock" is the statistic reported by perf state, and "1 core wall time" is the amount of time the compilation takes when pinned to one cpu (taskset -c 0).

There's a clear trend here: parallelization is helping, but it's also eating more CPU time. Namely, increasing the number of codegen units doesn't purely divide the work; it also increases the total amount of work being done (at 32 cgus, 51.8 s of cpu-clock packed into 21.8 s of wall time means on average only ~2.4 of the 8 cores were busy). This becomes particularly noticeable with more codegen units limited to one core on this big crate.

So on one hand there's certainly some overhead from more codegen units: we're shuffling more files around (not copying, just working with them), spawning more threads, passing more messages, doing more synchronization, etc. I don't think those factors would cause such a drastic increase in compile times, however. Instead, what I think may be happening here (and have just now remembered) is how we treat `#[inline]`.

@michaelwoerister correct me if I'm wrong, but I believe that we'll instantiate any `#[inline]` function into every codegen unit instead of just one. This means that if you're using `String` you're probably going to get a copy of each of its functions in all of your codegen units. That seems to me like the root cause of the timing regressions here.
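As a tiny illustration (hypothetical code, not from the thread): under the behavior described above, machine code for the `#[inline]` function below is emitted into every codegen unit that calls it, much like a generic function, rather than being compiled once and linked:

```rust
// With many CGUs at -O0, each codegen unit that references `add_one`
// gets its own copy of its machine code, so 32 CGUs can mean up to
// 32 instantiations of the same function body.
#[inline]
pub fn add_one(x: u32) -> u32 {
    x + 1
}

// Generic functions already behave this way: `first` is instantiated
// in each codegen unit (and for each concrete type) that uses it.
pub fn first<T: Copy>(xs: &[T]) -> T {
    xs[0]
}

fn main() {
    println!("{} {}", add_one(41), first(&[1, 2, 3]));
}
```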

As a result, my conclusion is that increasing the number of codegen units increases the amount of work the compiler does, and parallelism isn't always enough to recover that extra work. In the worst case, with N codegen units you can probably do N times more work, and with fewer than N cores you'll notice the increase in compile times.

@michaelwoerister do you agree with the conclusions here? If so, do you think we could treat `#[inline]` differently in debug mode? Is it worth solving this? Can you think of another possible solution?

I'm personally hesitant to set the number of codegen units based on the number of cores on the machine. That may yield optimal performance (perhaps 2x the number of cores), since the extra work needed for each codegen unit should come out in the wash thanks to parallelization. It would, however, make builds nondeterministic across machines with different numbers of cpus, forcing deterministic compiles to explicitly specify the number of codegen units.

My personal preference for a solution right now is:

  • Don't inline `#[inline]` into all codegen units in debug mode; only inline into one codegen unit.
  • Reduce the number of codegen units from 32 to 16 if we continue to see problems. I think most machines compiling Rust today have fewer than 16 cores, so we should still get most of the parallelism benefit, while also reducing the extra cost of work done per codegen unit.

@alexcrichton, yes, I think your assumptions and conclusions are correct. I also think that `#[inline]` functions are the main source of work duplication here. There's also debuginfo, which cannot be shared between codegen units, but that is usually rather cheap afaik.

> Don't inline `#[inline]` into all codegen units in debug mode; only inline into one codegen unit.

We probably mean the same thing here: I would suggest treating them the same as regular polymorphic functions:

  • Instantiate only "on-demand", and
  • have exactly one instance and link to that.

That might require some special handling in the symbol hash to avoid conflicts between crates, just as we do for generic functions at the moment:
https://github.com/rust-lang/rust/blob/ed1cffdb21220f633b8b91f099a69fab4ef2b8f4/src/librustc_trans/back/symbol_names.rs#L194-L201

If performance turns out to be unbearable, we could maybe have a heuristic to do some inlining (a toy sketch follows this list):

  • if the function body is small, and
  • if the function does not call other inline functions.
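A purely illustrative sketch of that heuristic (hypothetical types and names, not actual rustc internals):

```rust
// Toy model of a function body, just enough to express the heuristic.
struct FnBody {
    statements: usize,
    calls_other_inline_fns: bool,
}

// Duplicate an #[inline] function into referencing CGUs only if its
// body is small and it doesn't call other inline functions.
fn worth_duplicating(body: &FnBody, size_threshold: usize) -> bool {
    body.statements <= size_threshold && !body.calls_other_inline_fns
}

fn main() {
    let tiny_leaf = FnBody { statements: 3, calls_other_inline_fns: false };
    let large = FnBody { statements: 200, calls_other_inline_fns: true };
    assert!(worth_duplicating(&tiny_leaf, 16));
    assert!(!worth_duplicating(&large, 16));
}
```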

> Reduce the number of codegen units from 32 to 16 here if we continue to see problems.

Seems like a sensible approach.

Other solutions?

I wonder if ThinLTO would allow us to completely get rid of duplicating bodies of inline functions across codegen units during trans. That would be awesome for incremental compilation and it would also help in the non-incremental case.

Some excellent news!

I implemented the strategy of "only instantiate `#[inline]` functions into one codegen unit" and the updated timings I get are:

| cgus | 1 | 2 | 4 | 8 | 16 | 32 |
|----------|----|--|--|--|----|---|
| wall time | 30.7 | 25.2 | 21.0 | 19.0 | 18.4 | 18.7 |
| cpu-clock | 30864 | 31387 | 31876 | 32525 | 34341 | 35370 |
| 1 core wall time | 31.2 | 30.8 | 30.8 | 30.6 | 30.7 | 30.6 |

In other words, I think this makes a huge dent in the "more cgus causes rustc to do more work" problem. I'm going to send a PR shortly implementing this.

@michaelwoerister I also think that you're correct in that ThinLTO should eliminate the need for inlining functions into all codegen units unconditionally. Just gotta land ThinLTO and test it out first :)

Ok I've created a PR at https://github.com/rust-lang/rust/pull/45075

@mersinvald and @crumblingstatue, if y'all could test out that PR locally that'd be great! My hope is that it should make 32 codegen units continue to be better than 1, and perhaps give us flexibility to increase the number of cgus in the future.

@mersinvald can you try re-testing with tonight's nightly which should have some improvements?

@alexcrichton sure, will do it tomorrow.

@alexcrichton I've run the tests on rustc 1.22.0-nightly (a47c9f870 2017-10-11).

I've also measured the hot build time, i.e. only the 200-LOC crate itself + link time.

Cold build

| codegen-units | Runs (s) | Avg (s) |
|-|-|-|
| 4 | 63.13, 63.60, 62.84 | 63.19 |
| 32 | 69.70, 70.32, 69.76 | 69.92 |

Hot build

| codegen-units | Runs (s) | Avg (s) |
|-|-|-|
| 4 | 5.39, 5.28, 5.28 | 5.31 |
| 32 | 6.70, 6.10, 6.10 | 6.30 |

@mersinvald ok I've done some further investigation into your benchmark. The setup was:

Results:

Build from scratch

Here I compiled the positions-api crate from scratch with a clean target directory with various settings

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|------|---|---|---|---|----|----|----|
| debug #0 | 43.368 | 37.003 | 34.790 | 34.254 | 34.376 | 34.287 | 34.849 |
| debug #1 | 43.911 | 37.401 | 34.974 | 33.668 | 34.612 | 34.199 | 34.904 |
| release thinlto=no #0 | 91.289 | 70.930 | 63.766 | 59.373 | 57.378 | 56.717 | 56.278 |
| release thinlto=no #1 | 91.516 | 71.976 | 63.714 | 58.143 | 58.023 | 57.228 | 57.023 |
| release thinlto=yes #0 | 90.799 | 96.447 | 91.650 | 80.653 | 81.806 | 82.241 | 85.829 |
| release thinlto=yes #1 | 90.615 | 97.803 | 89.126 | 83.018 | 81.862 | 82.962 | 84.413 |

There's a suspicious jump at 2 CGUs where compilation takes longer (with ThinLTO), but it quickly flattens out, the time improving as more codegen units are thrown at it. Additionally, in debug mode we see wins across the board as soon as multiple codegen units are enabled.

Incremental build

Here I compiled the positions-api crate with a preexisting target directory. For these timings only the positions-api crate was rebuilt, effectively measuring the time it takes to compile just that one crate.

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|------|---|---|---|---|----|----|----|
| debug #0 | 5.147 | 3.945 | 3.326 | 3.129 | 3.257 | 3.265 | 3.604 |
| debug #1 | 5.310 | 3.927 | 3.399 | 3.148 | 3.308 | 3.228 | 3.415 |
| release thinlto=no #0 | 14.114 | 11.626 | 7.808 | 6.741 | 5.145 | 4.751 | 4.438 |
| release thinlto=no #1 | 14.032 | 11.648 | 7.865 | 6.635 | 4.871 | 4.657 | 4.421 |
| release thinlto=yes #0 | 14.121 | 15.107 | 11.088 | 8.836 | 6.876 | 6.368 | 6.366 |
| release thinlto=yes #1 | 14.054 | 14.960 | 10.448 | 7.283 | 7.119 | 6.345 | 6.280 |

Here, like before, debug builds leveled off pretty quickly, benefiting a good amount from multiple CGUs. Release builds didn't fare so well after 8 CGUs, but still show improvements across the board over the original build time with just one CGU. Again, though, there's a jump at 2 CGUs in release mode which is a little suspicious.


@mersinvald can you try running with that script and similar settings to guarantee that the only difference in the data collection here is hardware?

Are the incremental crate-only builds 10-15x slower than full non-incremental builds?

oh oops, I mixed up the tables

@alexcrichton
I've run tests with the exact same revisions of the compiler and circles-server.

rustc 1.22.0-nightly (4e9527cf6 2017-10-16)
4 logical cores (2 physical cores with hyperthreading), Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
SSD + 8GB RAM
Arch Linux

Incremental build (rebuilding just the final crate):

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|------|---|---|---|---|----|----|----|
| debug #0 | 7.197 | 5.804 | 5.561 | 5.512 | 5.766 | 5.832 | 6.607 |
| debug #1 | 7.170 | 5.834 | 5.533 | 5.490 | 5.782 | 5.913 | 6.277 |
| release thinlto=no #0 | 18.490 | 15.631 | 13.683 | 10.496 | 9.092 | 9.368 | 9.457 |
| release thinlto=no #1 | 18.496 | 15.671 | 13.846 | 9.143 | 9.124 | 9.391 | 9.496 |
| release thinlto=yes #0 | 18.502 | 20.199 | 18.267 | 12.506 | 13.133 | 13.334 | 13.704 |
| release thinlto=yes #1 | 18.647 | 20.059 | 15.457 | 12.428 | 12.731 | 13.368 | 13.660 |

Build from scratch:

| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
|------|---|---|---|---|----|----|----|
| debug #0 | 77.231 | 72.293 | 69.728 | 70.492 | 71.270 | 72.801 | 73.855 |
| debug #1 | 76.678 | 70.609 | 70.246 | 69.892 | 72.097 | 72.374 | 73.683 |
| release thinlto=no #0 | 162.873 | 139.603 | 130.557 | 128.175 | 130.154 | 131.654 | 132.660 |
| release thinlto=no #1 | 160.766 | 140.791 | 129.430 | 126.258 | 130.905 | 132.073 | 134.501 |
| release thinlto=yes #0 | 159.064 | 184.526 | 179.635 | 180.172 | 184.242 | 189.370 | 192.278 |
| release thinlto=yes #1 | 157.843 | 181.420 | 180.135 | 178.202 | 186.637 | 191.419 | 192.772 |

Awesome, thanks @mersinvald!

So reading that, I'd conclude that in debug mode 16 cgus is a universal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

Otherwise, it looks like, for the proposal of turning on multiple cgus by default in release mode, 16 cgus is universally better than 1 cgu in compile time if you only compile the last crate. If you compile the entire crate graph, it looks like multiple cgus regress the build time?

> So reading that, I'd conclude that in debug mode 16 cgus is a universal win over 1 cgu (although slightly worse than 8/4 cgus in some cases). In that sense, I think we can close this issue?

👍

Ok!
