Creating an issue for https://internals.rust-lang.org/t/help-test-out-thinlto/6017/3 for more investigation.
cc @cristicbz
We should definitely re-investigate this with different settings for -Z inline-in-all-cgus since I don't think that existed yet when the initial benchmark was done.
Ok here's what I did:
rustc 1.22.0-nightly (4e9527cf6 2017-10-16)Build entire project from scratch
Here I just ran cargo build from a completely clean target directory and timed how long it took.
| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| ------|---|----|--|--|--|------|---|
| debug #1 | 79.414 | 68.783 | 64.883 | 65.293 | 64.567 | 64.743 | 66.243
| debug #2 | 80.465 | 70.181 | 65.977 | 64.997 | 65.252 | 65.869 | 66.912
| release thinlto=no #1 | 175.212 | 131.287 | 117.496 | 114.854 | 114.018 | 114.566 | 114.391
| release thinlto=no #2 | 171.829 | 131.514 | 119.942 | 115.897 | 112.881 | 113.379 | 114.711
| release thinlto=yes #1 | 172.188 | 174.210 | 161.224 | 159.588 | 154.327 | 154.382 | 161.319
| release thinlto=yes #2 | 171.179 | 172.778 | 160.880 | 156.812 | 157.029 | 156.233 | 156.742
The conclusion from this is that I get clear wins across the board on my machine for using more CGUs. The sweet spot appears around 16 CGUs, and I don't see any clear/obvious regressions in this area for using multiple CGUs.
Build project after changing the math crate
Here I changed the math crate to force a recompilation of that crate and up. All other dependencies were cached
| cgus | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
| ------|---|----|--|--|--|------|---|
| debug #1 | 18.573 | 14.590 | 12.318 | 11.827 | 11.733 | 11.936 | 12.206
| debug #2 | 18.558 | 14.533 | 12.246 | 11.904 | 11.659 | 11.991 | 12.250
| release thinlto=no #1 | 59.602 | 35.658 | 23.906 | 22.403 | 21.920 | 21.862 | 21.932
| release thinlto=no #2 | 60.016 | 35.438 | 24.806 | 22.935 | 21.835 | 21.914 | 21.902
| release thinlto=yes #1 | 59.449 | 46.711 | 33.257 | 33.333 | 31.071 | 31.026 | 31.190
| release thinlto=yes #2 | 60.491 | 47.160 | 33.320 | 33.021 | 31.036 | 30.964 | 30.347
Here we see even more drastic wins (as is expected) and again no clear regressions from one CGU.
@michaelwoerister
I haven't done much data collection of inline-in-all-cgus in release mode, but @sfackler posted some worrisome numbers which indicate that -Z inline-in-all-cgus=no actually regresses release build time. Presumably it shifts more work to post-LTO than the work takes in pre-LTO?
In the thread about testing ThinLTO some main conclusions I had was:
#[inline] in release mode.With that in mind I'm sort of tempted to close this and keep pushing hard on #45320, but I'm curious if others can reproduce my own results!
I'm doing another run with the newest nightly, more values for the number of cgus and inline-in-all-cgus. I'll report back for posterity's sake.
I've personally never been able to reproduce any regressions on my machine, I've always seen wins across the board
I want that engraved on my tombstone :P
Fair point!
So using your script (thanks btw!), this what I've got (with rustc 1.22.0-nightly (4e9527cf6 2017-10-16)):
|config / cgus|1|2|4|8|16|32|
|-|-|-|-|-|-|-|
|fresh, debug #1|84.241|78.396|66.445|64.908|66.476|67.808|67.748|
|fresh, debug #2|84.017|70.373|66.309|66.053|65.734|68.415|68.423|
|fresh, release #1|164.202|135.773|129.870|122.565|114.855|116.007|126.889|
|fresh, release #2|163.273|143.971|121.138|124.187|113.137|115.537|116.154|
|fresh, release, thinlto #1|169.568|171.795|163.108|178.536|156.117|158.511|157.820|
|fresh, release, thinlto #2|165.670|176.277|162.612|173.914|155.648|165.749|164.328|
|math, debug #1|20.781|15.920|14.698|14.141|14.056|14.349|14.977|
|math, debug #2|20.737|15.901|14.717|14.056|14.101|14.400|14.708|
|math, release #1|64.861|37.807|25.958|25.829|23.400|23.247|23.679|
|math, release #2|64.856|37.211|25.930|26.070|23.292|23.264|23.413|
|math, release, thinlto #1|62.796|51.280|35.819|36.367|32.440|32.596|32.907|
|math, release, thinlto #2|71.395|50.464|36.009|35.762|32.796|33.840|33.499|
This results match yours more closely than before, with modest or no improvements in the CGU 8-32 regime for the full rebuild and significant speed-ups in the single crate case. There is no difference in runtime benchmarks.
So what happened before? I'll put it down to thermal throttling: all runs took much longer, and maybe with more cores in use things got hotter. Sorry for the noise, it sounds like you can close this issue!
Ok awesome! Thanks so much for gathering that data! In that case I'll...
Most helpful comment
I'm doing another run with the newest nightly, more values for the number of
cgusandinline-in-all-cgus. I'll report back for posterity's sake.I want that engraved on my tombstone :P