The order in which code is laid out in a binary influences how fast the binary executes because (as I understand it) it affects instruction cache locality and how efficiently the code is paged in from disk. Many linkers support specifying this order (e.g. LLD via --symbol-ordering-file and MSVC via -ORDER). The hard part, though, is finding an order that will actually improve things. The chromium project has a tool for this, and somewhere else I've read that valgrind could be used for this too. The expected speedups are a few percent.
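To make the ordering-file idea concrete, here is a minimal sketch of one simple heuristic: place functions in the order in which they were first called. It assumes a trace of function-entry events is already available (for example from some form of instrumentation); the trace contents and the helper names `first_call_order` and `write_ordering_file` are hypothetical and for illustration only. The output is a text file with one symbol name per line, which is the layout LLD's --symbol-ordering-file reads. For a real Rust binary the entries would have to be the mangled symbol names.

```python
def first_call_order(trace):
    """Order symbols by when they were first called, dropping later repeats."""
    # A dict preserves insertion order, so the first occurrence of each symbol wins.
    return list(dict.fromkeys(trace))


def write_ordering_file(symbols, path):
    """Write one symbol name per line to the given path."""
    with open(path, "w", encoding="utf-8") as f:
        for sym in symbols:
            f.write(sym + "\n")


# Hypothetical trace of function entries (made-up names, not real rustc symbols).
trace = ["main", "parse_args", "main", "load_config", "parse_args", "run"]
order = first_call_order(trace)
```

Whether first-call order is actually a good layout depends on the workload; it is only one possible heuristic, and the resulting file would still need to be benchmarked.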
Prerequisites:
rustc (if using the chromium tool) similar to what GCC's -finstrument-functions does.
The first point shouldn't be too hard. The rest, however, would be a big infrastructure investment. I hope that we'll get PGO support for our CI at some point. This symbol ordering business could then be part of that.
cc @glandium @rust-lang/wg-compiler-performance @rust-lang/infra
For your reference, Git uses their integration tests as a source of PGO.
There is a missing slash at the end of the link to cygprofile (it should be https://cs.chromium.org/chromium/src/tools/cygprofile/); without it I get an error.
@ishitatsuyuki Interesting!
As an alternative to the Google tool, there is BOLT by Facebook (GitHub link).
Great find, @est31!
(This was originally typed in response to https://github.com/rust-lang/rust/issues/55137 which has been closed as a duplicate of this issue)
I think the blocker historically for BOLT/PGO/LTO has been finding CI time, especially in the case of BOLT and PGO for gathering profile data. My first question is "Can BOLT be run on a different binary from the one we've gathered data for (e.g., the stage1/bin compiler is profiled while building the stage2/bin compiler, and then the stage2/bin compiler is optimized)?" If the answer is yes, and there's still benefit from this, then my next question is "How long does BOLT take?"
If someone would be willing to do the research to answer these questions, then I think integrating this into CI would become more feasible. One good thing is that we likely don't need to worry about implementing this for all platforms at once, since AFAICT BOLT is "just" an optimization.
@Mark-Simulacrum I don't think this necessarily needs to involve CI at all. I envision these tools as useful for the artifacts that we distribute to users, rather than as an aid to rustc developers. Seems like it could just be the final step on the build servers while we're doing releases.
Well, our CI is Rust's build server, so that's why time in particular is important.
I tried BOLT with my own build, and it performed 3% better on average. This was a rough benchmark since I'm using my laptop though, so it might be just noise. (I'm probably not going to run this again until I get a workstation.)
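One way to judge whether a speedup of this size stands out from noise is to compare it with the run-to-run variation of the baseline. The sketch below does that with the standard library only. The timings are hypothetical placeholders, not measurements from this thread, and the helper names `relative_speedup` and `noise_level` are made up for illustration.

```python
import statistics


def relative_speedup(baseline, candidate):
    """Fractional reduction in mean run time of candidate versus baseline."""
    return 1.0 - statistics.mean(candidate) / statistics.mean(baseline)


def noise_level(samples):
    """Sample standard deviation as a fraction of the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Hypothetical wall-clock timings in seconds (placeholders, not real data).
baseline = [10.0, 10.4, 9.8, 10.2, 10.1]
bolted = [9.7, 9.9, 9.6, 10.0, 9.8]

speedup = relative_speedup(baseline, bolted)
noise = noise_level(baseline)
print(f"speedup: {speedup:.1%}, baseline noise: {noise:.1%}")
```

If the speedup is of the same order as the noise, as it is with these placeholder numbers, more runs on a quieter machine are needed before drawing a conclusion.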
BOLT has some caveats:
As for gathering data, maybe running the compiler on the rustc-perf benchmarks is another option? We can make use of its perf support.