Rust: Non-deterministic perf.rust-lang.org runs (e.g. syn-opt).

Created on 11 Feb 2020 · 29Comments · Source: rust-lang/rust

https://perf.rust-lang.org/detailed-query.html?commit=e369f6b617bc5124ec5d02626dc1c821589e6eb3&base_commit=4d1241f5158ffd66730e094d8f199ed654ed52ae&benchmark=syn-opt&run_name=patched%20incremental:%20println

This perf run was expected to be a no-op, but appears to have resulted in slightly less queries being run (2 def_span and 2 metadata reads).

cc @eddyb

A-reproducibility C-bug T-compiler

Source

Mark-Simulacrum

Most helpful comment

One week update, it's pretty stable now, and we can clearly see changes that are several times smaller than the variance used to be:

eddyb on 21 Apr 2020

🎉7

All 29 comments

This may have been happening a lot, if anyone has the time, going through the history of syn-opt runs would be useful.

It's far noisier than it should be IMO, and non-determinism might play a significant role.

eddyb on 11 Feb 2020

Something interesting I noticed, apologies if this is already known;

Perhaps this is LLVM optimization related?

wesleywiser on 12 Feb 2020

It seems possible that the queries are run from codegen, but since they're queries, I believe the problem is within Rust code (to my knowledge we don't have callbacks or anything like that from LLVM).

Mark-Simulacrum on 12 Feb 2020

It might be a combination. LLVM is known to have trouble maintaining deterministic performance, in a similar vein to rustc hashing pointers, but potentially even affecting output for LLVM optimziations.

I highly advise disabling at least userspace ASLR, and manually check (by dumping /proc/self/maps from rustc? or even interned e.g. Ty pointers) that the heap addresses match between two runs, even with a slightly different rustc.
There's also the question of randomness sensitivity, but I suspect we don't want that even in real builds.

IIRC @emberian managed to get the nondeterminism down to a bimodal (i.e. two possible profiles) situation for "irsy" (Is Rust Slim Yet), back in the day.
It's a bit of a shame that the "irsy" lessons were lost in time.

eddyb on 13 Feb 2020

Something suspicious when looking at the query view is one less of each metadata_decode_entry and def_span. That should be impossible in many situations, yet it shows up.
Something is triggering slightly different compilation and I'm dying to know what.

It's possible all benchmarks are affected and syn-opt just has the biggest swing.

eddyb on 19 Mar 2020

@wesleywiser is pointing out this effect of one less of each metadata_decode_entry and def_span also affects cranelift-codegen-debug clean incremental (and probably other crates as well).

This suggests to me this might be a very specific source of non-determinism in rustc or a proc macro.

eddyb on 19 Mar 2020

I've done some local experiments and I can't cause this on nightly, even by doing this:

rm -rf target/release
RUSTFLAGS=-Cmetadata=$N cargo rustc --release -- -Zself-profile
mkdir perf-$N
mv syn-*.{events,string_{data,index}} perf-$N

for several values of $N, then running summarize diff and looking at the counts.

Next thing I'm going to do is build a local rustc with channel = "nightly", which will hopefully include a git hash, and then add an empty git commit to get it to change.
My assumption is that Cargo will embed that into std's -Cmetadata and that will trigger this issue.

EDIT: nope, that doesn't trigger it either, but I can keep trying.

eddyb on 26 Mar 2020

Hm, FWIW perf does not use nightly channel, but maybe you can try the above we pre-produced artifacts? (We can look at some PR changing unimportant things like README or triagebot.toml, if they exist given rollups).

But building locally should work too, you may need to disable ignore-git though in config.toml.

Mark-Simulacrum on 26 Mar 2020

@Mark-Simulacrum pointed out that some paths may differ, but building two copies of syn with the same compiler, while it does result in different artifacts (since --remap-path-prefix wasn't used), the query counts do stay the same.

eddyb on 26 Mar 2020

Just realized I've been doing all my testing without CARGO_INCREMENTAL=1, so if this requires both release mode and incremental builds, I might've missed it.

eddyb on 26 Mar 2020

I was able to reproduce this ~~once, but I'm not sure I know the relevant factors~~, I had:

different rustc -vV output (but built from the same source)
~~different build dirs for syn~~ (EDIT: not necessary)
--release mode
CARGO_INCREMENTAL=1

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| metadata_decode_entry | +30.06691ms | +73.60% | +37.357079ms | +63.56% | +3 | +0 | +0ns | +0ns |
| param_env | +2.88433ms | +26.61% | +5.023573ms | +29.19% | +1 | +0 | +0ns | +0ns |
| def_span | +560.266µs | +20.41% | +574.635µs | +19.42% | +1 | +0 | +0ns | +0ns |
| associated_item_def_ids | +205.239µs | +23.59% | +319.72µs | +17.13% | +1 | +0 | +0ns | +0ns |
| issue33140_self_ty | +37.4µs | +121.78% | +48.729µs | +113.77% | +1 | +0 | +0ns | +0ns |
| trait_def | +9.161µs | +10.86% | +16.081µs | +12.47% | +1 | +0 | +0ns | +0ns |

eddyb on 26 Mar 2020

This seems to repro as well:

mkdir perf-{1,2}
rm perf-{1,2}/*
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-3-stage1 rustc --release -- -Zself-profile=perf-1
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-4-stage1 rustc --release -- -Zself-profile=perf-2

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| metadata_decode_entry | -51.316108ms | -55.44% | -71.070789ms | -54.85% | +1 | +0 | +0ns | +0ns |
| trait_def | -70.321µs | -47.56% | -102.341µs | -45.63% | +1 | +0 | +0ns | +0ns |

Those builds only differ by an empty git commit and have channel = "nightly" so the git hash shows up in rustc +rust-{3,4}-stage1 -vV.

I assume that means every crate hash is different.

eddyb on 26 Mar 2020

Which queries are redone seems consistent for a given rustc -vV output, and I went and changed rust-4-stage1 (by adding another empty commit) and got this instead:

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| param_env | +1.44951ms | +14.28% | +1.967982ms | +12.27% | +3 | +0 | +0ns | +0ns |
| metadata_decode_entry | +638.448µs | +1.62% | +752.248µs | +1.33% | +5 | +0 | +0ns | +0ns |
| def_span | +579.528µs | +25.07% | +603.46µs | +24.25% | +3 | +0 | +0ns | +0ns |
| associated_item_def_ids | +354µs | +47.50% | +364.602µs | +22.20% | +1 | +0 | +0ns | +0ns |
| specializes | +87.26µs | +22.44% | +155.46µs | +22.14% | +3 | +0 | +0ns | +0ns |
| issue33140_self_ty | +57.021µs | +226.63% | +61.661µs | +161.50% | +3 | +0 | +0ns | +0ns |
| trait_def | +16.969µs | +27.16% | +26.67µs | +25.86% | +1 | +0 | +0ns | +0ns |

So we clearly have some sort of dependency on that version, maybe via HashStable?
(Since it affects every single def path hash via the crate hashes computed by Cargo)

If we can't fix this, there's technically a potential workaround by making try builds lie about their git hash in rustc -vV output, but I don't like it.

eddyb on 26 Mar 2020

Looks like the difference is that for that last pair of runs, the one with extra query counts uses:

associated_item_def_ids for core::slice::SlicePartialEq
trait_def for core::slice::BytewiseEquality
def_span, param_env and issue33140_self_ty for:
- core::slice::{{impl}}[127]: https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5793-L5795
- core::slice::{{impl}}[128]: https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5807-L5809
- alloc::vec::{{impl}}[30]: https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/liballoc/vec.rs#L2004-L2006
specializes for:
- (alloc::string::{{impl}}[16],alloc::string::{{impl}}[77]): https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/liballoc/string.rs#L1854 ($lhs = String, $rhs = str) https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/liballoc/string.rs#L1869
- (alloc::string::{{impl}}[77],alloc::string::{{impl}}[16])
- (alloc::string::{{impl}}[79],alloc::string::{{impl}}[16])
- (alloc::string::{{impl}}[79],alloc::string::{{impl}}[77])
- (alloc::string::{{impl}}[86],alloc::string::{{impl}}[16])
- (alloc::string::{{impl}}[86],alloc::string::{{impl}}[77])
- ~~I haven't looked up any of the above~~ (EDIT: they seem to all be PartialEq impls, with impl_eq! invocations producing {{impl}}[77] to {{impl}}[86], inclusive)
- (alloc::vec::{{impl}}[30],alloc::vec::{{impl}}[31]): https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/liballoc/vec.rs#L2004-L2006 https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/liballoc/vec.rs#L2035-L2037
- (core::slice::{{impl}}[127],core::slice::{{impl}}[129]): https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5793-L5795 https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5825-L5827
- (core::slice::{{impl}}[128],core::slice::{{impl}}[129]) https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5807-L5809 https://github.com/rust-lang/rust/blob/f4c675c476c18b1a11041193f2f59d695b126bc8/src/libcore/slice/mod.rs#L5825-L5827

This is troubling, since it suggests there's some non-determinism in specialization.

To get query keys I've added -Zself-profile-events=default,query-keys to the cargo rustc commands then used the crox tool to get chrome_profiler.json files.

I've used commands like these to extract the differing query keys:

jq -s 'map(map(select(.name == "def_span")) | map(.args.arg0) | sort) | (.[1]-.[0])' perf-{1,2}/chrome_profiler.json

that example outputs:

[
  "alloc::vec::{{impl}}[30]",
  "core::slice::{{impl}}[127]",
  "core::slice::{{impl}}[128]"
]

I've then counted the impls in those modules by hand, ignoring #[derive]s and macro invocations (had to redo some counting because I didn't realize at first that I had to skip them).

eddyb on 26 Mar 2020

This iterates over an FxHashMap:

https://github.com/rust-lang/rust/blob/2fbb07525e2f07a815e780a4268b11916248b5a9/src/librustc_trait_selection/traits/specialize/specialization_graph.rs#L217

jonas-schievink on 26 Mar 2020

I guess the order this loop iterates in matters because of side-effects? https://github.com/rust-lang/rust/blob/2fbb07525e2f07a815e780a4268b11916248b5a9/src/librustc_trait_selection/traits/specialize/specialization_graph.rs#L95

eddyb on 26 Mar 2020

Sorting possible_siblings going into that loop seems to fix the problem, but...

This is the FxHashMap: https://github.com/rust-lang/rust/blob/2fbb07525e2f07a815e780a4268b11916248b5a9/src/librustc/traits/specialization_graph.rs#L64

And SimplifiedType only has DefId leaves (and usize for some counts, irrelevant here).

So if SimplifiedType differ, that means the DefId differs, that means there's no way to use ordering to get the same result from two different-version compilers.
If we sort by stable hash, that's definitely going to be different since it takes crate metadata into account, which takes compiler version into account (AFAIK).

~~IndexMap might work, if we know we'll be inserting in a stable ordering, but that requires e.g. the caller of insert to also be stable.~~

EDIT: see https://github.com/rust-lang/rust/issues/69060#issuecomment-604914176.

eddyb on 27 Mar 2020

As I was afraid, there's explicit sorting upstream, and I'm not sure it helps: https://github.com/rust-lang/rust/blob/2fbb07525e2f07a815e780a4268b11916248b5a9/src/librustc_trait_selection/traits/specialize/mod.rs#L308-L318

eddyb on 27 Mar 2020

@michaelwoerister @Zoxc Do you know what happens to DefId allocation for crates with the same contents but different -C metadata (so different def path root hash)?
I'm trying to figure out if sorting by DefIndex makes sense, in terms of being stable across compiler version changes (assuming the compiler and standard library do not otherwise change).

EDIT: see https://github.com/rust-lang/rust/issues/69060#issuecomment-604914176.

eddyb on 27 Mar 2020

I'm starting to believe this is the culprit: https://github.com/rust-lang/rust/blob/b9ac29169ac541a64471b83d8019cd0b442f8065/src/librustc_metadata/rmeta/encoder.rs#L1454-L1457

The impls are already in a stable order: that in which they appear within the crate.
But the DefPathHashes aren't stable across compilers, so this injects a non-deterministic order.

That is, I believe DefIndex is stable across compilers (for unchanged source), meaning that FxHashMap, while potentially a problem in theory, isn't a problem in practice.

EDIT: it does seem that removing that sort (with no other changes) fixes the non-determinism for me.
EDIT: that sort was introduced by https://github.com/rust-lang/rust/commit/77b7df33077aedfe8df903f95165872243f4acd4 in #41911, but since then, incremental compilation has changed how cross-crate information is tracked, so it should be fine to remove it IMO.

eddyb on 27 Mar 2020

🚀1

@Mark-Simulacrum If we want to avoid changing the compiler at all, there might be a way to guarantee stability of DefPathHashes across compiler versions, on perf.rust-lang.org:

redirect Cargo's use of rustc to a wrapper that changes the output of -vV to hide the commit hash entirely, making the before/after look identical to Cargo
- note that the version inside rustc doesn't matter itself, since it's only what Cargo computes from rustc -vV that results in different DefPathHashes between compiler versions
for std, use a Xargo-like setup (or -Z build-std, although I suspect that might be more expensive)
- this will result in a std build that has the the effects from 1. on DefPathHashes
- bonus: you can track the impact on compiler or std changes on std's own compile times

Unlike working around individual between-compiler-versions instabilities, this also has the advantage that anything working with DefPathHashes will have the same performance.

Combine that with disabling ASLR, and the allocation + hashing behavior should be fully stable.

eddyb on 27 Mar 2020

From https://github.com/rust-lang/rust/pull/70462#issuecomment-605342285:

Nothing I've looked at has any query count differences, which is nice, but there's still a lot of noise from codegen/LLVM, so that needs to be looked into separately. I suppose I was wrong to assume that the query counts and noisy codegen/LLVM were related.

So we'd need to investigate the codegen/LLVM non-determinism separately. And that will be harder since it's all timing differences, not a discrete quantity query counts.

@Mark-Simulacrum What might be useful is grabbing -Zself-profile data from multiple runs, with the same compiler and with compilers that only differ in version commit hash.
If the codegen/LLVM time changes show up only in the latter case, then the -Z build-std trick should help there, but if even the same compiler is noisy, we'll need to dig deeper.

EDIT: I wonder if CGU partitioning can vary wildly and that's causing unpredictable codegen/LLVM performance. -Zprint-mono-items=lazy definitely shows that rustc -vV differences lead to different CGU "names", but the output format makes it hard to determine the shape of the CGUs.

eddyb on 27 Mar 2020

I had some spare time today and deployed the rustc -vV change -- that should now be canonical starting with 5b594d6bbb9d492633d0f48e9d6c8a864bea2f3d (currently the last commit on perf.rlo). From what I understood talking to @eddyb, we will almost certainly also need something like -Zbuild-std to sort out the nondeterminism fully, but I didn't have the time today to figure that out. Either way probably good to try and observe the effects separately, too.

Mark-Simulacrum on 28 Mar 2020

IIUC, the last 3 points horizontally (on both black and blue lines) are post-https://github.com/rust-lang/rustc-perf/pull/645:
(looks pretty flat to me)

eddyb on 16 Apr 2020

We're not out of the water yet, this perf diff has a bunch of query count differences (42 for the top ones, syn-opt itself looks to be 112) and I'm not sure why it would.

Nothing in https://github.com/rust-lang/rust/compare/d2230290f7220e740ec08f4d844bf5951e1b74b8...ce1ab355c27faa84f984b15800cd6a65e674b9f2 seems like it could explain that.

eddyb on 16 Apr 2020

One week update, it's pretty stable now, and we can clearly see changes that are several times smaller than the variance used to be:

eddyb on 21 Apr 2020

🎉7

So, given that the version number has a measurable impact on performance... (a) I guess this means we'll see a bump in these graphs on each version number bump, and (b) should we do some optimization to find the version number that makes the compiler go fastest? :P

RalfJung on 21 Apr 2020

😄1

@RalfJung The version gets in via -C metadata, so you could instead just XOR the resulting combined hash with some constant (but let's not, see more below).

I don't think rust-lang/cargo#8073 lets any part of the version into -C metadata, but I could be wrong.
@ehuss is the person who can answer that best.

So why do we get such wild non-determinism for syn-opt and maybe a couple other benchmarks, when -C metadata differs?

Well, it's always in LLVM. Sure, there's some query counts that vary, but those are tiny and ultimately irrelevant.

The other hint is that this happens in syn-opt but not syn-debug. So, LLVM optimizations.

One theory is that syn-opt is sensitive to CGU partitioning, and more so than the average crate we track on perf.rust-lang.org. That is, LLVM gets to do more or less optimizations based on which functions are instantiated in the same CGU.

However, I seem to recall from experimentation that it's not actually the case, or at least I couldn't find differences in CGU partitioning. Which leads to an even simpler explanation:

CGU ordering alone impacts the total time spent in LLVM that much.

So if you want to make rustc slightly faster, don't waste your time with the hashes.
Write a smarter CGU partitioning (or even just sorting) algorithm.
(If that's even possible, since you'd have to anticipate a lot of what LLVM might do...)

eddyb on 21 Apr 2020

Maybe I should have made it clearer (I hoped the ":P" was enough but maybe not), but I was joking in (b) above. ;)

RalfJung on 21 Apr 2020

Was this page helpful?

0 / 5 - 0 ratings