Rust: Non-deterministic perf.rust-lang.org runs (e.g. syn-opt).

Created on 11 Feb 2020 · 29 comments · Source: rust-lang/rust

https://perf.rust-lang.org/detailed-query.html?commit=e369f6b617bc5124ec5d02626dc1c821589e6eb3&base_commit=4d1241f5158ffd66730e094d8f199ed654ed52ae&benchmark=syn-opt&run_name=patched%20incremental:%20println

This perf run was expected to be a no-op, but appears to have resulted in slightly fewer queries being run (2 def_span and 2 metadata reads).

cc @eddyb

A-reproducibility C-bug T-compiler

All 29 comments

This may have been happening a lot; if anyone has the time, going through the history of syn-opt runs would be useful.

It's far noisier than it should be IMO, and non-determinism might play a significant role.

Something interesting I noticed (apologies if this is already known):

image

Perhaps this is LLVM optimization related?

It seems possible that the queries are run from codegen, but since they're queries, I believe the problem is within Rust code (to my knowledge we don't have callbacks or anything like that from LLVM).

It might be a combination. LLVM is known to have trouble maintaining deterministic performance, in a similar vein to rustc hashing pointers, but in LLVM's case it can potentially even affect the output of optimizations.

I highly advise disabling at least userspace ASLR, and manually checking (by dumping /proc/self/maps from rustc, or even interned pointers such as Ty) that the heap addresses match between two runs, even with a slightly different rustc.
There's also the question of randomness sensitivity, but I suspect we don't want that even in real builds.
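
A minimal sketch (Linux-only, standalone, not rustc code) of the kind of check being suggested: dump the process's own heap mapping plus the address of a fresh allocation, so two runs can be diffed by hand, ideally with ASLR disabled (e.g. via setarch $(uname -m) -R):

```rust
use std::fs;

fn main() {
    // Linux-only: read our own memory map; compare the [heap] line between runs.
    let maps = fs::read_to_string("/proc/self/maps").expect("needs /proc (Linux)");
    for line in maps.lines().filter(|l| l.contains("[heap]")) {
        println!("{}", line);
    }

    // Stand-in for "interned Ty pointers": if allocation is deterministic and
    // ASLR is off, this address should also match between runs.
    let probe = Box::new(0u8);
    println!("probe allocation at {:p}", &*probe);
}
```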

IIRC @emberian managed to get the nondeterminism down to a bimodal (i.e. two possible profiles) situation for "irsy" (Is Rust Slim Yet), back in the day.
It's a bit of a shame that the "irsy" lessons were lost in time.

Something suspicious when looking at the query view is one fewer of each of metadata_decode_entry and def_span. That should be impossible in many situations, yet it shows up.
Something is triggering slightly different compilation and I'm dying to know what.

It's possible all benchmarks are affected and syn-opt just has the biggest swing.

@wesleywiser points out that this effect of one fewer of each of metadata_decode_entry and def_span also affects cranelift-codegen-debug clean incremental (and probably other crates as well).

This suggests to me this might be a very specific source of non-determinism in rustc or a proc macro.

I've done some local experiments and I can't cause this on nightly, even by doing this:

```sh
rm -rf target/release
RUSTFLAGS=-Cmetadata=$N cargo rustc --release -- -Zself-profile
mkdir perf-$N
mv syn-*.{events,string_{data,index}} perf-$N
```

for several values of $N, then running summarize diff and looking at the counts.


Next thing I'm going to do is build a local rustc with channel = "nightly", which will hopefully include a git hash, and then add an empty git commit to get it to change.
My assumption is that Cargo will embed that into std's -Cmetadata and that will trigger this issue.

EDIT: nope, that doesn't trigger it either, but I can keep trying.

Hm, FWIW perf does not use the nightly channel, but maybe you can try the above with pre-produced artifacts? (We can look at some PR changing unimportant things like README or triagebot.toml, if such PRs exist given rollups.)

But building locally should work too; you may need to disable ignore-git in config.toml, though.

@Mark-Simulacrum pointed out that some paths may differ, but when building two copies of syn with the same compiler, while that does result in different artifacts (since --remap-path-prefix wasn't used), the query counts stay the same.

Just realized I've been doing all my testing without CARGO_INCREMENTAL=1, so if this requires both release mode and incremental builds, I might've missed it.

I was able to reproduce this once, but I'm not sure I know the relevant factors. I had:

  • different rustc -vV output (but built from the same source)
  • different build dirs for syn (EDIT: not necessary)
  • --release mode
  • CARGO_INCREMENTAL=1

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| metadata_decode_entry | +30.06691ms | +73.60% | +37.357079ms | +63.56% | +3 | +0 | +0ns | +0ns |
| param_env | +2.88433ms | +26.61% | +5.023573ms | +29.19% | +1 | +0 | +0ns | +0ns |
| def_span | +560.266µs | +20.41% | +574.635µs | +19.42% | +1 | +0 | +0ns | +0ns |
| associated_item_def_ids | +205.239µs | +23.59% | +319.72µs | +17.13% | +1 | +0 | +0ns | +0ns |
| issue33140_self_ty | +37.4µs | +121.78% | +48.729µs | +113.77% | +1 | +0 | +0ns | +0ns |
| trait_def | +9.161µs | +10.86% | +16.081µs | +12.47% | +1 | +0 | +0ns | +0ns |

This seems to repro as well:

```sh
mkdir perf-{1,2}
rm perf-{1,2}/*
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-3-stage1 rustc --release -- -Zself-profile=perf-1
rm -rf target/release
CARGO_INCREMENTAL=1 cargo +rust-4-stage1 rustc --release -- -Zself-profile=perf-2
```

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| metadata_decode_entry | -51.316108ms | -55.44% | -71.070789ms | -54.85% | +1 | +0 | +0ns | +0ns |
| trait_def | -70.321µs | -47.56% | -102.341µs | -45.63% | +1 | +0 | +0ns | +0ns |

Those builds only differ by an empty git commit and have channel = "nightly" so the git hash shows up in rustc +rust-{3,4}-stage1 -vV.

I assume that means every crate hash is different.

Which queries are redone seems consistent for a given rustc -vV output, and I went and changed rust-4-stage1 (by adding another empty commit) and got this instead:

| Item | Self Time | Self Time Change | Time | Time Change | Item count | Cache hits | Blocked time | Incremental load time |
|-|-|-|-|-|-|-|-|-|
| param_env | +1.44951ms | +14.28% | +1.967982ms | +12.27% | +3 | +0 | +0ns | +0ns |
| metadata_decode_entry | +638.448µs | +1.62% | +752.248µs | +1.33% | +5 | +0 | +0ns | +0ns |
| def_span | +579.528µs | +25.07% | +603.46µs | +24.25% | +3 | +0 | +0ns | +0ns |
| associated_item_def_ids | +354µs | +47.50% | +364.602µs | +22.20% | +1 | +0 | +0ns | +0ns |
| specializes | +87.26µs | +22.44% | +155.46µs | +22.14% | +3 | +0 | +0ns | +0ns |
| issue33140_self_ty | +57.021µs | +226.63% | +61.661µs | +161.50% | +3 | +0 | +0ns | +0ns |
| trait_def | +16.969µs | +27.16% | +26.67µs | +25.86% | +1 | +0 | +0ns | +0ns |

So we clearly have some sort of dependency on that version, maybe via HashStable?
(Since it affects every single def path hash via the crate hashes computed by Cargo)

If we can't fix this, there's technically a potential workaround by making try builds lie about their git hash in rustc -vV output, but I don't like it.

Looks like the difference is that for that last pair of runs, the one with extra query counts uses:

This is troubling, since it suggests there's some non-determinism in specialization.


To get query keys I've added -Zself-profile-events=default,query-keys to the cargo rustc commands, then used the crox tool to get chrome_profiler.json files.

I've used commands like these to extract the differing query keys:

```sh
jq -s 'map(map(select(.name == "def_span")) | map(.args.arg0) | sort) | (.[1]-.[0])' perf-{1,2}/chrome_profiler.json
```

that example outputs:

```json
[
  "alloc::vec::{{impl}}[30]",
  "core::slice::{{impl}}[127]",
  "core::slice::{{impl}}[128]"
]
```

I've then counted the impls in those modules by hand, ignoring #[derive]s and macro invocations (had to redo some counting because I didn't realize at first that I had to skip them).

Sorting possible_siblings going into that loop seems to fix the problem, but...

This is the FxHashMap: https://github.com/rust-lang/rust/blob/2fbb07525e2f07a815e780a4268b11916248b5a9/src/librustc/traits/specialization_graph.rs#L64

And SimplifiedType only has DefId leaves (and usize for some counts, irrelevant here).

So if the SimplifiedTypes differ, that means the DefIds differ, which means there's no way to use ordering to get the same result from two different-version compilers.
If we sort by stable hash, that's definitely going to be different, since it takes crate metadata into account, which takes the compiler version into account (AFAIK).

IndexMap might work, if we know we'll be inserting in a stable ordering, but that requires e.g. the caller of insert to also be stable.

EDIT: see https://github.com/rust-lang/rust/issues/69060#issuecomment-604914176.
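
As a standalone illustration of that trade-off (using std's HashMap and the indexmap crate rather than rustc's FxHashMap, so this only models the general behaviour, it's not rustc code):

```rust
// Iteration order of a hash map follows the keys' hash values; IndexMap
// iterates in insertion order, so it stays stable as long as insertions
// happen in a stable order.
use indexmap::IndexMap;
use std::collections::HashMap;

fn main() {
    let keys = ["Vec<T>", "Box<T>", "Rc<T>", "[T]"];

    let hashed: HashMap<&str, u32> = keys.iter().copied().zip(0u32..).collect();
    let indexed: IndexMap<&str, u32> = keys.iter().copied().zip(0u32..).collect();

    // std's HashMap order can differ from run to run (randomized hasher);
    // FxHashMap's order is fixed for fixed hashes, but hashes of DefId-based
    // keys shift whenever the DefIds shift.
    println!("{:?}", hashed.keys().collect::<Vec<_>>());

    // IndexMap reproduces the insertion order, independent of the key hashes.
    assert_eq!(indexed.keys().copied().collect::<Vec<_>>(), keys);
}
```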

@michaelwoerister @Zoxc Do you know what happens to DefId allocation for crates with the same contents but different -C metadata (so different def path root hash)?
I'm trying to figure out if sorting by DefIndex makes sense, in terms of being stable across compiler version changes (assuming the compiler and standard library do not otherwise change).

EDIT: see https://github.com/rust-lang/rust/issues/69060#issuecomment-604914176.

I'm starting to believe this is the culprit: https://github.com/rust-lang/rust/blob/b9ac29169ac541a64471b83d8019cd0b442f8065/src/librustc_metadata/rmeta/encoder.rs#L1454-L1457

The impls are already in a stable order: that in which they appear within the crate.
But the DefPathHashes aren't stable across compilers, so this injects a non-deterministic order.

That is, I believe DefIndex is stable across compilers (for unchanged source), meaning that FxHashMap, while potentially a problem in theory, isn't a problem in practice.

EDIT: it does seem that removing that sort (with no other changes) fixes the non-determinism for me.
EDIT: that sort was introduced by https://github.com/rust-lang/rust/commit/77b7df33077aedfe8df903f95165872243f4acd4 in #41911, but since then, incremental compilation has changed how cross-crate information is tracked, so it should be fine to remove it IMO.
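
To make the ordering problem concrete, here's a toy model of that sort (hypothetical data, and a plain DefaultHasher standing in for the real stable hash, so this is only a sketch of the mechanism, not the encoder's actual code):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for a DefPathHash: it folds in something version-dependent
// (via the crate hash), which is exactly why it isn't stable across compilers.
fn toy_def_path_hash(version: &str, path: &str) -> u64 {
    let mut h = DefaultHasher::new();
    version.hash(&mut h);
    path.hash(&mut h);
    h.finish()
}

// Models the encoder's sort: order impls by their (version-dependent) hash.
fn encoded_order<'a>(version: &str, impls_in_source_order: &[&'a str]) -> Vec<&'a str> {
    let mut v = impls_in_source_order.to_vec();
    v.sort_by_key(|path| toy_def_path_hash(version, path));
    v
}

fn main() {
    let impls = [
        "alloc::vec::{{impl}}[30]",
        "core::slice::{{impl}}[127]",
        "core::slice::{{impl}}[128]",
    ];
    // Same source, two compilers differing only in their commit hash:
    // the hash-based order can change, while the source order never does.
    println!("{:?}", encoded_order("nightly-abc123", &impls));
    println!("{:?}", encoded_order("nightly-def456", &impls));
}
```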

@Mark-Simulacrum If we want to avoid changing the compiler at all, there might be a way to guarantee stability of DefPathHashes across compiler versions, on perf.rust-lang.org:

  1. redirect Cargo's use of rustc to a wrapper that changes the output of -vV to hide the commit hash entirely, making the before/after look identical to Cargo

    • note that the version inside rustc doesn't matter itself, since it's only what Cargo computes from rustc -vV that results in different DefPathHashes between compiler versions

  2. for std, use a Xargo-like setup (or -Z build-std, although I suspect that might be more expensive)

    • this will result in a std build that has the effects from step 1 on DefPathHashes

    • bonus: you can track the impact of compiler or std changes on std's own compile times

Unlike working around individual between-compiler-versions instabilities, this also has the advantage that anything working with DefPathHashes will have the same performance.

Combine that with disabling ASLR, and the allocation + hashing behavior should be fully stable.
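
For reference, a rough sketch of what the wrapper from step 1 could look like. This is only a sketch: REAL_RUSTC is a hypothetical environment variable pointing at the actual compiler, and real Cargo probing may need more cases handled:

```rust
use std::env;
use std::process::{exit, Command};

fn main() {
    let real = env::var("REAL_RUSTC").expect("set REAL_RUSTC to the actual rustc path");
    let args: Vec<String> = env::args().skip(1).collect();

    if args.iter().any(|a| a == "-vV") {
        // Scrub the commit information so two try builds look identical to Cargo.
        let out = Command::new(&real)
            .args(&args)
            .output()
            .expect("failed to run the real rustc");
        for line in String::from_utf8_lossy(&out.stdout).lines() {
            if line.starts_with("commit-hash:") || line.starts_with("commit-date:") {
                println!("{}: unknown", line.split(':').next().unwrap());
            } else {
                println!("{}", line);
            }
        }
        exit(out.status.code().unwrap_or(1));
    }

    // Every other invocation is forwarded untouched.
    let status = Command::new(&real)
        .args(&args)
        .status()
        .expect("failed to run the real rustc");
    exit(status.code().unwrap_or(1));
}
```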

From https://github.com/rust-lang/rust/pull/70462#issuecomment-605342285:

Nothing I've looked at has any query count differences, which is nice, but there's still a lot of noise from codegen/LLVM, so that needs to be looked into separately. I suppose I was wrong to assume that the query counts and noisy codegen/LLVM were related.

So we'd need to investigate the codegen/LLVM non-determinism separately. And that will be harder since it's all timing differences, not a discrete quantity like query counts.

@Mark-Simulacrum What might be useful is grabbing -Zself-profile data from multiple runs, with the same compiler and with compilers that only differ in version commit hash.
If the codegen/LLVM time changes show up only in the latter case, then the -Z build-std trick should help there, but if even the same compiler is noisy, we'll need to dig deeper.

EDIT: I wonder if CGU partitioning can vary wildly and that's causing unpredictable codegen/LLVM performance. -Zprint-mono-items=lazy definitely shows that rustc -vV differences lead to different CGU "names", but the output format makes it hard to determine the shape of the CGUs.

I had some spare time today and deployed the rustc -vV change -- that should now be canonical starting with 5b594d6bbb9d492633d0f48e9d6c8a864bea2f3d (currently the last commit on perf.rlo). From what I understood talking to @eddyb, we will almost certainly also need something like -Zbuild-std to sort out the nondeterminism fully, but I didn't have the time today to figure that out. Either way probably good to try and observe the effects separately, too.

IIUC, the last 3 points horizontally (on both black and blue lines) are post-https://github.com/rust-lang/rustc-perf/pull/645:
(looks pretty flat to me)
image

We're not out of the woods yet: this perf diff has a bunch of query count differences (42 for the top ones; syn-opt itself looks to be 112) and I'm not sure why it would.

Nothing in https://github.com/rust-lang/rust/compare/d2230290f7220e740ec08f4d844bf5951e1b74b8...ce1ab355c27faa84f984b15800cd6a65e674b9f2 seems like it could explain that.

One week update, it's pretty stable now, and we can clearly see changes that are several times smaller than the variance used to be:
image

So, given that the version number has a measurable impact on performance... (a) I guess this means we'll see a bump in these graphs on each version number bump, and (b) should we do some optimization to find the version number that makes the compiler go fastest? :P

@RalfJung The version gets in via -C metadata, so you could instead just XOR the resulting combined hash with some constant (but let's not, see more below).

I don't think rust-lang/cargo#8073 lets any part of the version into -C metadata, but I could be wrong.
@ehuss is the person who can answer that best.


So why do we get such wild non-determinism for syn-opt and maybe a couple other benchmarks, when -C metadata differs?

Well, it's always in LLVM. Sure, there are some query counts that vary, but those are tiny and ultimately irrelevant.

The other hint is that this happens in syn-opt but not syn-debug. So, LLVM optimizations.

One theory is that syn-opt is sensitive to CGU partitioning, more so than the average crate we track on perf.rust-lang.org. That is, LLVM gets to do more or less optimization based on which functions are instantiated in the same CGU.

However, I seem to recall from experimentation that it's not actually the case, or at least I couldn't find differences in CGU partitioning. Which leads to an even simpler explanation:

CGU ordering alone impacts the total time spent in LLVM that much.


So if you want to make rustc slightly faster, don't waste your time with the hashes.
Write a smarter CGU partitioning (or even just sorting) algorithm.
(If that's even possible, since you'd have to anticipate a lot of what LLVM might do...)
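
Purely to illustrate the "sorting" part (hypothetical types, not rustc's actual CGU code): scheduling codegen units largest-first, with a deterministic tie-break, so the longest LLVM jobs start early and the order never depends on hash-map iteration or allocation addresses:

```rust
#[derive(Debug)]
struct CodegenUnit {
    name: String,
    estimated_size: usize, // e.g. a rough count of MIR statements
}

// Largest-first, with the name as a deterministic tie-breaker.
fn schedule(mut cgus: Vec<CodegenUnit>) -> Vec<CodegenUnit> {
    cgus.sort_by(|a, b| {
        b.estimated_size
            .cmp(&a.estimated_size)
            .then_with(|| a.name.cmp(&b.name))
    });
    cgus
}

fn main() {
    let cgus = vec![
        CodegenUnit { name: "syn.1".into(), estimated_size: 120 },
        CodegenUnit { name: "syn.7".into(), estimated_size: 950 },
        CodegenUnit { name: "syn.3".into(), estimated_size: 430 },
    ];
    for cgu in schedule(cgus) {
        println!("{} ({})", cgu.name, cgu.estimated_size);
    }
}
```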

Maybe I should have made it clearer (I hoped the ":P" was enough but maybe not), but I was joking in (b) above. ;)
