Summary
@exterkamp brought up some inconsistencies in our ULTRADUMB™ benchmark. We need to fix these before landing #6162.
After discussion and further investigation at a local Best Buy, the inconsistencies he found come down to the fact that Chrome on Windows criminally underperforms on this specific benchmark compared to Edge and similarly spec'd Chromebook or Mac devices.
My proposal for moving forward to address all concerns:
750+ is all fast and will not be adjusted.

I have added a few notable device stats to the benchmarks datasheet that illustrate the difference between Chrome on Windows, Edge on Windows, and Chrome on *nix.
OK, so after far too much time with Chrome on Windows and questions from blue shirts, I've concluded it just has different characteristics that even vary significantly by processor arch that are too difficult to flawlessly identify with a single, small benchmark.
My new proposal would be to loosely abandon the automatic throttling multiplier selection and instead add a runWarning when the default throttling settings are being used and the BenchmarkIndex is ~500 or lower, saying the machine might be underpowered and the multiplier might need adjusting. This would solve the borderline variance issue, lessen the impact of a mistaken identification, and pull nice double duty of warning in situations like #9691. Unfortunately this kind of leaves DevTools users on weaker machines in the cold, as we're removing the automatic throttling customization (though they're likely already kinda broken being in this position without knowing about it).
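To make the proposal concrete, a minimal sketch of what that check might look like. Everything here is illustrative: `maybeWarnUnderpowered`, the `settings` shape, and the default-detection logic are hypothetical stand-ins, not actual Lighthouse internals.

```js
// Hypothetical sketch of the proposed runWarning; names and shapes are illustrative.
const UNDERPOWERED_BENCHMARK_INDEX = 500;

function maybeWarnUnderpowered(settings, benchmarkIndex, runWarnings) {
  // Only warn when the user hasn't customized throttling themselves
  // (assumed default: simulated throttling with a 4x CPU multiplier).
  const usingDefaults =
    settings.throttlingMethod === 'simulate' && settings.cpuSlowdownMultiplier === 4;
  if (usingDefaults && benchmarkIndex < UNDERPOWERED_BENCHMARK_INDEX) {
    runWarnings.push(
      `The tested device appears to have a slower CPU (BenchmarkIndex ${benchmarkIndex}). ` +
      'Results may be less accurate; consider adjusting the CPU throttling multiplier.'
    );
  }
  return runWarnings;
}
```

The point is that nothing automatic changes about the run itself; the user just gets a nudge when the result is likely unreliable.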
Ideas for solutions:
Any feelings here @exterkamp?
@exterkamp how do you feel about the above proposal?
Specifically.
runWarning if the default throttling settings are being used and BenchmarkIndex is <500 that the machine might be underpowered.

I like the idea, but I wonder if the calibration mechanism might need to come first to help alleviate our Lightrider benchmark problems? It's not unusual for an LR machine to cross below 500. Would we need to prioritize fleet optimization before we can start to surface this kind of warning, if we think <500 is a huge problem?
I'm not a huge fan of surfacing reports from PSI that have a runWarning about the machine being slow. So I like the idea, but it seems like a non-starter to me?
You're saying we improve the variability of existing BenchmarkIndex in LR first?
Once we feel better about that, then we can explore the LH side of changes?
this wfm.
I'm not sure any of this applies to LR. From the beginning I was already on board with basically ignoring any of this in LR since it's already using hard-coded thresholds, consistent hardware, different config, etc. It's still worth exploring how seriously slow the LR hardware is for better calibration (have we tried running https://browserbench.org/Speedometer2.0/ in WRS?), but I wasn't trying to propose we randomly add runWarnings to PSI results :)
_Disclaimer_: I think we might be trying to handle two diff problems. I have no problems with the above idea for benchmark index in general, I'm just thinking about the problem space in LR now.
You're saying we improve the variability of existing BenchmarkIndex in LR first?
IMO yes (or in parallel), I think that we should focus on getting that to be +/- X points of some value, then it is worth it to try to start making a better benchmark index for LR at least.
consistent hardware
At scale nothing is consistent. We have LR runs with a <100 benchmark index. It happens. So I think this might need to be dealt with before we can do anything on the LR side based on benchmark index. Or is the idea that we can eventually calibrate to any power of machine?
I wasn't trying to propose we randomly add runWarnings to PSI results :)
It's also unfair to give ourselves a pass. I think we should maybe have some retry logic in PSI if the index is out of spec, so that we don't surface bad results, but we also don't give ourselves a pass? #variance
Maybe something like this?
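A rough sketch of that retry idea. The function names, result shape, and threshold here are all hypothetical, not actual LR code:

```js
// Hypothetical retry wrapper: rerun if the machine benchmarked out of spec,
// rather than surfacing a result from an underpowered run.
const MIN_ACCEPTABLE_BENCHMARK_INDEX = 500;
const MAX_ATTEMPTS = 3;

async function runWithRetry(runLighthouse) {
  let result;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    result = await runLighthouse();
    if (result.benchmarkIndex >= MIN_ACCEPTABLE_BENCHMARK_INDEX) return result;
    // Machine was too slow for this run; try again on (hopefully) a healthier one.
  }
  return result; // Out of retries; surface the last result, warning and all.
}
```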
Would be easier if we were asynchronous 😉 ⌚️
Pretty much all agreed. We've conflated two separate issues multiple times in this journey, haha, maybe we could split this to track those separate efforts?
I think the situation in LR will end up needing to be handled completely differently. We should actually have significantly greater control over the flow there, retry availability, advance knowledge about what hardware we should be seeing, a different level of user actionability, etc. It's just a completely different ballgame than being randomly invoked a single time in a completely unknown environment. To be clear, I wasn't trying to suggest we give up on LR variance, just that my suggestions thus far have been separate from whatever we do there.
my next steps here:
future steps (lower priority):
I have been really struggling with...
develop the hybrid, longer benchmark that works on windows+mac into a script that someone could run
No benchmark I have tested so far (Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer, RegExp, Splay, NavierStokes, pdf.js, Mandreel, CodeLoad, zlib, typescript, Octane 2.0, Speedometer 2.0, Geekbench 4.0, Geekbench 5.0, and ULTRADUMB) can accurately capture how much the script execution time of a visit to theverge.com will increase by.
Some very interesting data thus far though. It turns out a lot has changed in 4 years, and the correct multiplier from a modern 2020 MacBook down to a Moto G4 is more like 10x throttling, not 4x throttling. This might be a larger conversation regarding our targets and whether we want to truly match a Moto G4 or just the ballpark of "a mobile phone".
I've updated the benchmark stats spreadsheet with the data

What we discussed in the meeting today:
Specific action items to consider this closed:
To make this even more fun...
Chrome Canary m86 has a significant regression in BenchmarkIndex performance (~2x on my MacBook), which I bisected to r787210, a v8 roll of 8.6.106. It contains several memory-related changes, so it seems reasonable that our memory allocation-based benchmark would be affected.
Given that this had been stable for over 2 years, it's definitely unfortunate to have such a massive change now. I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.
If yes, then great, we can assume it will continue to be stable and we helped find a bug.
Wow, benchmark index might actually be the most useful useless performance index!
The plot thickens.
tl;dr
It appears the benchmark became bimodal in this same change. It's unclear to me what determines which bucket of the distribution you land in, other than that it's some attribute of a toplevel task. On the first page load after the tab is created it's always the lower number (which is exactly the situation Lighthouse CLI finds itself in).

For example, I can run BenchmarkIndex 50 times for 25s straight in a loop and get all 50 values at the low number, then immediately refresh for another 50 runs of 25s straight and get all 50 values at the high number. They are never mixed within the same task, which is a very different effect from normal CPU contention from other processes on the machine (that manifests as a lower score within the same task while the CPU is occupied, returning to normal when CPU becomes available). If, however, I break up the benchmark index into different tasks, I can observe the bimodal behavior.
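The within-one-task vs. across-tasks comparison could be sketched like this (illustrative only; `runWithinOneTask`/`runAcrossTasks` are hypothetical helpers, and `benchmark` stands in for the string-concatenation microbenchmark under test):

```js
// Sketch: run the benchmark N times back-to-back within one task,
// vs. splitting runs across separate toplevel tasks via setTimeout.

function runWithinOneTask(benchmark, n) {
  const results = [];
  for (let i = 0; i < n; i++) results.push(benchmark());
  return results; // In the regressed build these all land in the same bucket.
}

function runAcrossTasks(benchmark, n) {
  return new Promise(resolve => {
    const results = [];
    const next = () => {
      results.push(benchmark());
      if (results.length === n) return resolve(results);
      setTimeout(next, 0); // New toplevel task per run; buckets can now differ.
    };
    next();
  });
}
```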
The devices I've retested so far...
Stable New Tab

Canary New Tab

Stable Refreshed Tab (long-lived, same tab used for all trials)

Canary Refreshed Tab (short-lived, ~just a few refreshes for new tab creation)

Canary Refreshed Tab (long-lived, same tab used for all trials)


I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.
FWIW if doing this, it would be best to send ultradumbBenchmark on its own so it's runnable in d8. I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).
The generated optimized code is identical (modulo memory addresses), the optimization timing appears to be the same, and GC time seems to only change by up to 10% in a quick profile, so it'll be interesting to hear what changed and if there was an intentional tradeoff for more realistic code/allocation/whatever.
I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).
Fascinating! I actually observe the opposite on my machine: using v8 8.6.106 alone yields the higher bucket value, which is ~15% faster than v8 8.6.105.
Repro Script
```sh
cat > benchmark.js <<EOF
function ultradumbBenchmark() {
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 100000; j++) s += 'a';
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}
console.log(ultradumbBenchmark());
EOF

npm install -g jsvu
jsvu v8@8.6.105
~/.jsvu/engines/v8-8.6.105/v8-8.6.105 benchmark.js
jsvu v8@8.6.106
~/.jsvu/engines/v8-8.6.106/v8-8.6.106 benchmark.js
jsvu v8@8.6.342
~/.jsvu/engines/v8-8.6.342/v8-8.6.342 benchmark.js
```
Maybe this, combined with the bimodality, suggests it's something specific to the way Chrome is running v8? Are there alternate modes or flags that could be flipped?
Using v8 8.6.106 alone yields the _higher_ bucket value which is ~15% faster than v8 8.6.105.
whoops, missed that the result is iterations / durationInSeconds. I see an improvement as well, then (~20%).
So some good stuff came out of this and we might have finally accomplished the title of the issue :)
tl;dr - V8 team gave us advice on how to tweak our microbenchmark to be more resilient, a simple average of the two tweaked benchmarks now correlates with JS execution time better than any other JS benchmark tested, and they're even going to add it to their waterfall to be alerted about major changes to it 🎉
Root Cause
The bimodality appears to be caused by GC heuristics used by Chrome. The identified V8 CL changes the page size, which normally increases GC performance but in Chrome's slow path causes far more GC interruptions.
| Before | After (slow) | After (fast) |
| -- | -- | -- |
| 34% time spent in GC | 71% time spent in GC | 24% time spent in GC |
The Fix
Our benchmark builds a string of length 100k, which just pushes past the threshold that triggers this crazy slow GC path. By reporting the iterations on a shorter string of length 10k and dividing the resulting index by 10, we end up with nearly identical benchmark results to the fast path on the length-100k string, but we now always fall into the fast GC path :)
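A sketch of that tweak applied to the repro script's `ultradumbBenchmark` (the `V2` name is mine, not the actual PR's):

```js
// Sketch of the described tweak: same loop, but a 10k string instead of 100k,
// with the raw index divided by 10 to stay comparable to the old benchmark.
function ultradumbBenchmarkV2() {
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 10000; j++) s += 'a'; // 10k stays under the slow-GC threshold
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  // Each iteration does 10x less work, so scale the index back down by 10.
  return Math.round(iterations / durationInSeconds / 10);
}
```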
The Improvement
The allocation/GC-dependence of this benchmark was sort of a feature, since cheap devices tend to struggle with memory ops, but the V8 team suggested trying a benchmark that preallocates an array of 100k and just copies elements into it. By combining the results of this tweaked benchmark with our previous one, we actually get a new benchmark that correlates with JS execution time on sites better than every other web benchmark we've tested and is only beaten out by GeekBench 🎉
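For illustration, the array-copy variant and the combination might look roughly like this (function names are mine; the actual PR may weight or combine the two differently than a simple average):

```js
// Sketch of the V8-suggested variant: preallocate 100k-element arrays and copy
// between them, so the work is raw JS execution rather than allocation/GC.
function arrayCopyBenchmark() {
  const src = new Array(100000).fill(0).map((_, i) => i);
  const dst = new Array(100000);
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    for (let j = 0; j < 100000; j++) dst[j] = src[j];
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}

// The combined index described above: a simple average of the two benchmarks.
function combinedBenchmarkIndex(stringIndex, arrayIndex) {
  return (stringIndex + arrayIndex) / 2;
}
```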
I'll open a PR for the tweaked combo benchmark and we can continue with our previous plans :)