Summary
@exterkamp brought up some inconsistencies in our ULTRADUMB™ benchmark. We need to fix these before landing #6162.
After discussion and further investigation at a local Best Buy, the inconsistencies he found come down to the fact that Chrome on Windows criminally underperforms on this specific benchmark compared to Edge and similarly spec'd Chromebook or Mac devices.
My proposal for moving forward to address all concerns:
750+ is all fast and will not be adjusted.

I have added a few notable device stats to the benchmarks datasheet that illustrate the difference between Chrome on Windows, Edge on Windows, and Chrome on *nix.
OK, so after far too much time with Chrome on Windows and questions from blue shirts, I've concluded it just has different characteristics that even vary significantly by processor arch that are too difficult to flawlessly identify with a single, small benchmark.
My new proposal would be to loosely abandon the automatic throttling multiplier selection and instead add a runWarning when the default throttling settings are being used and the BenchmarkIndex is ~500 or lower, saying the machine might be underpowered and the multiplier might need adjusting. This would solve the borderline variance issue, lessen the impact of a mistaken identification, and pull nice double duty of warning in situations like #9691. Unfortunately this kind of leaves DevTools users on weaker machines in the cold, as we're removing the automatic throttling customization (though they're likely already kinda broken being in this position without knowing about it).
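To make the proposal concrete, a minimal sketch of what that check might look like. Everything here is illustrative: `maybeWarnUnderpowered`, the `settings` shape, and the default-detection logic are hypothetical stand-ins, not actual Lighthouse internals.

```js
// Hypothetical sketch of the proposed runWarning; names and shapes are illustrative.
const UNDERPOWERED_BENCHMARK_INDEX = 500;

function maybeWarnUnderpowered(settings, benchmarkIndex, runWarnings) {
  // Only warn when the user hasn't customized throttling themselves
  // (assumed default: simulated throttling with a 4x CPU multiplier).
  const usingDefaults =
    settings.throttlingMethod === 'simulate' && settings.cpuSlowdownMultiplier === 4;
  if (usingDefaults && benchmarkIndex < UNDERPOWERED_BENCHMARK_INDEX) {
    runWarnings.push(
      `The tested device appears to have a slower CPU (BenchmarkIndex ${benchmarkIndex}). ` +
      'Results may be less accurate; consider adjusting the CPU throttling multiplier.'
    );
  }
  return runWarnings;
}
```

The point is that nothing automatic changes about the run itself; the user just gets a nudge when the result is likely unreliable.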
Ideas for solutions:
Any feelings here @exterkamp?
@exterkamp how do you feel about the above proposal?
Specifically.
runWarning if the default throttling settings are being used and BenchmarkIndex is <500 that the machine might be underpowered.

I like the idea, but I wonder if the calibration mechanism might need to come first to help alleviate our Lightrider benchmark problems? It's not unusual for an LR machine to cross below 500. Would we need to prioritize fleet optimization before we can start to surface this kind of warning, if we think <500 is a huge problem?
I'm not a huge fan of surfacing reports from PSI that have a runWarning about the machine being slow. So I like the idea, but it seems like a non-starter to me?
You're saying we improve the variability of existing BenchmarkIndex in LR first?
Once we feel better about that, then we can explore the LH side of changes?
this wfm.
I'm not sure any of this applies to LR. From the beginning I was already on board with basically ignoring any of this in LR since it's already using hard-coded thresholds, consistent hardware, different config, etc. It's still worth exploring how seriously slow the LR hardware is for better calibration (have we tried running https://browserbench.org/Speedometer2.0/ in WRS?), but I wasn't trying to propose we randomly add runWarnings to PSI results :)
_Disclaimer_: I think we might be trying to handle two diff problems. I have no problems with the above idea for benchmark index in general, I'm just thinking about the problem space in LR now.
You're saying we improve the variability of existing BenchmarkIndex in LR first?
IMO yes (or in parallel), I think that we should focus on getting that to be +/- X points of some value, then it is worth it to try to start making a better benchmark index for LR at least.
consistent hardware
At scale nothing is consistent. We have LR runs with a <100 benchmark index. It happens. So I think this might need to be dealt with before we can do anything on the LR side based on benchmark index. Or is the idea that we can eventually calibrate to any power of machine?
I wasn't trying to propose we randomly add runWarnings to PSI results :)
It's also unfair to give ourselves a pass. I think we should maybe have some retry logic in PSI if the index is out of spec, so that we don't surface bad results, but we also don't give ourselves a pass? #variance
Maybe something like this?
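A rough sketch of that retry idea. The function names, result shape, and threshold here are all hypothetical, not actual LR code:

```js
// Hypothetical retry wrapper: rerun if the machine benchmarked out of spec,
// rather than surfacing a result from an underpowered run.
const MIN_ACCEPTABLE_BENCHMARK_INDEX = 500;
const MAX_ATTEMPTS = 3;

async function runWithRetry(runLighthouse) {
  let result;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    result = await runLighthouse();
    if (result.benchmarkIndex >= MIN_ACCEPTABLE_BENCHMARK_INDEX) return result;
    // Machine was too slow for this run; try again on (hopefully) a healthier one.
  }
  return result; // Out of retries; surface the last result, warning and all.
}
```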
Would be easier if we were asynchronous 😉 ⌚️
Pretty much all agreed. We've conflated two separate issues multiple times in this journey, haha, maybe we could split this to track those separate efforts?
I think the situation in LR will end up needing to be handled completely differently. We should actually have significantly greater control over the flow there, retry availability, advance knowledge about what hardware we should be seeing, a different level of user actionability, etc. It's just a completely different ballgame than being randomly invoked a single time in a completely unknown environment. To be clear, I wasn't trying to suggest we give up on LR variance, just that my suggestions thus far have been separate from whatever we do there.
my next steps here:
future steps (lower priority):
I have been really struggling with...
develop the hybrid, longer benchmark that works on windows+mac into a script that someone could run
No benchmark I have tested so far (Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer, RegExp, Splay, NavierStokes, pdf.js, Mandreel, CodeLoad, zlib, typescript, Octane 2.0, Speedometer 2.0, Geekbench 4.0, Geekbench 5.0, and ULTRADUMB) can accurately capture how much the script execution time of a visit to theverge.com will increase by.
Some very interesting data thus far though. It turns out a lot has changed in 4 years, and the correct multiplier from a modern 2020 MacBook down to a Moto G4 is more like 10x throttling, not 4x throttling. This might be a larger conversation regarding our targets and whether we want to truly match a Moto G4 or just the ballpark of "a mobile phone".
I've updated the benchmark stats spreadsheet with the data

What we discussed in the meeting today:
Specific action items to consider this closed:
To make this even more fun...
Chrome Canary m86 has a significant regression in BenchmarkIndex performance (~2x on my MacBook), which I bisected to r787210, a v8 roll of 8.6.106. It contains several memory-related changes, so it seems reasonable that our memory allocation-based benchmark would be affected.
Given that this had been stable for over 2 years, it's definitely unfortunate to have such a massive change now. I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.
If yes, then great, we can assume it will continue to be stable and we helped find a bug.
Wow, benchmark index might actually be the most useful useless performance index!
The plot thickens.
tl;dr
It appears the benchmark became bimodal in this same change. It's unclear to me what determines which bucket of the distribution you land in, other than that it's some attribute of a toplevel task. On the first page load after the tab is created it's always the lower number (which is exactly the situation Lighthouse CLI finds itself in).

For example, I can run BenchmarkIndex 50 times for 25s straight in a loop and get all 50 values at the low number, then immediately refresh for another 50 runs of 25s straight and get all 50 values at the high number. They are never mixed within the same task, which is a very different effect from normal CPU contention from other processes on the machine (that manifests as a lower score within the same task while the CPU is occupied, returning to normal when CPU becomes available). If, however, I break up the benchmark index into different tasks, I can observe the bimodal behavior.
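The within-one-task vs. across-tasks comparison could be sketched like this (illustrative only; `runWithinOneTask`/`runAcrossTasks` are hypothetical helpers, and `benchmark` stands in for the string-concatenation microbenchmark under test):

```js
// Sketch: run the benchmark N times back-to-back within one task,
// vs. splitting runs across separate toplevel tasks via setTimeout.

function runWithinOneTask(benchmark, n) {
  const results = [];
  for (let i = 0; i < n; i++) results.push(benchmark());
  return results; // In the regressed build these all land in the same bucket.
}

function runAcrossTasks(benchmark, n) {
  return new Promise(resolve => {
    const results = [];
    const next = () => {
      results.push(benchmark());
      if (results.length === n) return resolve(results);
      setTimeout(next, 0); // New toplevel task per run; buckets can now differ.
    };
    next();
  });
}
```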
The devices I've retested so far...
Stable New Tab

Canary New Tab

Stable Refreshed Tab (long-lived, same tab used for all trials)

Canary Refreshed Tab (short-lived, ~just a few refreshes for new tab creation)

Canary Refreshed Tab (long-lived, same tab used for all trials)


I think we should ask the v8 team if this is a signal of anything bad for real-world perf and it should be fixed.
FWIW if doing this, it would be best to send ultradumbBenchmark on its own so it's runnable in d8. I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).
The generated optimized code is identical (modulo memory addresses), the optimization timing appears to be the same, and GC time seems to only change by up to 10% in a quick profile, so it'll be interesting to hear what changed and if there was an intentional tradeoff for more realistic code/allocation/whatever.
I'm able to reproduce a regression on my machine of about 20% between d8 from 8.6.105 and 8.6.106 and it has remained slower since, including in the latest build (8.6.342).
Fascinating! I actually observe the opposite on my machine: using v8 8.6.106 alone yields the higher bucket value, which is ~15% faster than v8 8.6.105.
Repro Script
```sh
cat > benchmark.js <<EOF
function ultradumbBenchmark() {
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 100000; j++) s += 'a';
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}
console.log(ultradumbBenchmark());
EOF

npm install -g jsvu
jsvu v8@8.6.105
~/.jsvu/engines/v8-8.6.105/v8-8.6.105 benchmark.js
jsvu v8@8.6.106
~/.jsvu/engines/v8-8.6.106/v8-8.6.106 benchmark.js
jsvu v8@8.6.342
~/.jsvu/engines/v8-8.6.342/v8-8.6.342 benchmark.js
```
Maybe this, combined with the bimodality, suggests it's something specific to the way Chrome is running v8? Are there alternate modes or flags that could be flipped?
Using v8 8.6.106 alone yields the _higher_ bucket value which is ~15% faster than v8 8.6.105.
whoops, missed that the result is iterations / durationInSeconds. I see an improvement as well, then (~20%).
So some good stuff came out of this and we might have finally accomplished the title of the issue :)
tl;dr - V8 team gave us advice on how to tweak our microbenchmark to be more resilient, a simple average of the two tweaked benchmarks now correlates with JS execution time better than any other JS benchmark tested, and they're even going to add it to their waterfall to be alerted about major changes to it 🎉
Root Cause
The bimodality appears to be caused by GC heuristics used by Chrome. The identified V8 CL changes the page size, which normally increases GC performance but in Chrome's slow path causes far more GC interruptions.
| Before | After (slow) | After (fast) |
| -- | -- | -- |
| 34% time spent in GC | 71% time spent in GC | 24% time spent in GC |
The Fix
Our benchmark builds a string of length 100k, which just pushes past the threshold that triggers this crazy slow GC path. By reporting the iterations on a shorter string of length 10k and dividing the resulting index by 10, we end up with nearly identical benchmark results to the fast path on the length-100k string, but we now always fall into the fast GC path :)
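A sketch of that tweak applied to the repro script's `ultradumbBenchmark` (the `V2` name is mine, not the actual PR's):

```js
// Sketch of the described tweak: same loop, but a 10k string instead of 100k,
// with the raw index divided by 10 to stay comparable to the old benchmark.
function ultradumbBenchmarkV2() {
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    let s = ''; // eslint-disable-line no-unused-vars
    for (let j = 0; j < 10000; j++) s += 'a'; // 10k stays under the slow-GC threshold
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  // Each iteration does 10x less work, so scale the index back down by 10.
  return Math.round(iterations / durationInSeconds / 10);
}
```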
The Improvement
The allocation/GC-dependence of this benchmark was sort of a feature, since cheap devices tend to struggle with memory ops, but the V8 team suggested trying a benchmark that preallocates an array of 100k and just copies elements into it. By combining the results of this tweaked benchmark with our previous one, we actually get a new benchmark that correlates with JS execution time on sites better than every other web benchmark we've tested and is only beaten out by GeekBench 🎉
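For illustration, the array-copy variant and the combination might look roughly like this (function names are mine; the actual PR may weight or combine the two differently than a simple average):

```js
// Sketch of the V8-suggested variant: preallocate 100k-element arrays and copy
// between them, so the work is raw JS execution rather than allocation/GC.
function arrayCopyBenchmark() {
  const src = new Array(100000).fill(0).map((_, i) => i);
  const dst = new Array(100000);
  const start = Date.now();
  let iterations = 0;
  while (Date.now() - start < 500) {
    for (let j = 0; j < 100000; j++) dst[j] = src[j];
    iterations++;
  }
  const durationInSeconds = (Date.now() - start) / 1000;
  return Math.round(iterations / durationInSeconds);
}

// The combined index described above: a simple average of the two benchmarks.
function combinedBenchmarkIndex(stringIndex, arrayIndex) {
  return (stringIndex + arrayIndex) / 2;
}
```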
I'll open a PR for the tweaked combo benchmark and we can continue with our previous plans :)