A larger meta-issue on Gecko: https://bugzilla.mozilla.org/show_bug.cgi?id=1416082
I've done a pass through the MotionMark performance with RenderDoc on a Windows 10 + AMD machine. Interestingly, for most cases the actual workload GPU times are minuscule (!) compared to what other actors (such as Angle) are doing behind our backs. This may be due to the relatively good GPU in place (RX 460). Anyway, I'd expect the main optimization wins to come from the Gecko WR layer and from figuring out how to please Angle...
WR issues:
Bugzilla issues:
CopySubresourceRegion - 1419891

cc @jrmuizel @glennw
@kvark Is it a reasonable conclusion from looking at those bugs that the GPU time in WR is a small portion relative to the full screen rects and clears coming via angle? And if so, we should focus on investigating the angle issues first, to get the major performance wins before we look into the WR-specific issues?
Oh, you basically said as much in the first paragraph :) OK, I'll do some investigating of these issues. Thanks for writing them up!
@glennw yes, this is correct in the context of a good GPU. The text-rendering test was the worst offender in terms of actual content; the other tests are surely dominated by Angle weirdness.
On various devices (both Windows and Linux) I see a lot of time spent clearing/rendering things in intermediate targets that were already present in the previous frames. I assume that the ANGLE fixes you mentioned will alleviate some of that cost, but more aggressively caching things that we render in intermediate targets should help a ton as well, and even remove the need to render into intermediate targets at all for a lot of frames.
We can certainly cache items that we draw to intermediate surfaces - it's not a huge amount of work. We will definitely want to do it for power-saving reasons too, even if not performance.
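To make the caching idea concrete, here is a minimal sketch of reusing intermediate-target results across frames. It is not WebRender's actual code: `TaskKey`, `TextureId`, `RenderTaskCache`, and the `max_age` eviction policy are all hypothetical names and choices made up for illustration, and the `render` callback stands in for whatever would really issue the draw calls.

```rust
use std::collections::HashMap;

/// Hypothetical key describing what gets drawn into an intermediate target.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct TaskKey {
    pub content_hash: u64,
    pub width: u32,
    pub height: u32,
}

/// Hypothetical handle to a texture that holds an already rendered result.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TextureId(pub u32);

struct Entry {
    texture: TextureId,
    last_used_frame: u64,
}

pub struct RenderTaskCache {
    entries: HashMap<TaskKey, Entry>,
    frame: u64,
    max_age: u64,
    next_texture: u32,
    /// Number of times the expensive render path actually ran.
    pub renders_issued: u64,
}

impl RenderTaskCache {
    pub fn new(max_age: u64) -> Self {
        RenderTaskCache {
            entries: HashMap::new(),
            frame: 0,
            max_age,
            next_texture: 0,
            renders_issued: 0,
        }
    }

    /// Advance to the next frame and evict entries that have not been
    /// requested for more than `max_age` frames.
    pub fn begin_frame(&mut self) {
        self.frame += 1;
        let (frame, max_age) = (self.frame, self.max_age);
        self.entries
            .retain(|_, e| frame - e.last_used_frame <= max_age);
    }

    /// Return the cached texture for `key`, or call `render` once to
    /// produce it. On a cache hit no rendering (and no clear) happens.
    pub fn request<F: FnOnce(TextureId)>(&mut self, key: TaskKey, render: F) -> TextureId {
        let frame = self.frame;
        if let Some(entry) = self.entries.get_mut(&key) {
            entry.last_used_frame = frame;
            return entry.texture;
        }
        let texture = TextureId(self.next_texture);
        self.next_texture += 1;
        render(texture);
        self.renders_issued += 1;
        self.entries.insert(
            key,
            Entry {
                texture,
                last_used_frame: frame,
            },
        );
        texture
    }
}
```

With something along these lines, a frame whose intermediate content is unchanged would call `request` with the same key and get the old texture back, which is where both the performance and the power savings would come from.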
However, I think we need to do deeper investigations of the clear / surface costs as well - many of the issues are simple cases to fix - for example, double clears, or slow path shaders, or redundant work being done.
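As an illustration of the "double clears" case, a rough sketch of CPU-side tracking that skips a clear when the target is already cleared to the same color and nothing has been drawn since. `ClearTracker`, `TargetState`, and the `u32` target ids are hypothetical, not existing WebRender or ANGLE types, and the actual GPU clear is only indicated by a comment.

```rust
use std::collections::HashMap;

/// What we know about a render target's contents (hypothetical type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum TargetState {
    /// Cleared to this RGBA color, with no draws since.
    Cleared([f32; 4]),
    /// Something has been drawn since the last clear.
    Dirty,
}

pub struct ClearTracker {
    states: HashMap<u32, TargetState>,
    pub clears_issued: u32,
    pub clears_skipped: u32,
}

impl ClearTracker {
    pub fn new() -> Self {
        ClearTracker {
            states: HashMap::new(),
            clears_issued: 0,
            clears_skipped: 0,
        }
    }

    /// Request a clear. Returns `true` if a clear really had to be issued,
    /// `false` if it was redundant and got skipped.
    pub fn clear(&mut self, target: u32, color: [f32; 4]) -> bool {
        if self.states.get(&target) == Some(&TargetState::Cleared(color)) {
            self.clears_skipped += 1;
            return false;
        }
        // The real GPU clear call would go here.
        self.states.insert(target, TargetState::Cleared(color));
        self.clears_issued += 1;
        true
    }

    /// Record that something was drawn into `target`.
    pub fn mark_drawn(&mut self, target: u32) {
        self.states.insert(target, TargetState::Dirty);
    }
}
```

Counting `clears_skipped` over a captured frame would give a quick estimate of how much redundant clearing there is before committing to a real fix.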
Relatedly - I just had an idea for GPUs where clearing is expensive - not fully thought through yet. Could we do something like:
@glennw this is going to be less effective than the clear masks the driver is supposed to manage anyway, and it would prevent the driver from doing so... We may try this on different HW/platforms to see where it makes sense today.
Yea, in theory it shouldn't be faster than what most GPUs / drivers do - but might be an interesting experiment to try on some hardware.
i7-3770k + GeForce GTX 970 at 1080p in Windows 10 (commit 5bcb7f4):
|Test|Nightly 2018-04-17|Chrome 66.0.3359.117|
|-------------|-------------|-------------|
|Total|326.53 (±2.54%)|347.97 (±5.75%)|
|Multiply|296.34 (±1.63%)|562.19 (±0.95%)|
|Canvas Arcs|1244.86 (±0.77%)|394.60 (±1.07%)|
|Leaves|557.55 (±1.32%)|640.38 (±34.30%)|
|Paths|3231.18 (±1.15%)|1094.80 (±2.08%)|
|Canvas Lines|2086.48 (±1.99%)|3340.63 (±0.14%)|
|Focus|95.27 (±4.93%)|67.65 (±1.78%)|
|Images|40.79 (±1.87%)|94.21 (±1.51%)|
|Design|60.93 (±3.58%)|75.25 (±2.21%)|
|Suits|128.53 (±7.71%)|300.15 (±2.38%)|
#2083, #2085, #1648, and 1419863 can be marked fixed now (btw, 1419863 links to 1419871, and vice versa).