After doing some recent profiling and optimization work, it's now clear to me where our major bottlenecks are. I've tried to write up a basic summary and plan for fixing these below. Additionally, there are a number of outstanding bugs / issues that should become significantly easier to fix once some of these changes are in place.
A lot of this is quite vague and hand-wavy but hopefully it's useful to give some context to the upcoming changes I'd like to work on.
I don't think solving most of the problems is particularly difficult. The key is working out a way to do them incrementally, so we can continue working on other bugs and issues at the same time.
First, a list of the most obvious performance issues and major bugs I'm aware of:
The DL deserialization time is a big problem. It's often ~50% of the total frame time when I profile. Fixing that is unrelated to the other things I'll mention, but I list it here since it has such a massive potential effect on CPU performance.
Apart from DL time, the major CPU bottleneck is dealing with clip hierarchies and the clip-scroll tree. This has more to do with the size of the clip-scroll tree being larger than (we think) is required - there is some ongoing work in Gecko to try and improve this. Nonetheless, we do a lot of hashing here, and a lot of memory allocations (not in the clip-scroll tree itself, but in processing masks and building clip tasks). We also do a lot of CPU work due to an impedance mismatch between the clip-scroll tree and the batching / shader code (specifically, the way we handle clip-scroll groups, packed layers etc).
Those two major areas listed above typically account for ~80% of the CPU time processing a frame. After those, batch creation is typically the next most expensive area. More details on this in the GPU section.
On the sites we're seeing in Gecko, our draw call counts are higher than ideal. They are still good (typically ~50 per frame), but improving this would be a good win for both CPU time (batching) and GPU time (shader switches etc).
It occurs to me that many of the effects we now have separate shaders for can in fact be unified into a smaller set of shaders. For example, the line decoration shader can just be a rectangle with a clip mask. Similarly, many of the specialized border shaders we have now can be vastly simplified by moving the shaping into a clip mask - often collapsing to a rectangle shader with a clip mask.
The box shadow shader, which does an exact evaluation, doesn't scale well to large rectangles and/or large blur radii. Instead, we can remove the box shadow shader altogether, and unify this with the text shadow system, doing a traditional separable gaussian blur. There are several optimizations available to make the blur shader significantly faster than it is now, which would also help out both text shadows and box shadows. This will also make it significantly easier to handle all the variations of box shadows that we currently don't handle correctly (there are many bugs in GH about this).
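To make the "separable gaussian blur" idea concrete, here is a minimal CPU-side sketch in Rust. It is not the WR blur shader: the function names, the single-channel `f32` image layout, the clamp-to-edge sampling and the choice of `sigma = radius / 2` are all assumptions made for illustration. The point it shows is that a 2D blur can be done as a horizontal pass followed by a vertical pass, so the per-pixel cost grows with the radius rather than with the radius squared.

```rust
/// Build normalized 1D Gaussian weights covering [-radius, radius].
/// Assumption: sigma = radius / 2 (an illustrative choice only).
fn gaussian_kernel(radius: usize) -> Vec<f32> {
    let sigma = (radius as f32 / 2.0).max(0.5);
    let mut weights: Vec<f32> = (0..=2 * radius)
        .map(|i| {
            let x = i as f32 - radius as f32;
            (-(x * x) / (2.0 * sigma * sigma)).exp()
        })
        .collect();
    let sum: f32 = weights.iter().sum();
    for w in &mut weights {
        *w /= sum;
    }
    weights
}

/// Run one 1D blur pass over a row-major, single-channel image.
/// Samples outside the image are clamped to the nearest edge pixel.
fn blur_pass(src: &[f32], width: usize, height: usize, kernel: &[f32], horizontal: bool) -> Vec<f32> {
    let radius = (kernel.len() / 2) as isize;
    let mut dst = vec![0.0f32; src.len()];
    for y in 0..height {
        for x in 0..width {
            let mut acc = 0.0f32;
            for (k, w) in kernel.iter().enumerate() {
                let offset = k as isize - radius;
                let (sx, sy) = if horizontal {
                    ((x as isize + offset).clamp(0, width as isize - 1), y as isize)
                } else {
                    (x as isize, (y as isize + offset).clamp(0, height as isize - 1))
                };
                acc += *w * src[sy as usize * width + sx as usize];
            }
            dst[y * width + x] = acc;
        }
    }
    dst
}

/// Two 1D passes (horizontal, then vertical) instead of one 2D convolution.
fn separable_blur(src: &[f32], width: usize, height: usize, radius: usize) -> Vec<f32> {
    let kernel = gaussian_kernel(radius);
    let horizontal = blur_pass(src, width, height, &kernel, true);
    blur_pass(&horizontal, width, height, &kernel, false)
}
```

In this sketch each output pixel costs `2 * (2 * radius + 1)` samples across the two passes, where a direct 2D kernel would need `(2 * radius + 1)^2`.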
We know that any time we move primitives into the opaque pass, it's typically a big GPU time win. Unfortunately, right now, Gecko is getting no benefit from opaque rectangles that have rounded corners. This is due to a deficiency inside WR - we can only create opaque inner segments when the rounded clip region exists on the item. Servo uses these, but Gecko uses the clip-scroll tree to provide these.
Subpixel text rendering. There is a GL extension present on all hardware except early Intel gen6 that can make subpixel text rendering much faster. We should consider using this when available. We also should consider patching ANGLE to use the D3D version of this extension.
Introduce a new batch primitive type - let's call it a brush type. This can handle solid colors, textures and simple gradients. We use this for rectangles, images, line decorations, text (potentially), simple gradients etc. Having a new batch primitive type allows us to start using a compressed vertex format (approx. half the size of existing), which will help with CPU time, and we can switch over to it incrementally.
Modify how we build the clip chain for a primitive to provide a list of clips in local space, and other space. Effectively this splits the clip hierarchy when the first reference frame is encountered. Modify the clip-scroll tree to build the clip chains once at the start of the frame, rather than per-primitive-run as we do now.
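A rough Rust sketch of the split described above, assuming the clips for a primitive are available as a slice ordered from the primitive outwards towards the root. The `Clip` and `ClipChain` types, the `reference_frame` id field and `build_clip_chain` are hypothetical names for illustration, not existing WR structures.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Hypothetical clip record: a rect plus the id of the reference frame it lives in.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Clip {
    reference_frame: u32,
    rect: Rect,
}

/// Clips split into those sharing the primitive's space and everything else.
#[derive(Debug, Default, PartialEq)]
struct ClipChain {
    local: Vec<Clip>,
    other: Vec<Clip>,
}

/// Walk the clips from the primitive outwards. Everything before the first
/// clip in a different reference frame is "local"; that clip and everything
/// after it goes into "other", even if a later clip shares the primitive's
/// reference frame id again.
fn build_clip_chain(prim_reference_frame: u32, clips_leaf_to_root: &[Clip]) -> ClipChain {
    let split = clips_leaf_to_root
        .iter()
        .position(|c| c.reference_frame != prim_reference_frame)
        .unwrap_or(clips_leaf_to_root.len());
    let (local, other) = clips_leaf_to_root.split_at(split);
    ClipChain {
        local: local.to_vec(),
        other: other.to_vec(),
    }
}
```

The `local` list is what cheap CPU-side rect math could work on, while the `other` list is the part that would be rendered into a clip mask.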
On the brush primitive type, support the concept of "segmenting" the primitive rect based on the presence of any local space clips into opaque and transparent regions. This will mean that Gecko benefits from segment optimizations when using non-item-clips. This would replace the current code that tries to subtract the inner rect from the primitive rect based on the presence of a per-item clip. That code has a couple of problems (in addition to only working on per-item clips). (a) The way it subtracts rectangles can introduce t-junctions, which can show up as cracks as primitives rotate. (b) We apply it to all rectangles - instead, we should only segment brush primitives based on some criteria (e.g. for small rectangles it's not worth the CPU time to segment them). Further, the new segmenting code can also segment the outer strips of a primitive that is being transformed, so we only need to run the AA shader on the outside edges rather than the whole primitive, as we do now. Since all this work is done only with clips in the same local space, the CPU math is quite trivial. Any clips in a different space simply get rendered into the primitive clip mask.
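As an illustration of segmenting, the Rust sketch below splits an opaque primitive that has a uniform-radius rounded clip (in the same local space, covering the same rect) into a 3x3 grid. Only the four corner cells need the clip mask; the other five can go to the opaque pass. Because every cell shares the same grid lines, neighbouring segments share full edges and no t-junctions are introduced. All names, the uniform radius, and the `MIN_AREA_TO_SEGMENT` threshold are assumptions for illustration, not the real WR code or its heuristics.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum SegmentKind {
    /// Fully inside the clip: can be drawn in the opaque pass.
    Opaque,
    /// Touches a rounded corner: needs the clip mask in the alpha pass.
    NeedsMask,
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Segment {
    rect: Rect,
    kind: SegmentKind,
}

/// Hypothetical cutoff: below this area, segmenting is assumed not to be
/// worth the CPU time, so the whole rect is drawn with the mask.
const MIN_AREA_TO_SEGMENT: f32 = 256.0;

fn segment_rounded_rect(prim: Rect, radius: f32) -> Vec<Segment> {
    if radius <= 0.0 {
        return vec![Segment { rect: prim, kind: SegmentKind::Opaque }];
    }
    if prim.area() < MIN_AREA_TO_SEGMENT {
        return vec![Segment { rect: prim, kind: SegmentKind::NeedsMask }];
    }
    // Clamp the radius so the grid lines never cross each other.
    let r = radius
        .min((prim.x1 - prim.x0) / 2.0)
        .min((prim.y1 - prim.y0) / 2.0);
    let xs = [prim.x0, prim.x0 + r, prim.x1 - r, prim.x1];
    let ys = [prim.y0, prim.y0 + r, prim.y1 - r, prim.y1];
    let mut segments = Vec::with_capacity(9);
    for row in 0..3 {
        for col in 0..3 {
            let rect = Rect { x0: xs[col], y0: ys[row], x1: xs[col + 1], y1: ys[row + 1] };
            if rect.area() <= 0.0 {
                continue; // Degenerate cell, e.g. when the radius was clamped.
            }
            let is_corner = row != 1 && col != 1;
            let kind = if is_corner { SegmentKind::NeedsMask } else { SegmentKind::Opaque };
            segments.push(Segment { rect, kind });
        }
    }
    segments
}
```

For a 100x100 rect with a radius of 10, this yields four 10x10 masked corners and five opaque cells, so 96% of the area avoids the alpha pass.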
The brush primitive shader no longer uses packed layers directly. Instead, we modify it to have knowledge of reference frames and scroll frames. The vertex shader handles this work itself, applying scroll offsets to the reference frame. This has the potential to remove packed layers, and simplify the CPU code that deals with clip-scroll groups etc.
Once the basic brush shader is in place, we start incrementally moving other primitive types over to the same system, allowing us to remove the concept of packed layers etc and have all primitives using the new compressed vertex format.
cc @msreckovic @jrmuizel @kvark @nical @mrobinson
cc @Gankro
Another benefit of this plan is that the brush type no longer has separate shaders for transform vs. axis aligned variants. Instead, we have an alpha pass variant (that applies AA and clip masks) for the alpha segments, and an opaque variant. Both the alpha and opaque variant handle transforms without having to switch between transform / non-transform shaders and breaking batches.
The ring of it sounds good but I am not sure I really understand a lot of it.
This can handle solid colors, textures and simple gradients. We use this for rectangles, images, line decorations, text (potentially), simple gradients etc.
Just to be sure because everything seems to revolve around this: The brush shader is some sort of uber-shader that would replace the shaders for rectangles, images, etc, right?
I remember you saying something in Toronto about being careful about not bloating the rect shader too much because it touches so many pixels.
Coherent branching is quite fast these days, but often VGPR pressure depends on the biggest branch even if not taken, so that's something to be careful about. That said, I like having the possibility to merge batches and it doesn't necessarily preclude having a dedicated solid color shader for specific cases like very large backgrounds I suppose. We could select between the two ways to draw solid colors depending on how much it breaks batches vs the area to fill.
...or just measure that the uber-shader is fast enough and not bother, but we'd need to be thorough about testing on different hardware.
Having a new batch primitive type allows us to start using a compressed vertex format
By vertex format here you mean vertex format as in what's in the VBO, or the format of the data we pass in the gpu cache texture that the vertex shader consumes?
On the brush primitive type, support the concept of "segmenting" the primitive rect based on the presence of any local space clips into opaque and transparent regions.
I am not sure I am getting 100% of the implications behind segmenting (if there is anything other than splitting a primitive into smaller primitives).
we modify it to have knowledge of reference frames and scroll frames
I don't know what this involves at all (not saying I disapprove, I am just curious about the details).
For the record, FastUIDraw takes a uber-shader based approach and it seems to work well for them (on intel hardware).
@glennw awesome write-up, thank you for looking at performance and analyzing the situation! 👏
I generally agree, there are a lot of points here that can be seen as immediate action items for us. I especially like the idea of using the clip masks more :). Here are the things I'd like to add/correct:
The DL deserialization time is a big problem. It's often ~50% of the total frame time when I profile. Fixing that is unrelated to the other things I'll mention, but I list it here since it has such a massive potential effect on CPU performance.
Our DL deserialization cost includes the flattening. This is done per frame/scene. In some cases, there is just a tiny thing changed that causes the whole frame tree to be rebuilt and then re-flattened. One way to address it is to use the document API aggressively. Make sure the UI chrome is in a different scene (thus, a document) than the page, so that they can be flattened/updated independently.
Another idea, IIRC from our discussion at the All Hands, was to provide the bounding boxes in a separate stream, so that we can ignore the invisible elements without even deserializing/flattening them.
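A small Rust sketch of what a separate bounds stream could look like, under the assumption that each item's bounding box is stored in an array parallel to its serialized payload. `SplitDisplayList` and `decode_visible` are hypothetical names, and the `decode` callback stands in for whatever deserializer is actually used.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

impl Rect {
    fn intersects(&self, other: &Rect) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.y0 < other.y1 && other.y0 < self.y1
    }
}

/// Hypothetical layout: bounds[i] is the bounding box of payloads[i].
struct SplitDisplayList {
    bounds: Vec<Rect>,
    payloads: Vec<Vec<u8>>,
}

/// Check the cheap bounds stream first and only decode items whose
/// bounding box intersects the viewport.
fn decode_visible<T>(
    list: &SplitDisplayList,
    viewport: &Rect,
    decode: impl Fn(&[u8]) -> T,
) -> Vec<T> {
    list.bounds
        .iter()
        .zip(&list.payloads)
        .filter(|(bounds, _)| bounds.intersects(viewport))
        .map(|(_, payload)| decode(payload.as_slice()))
        .collect()
}
```

Items that fail the bounds test never have their payload touched, which is the saving the idea is after.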
On the sites we're seeing in Gecko, our draw call counts are higher than ideal. They are still good (typically ~50 per frame), but improving this would be a good win for both CPU time (batching) and GPU time (shader switches etc).
I don't think trying to reduce from 50 switches to less would give us big wins, tbh, both in terms of CPU and GPU time. I see us past the point of diminishing returns on that front. The other optimizations (notably, moving more stuff into the opaque pass) should be significantly more cost efficient for the effort.
Note that while this is just a note from the list, it does affect one of the main points in the proposal - the "brush" shader. I hope we'll not end up with this shader dominating most of the rendering. For example, I expect a simple rectangle-filling shader to still be needed, since it has better CU utilization than a more generic counterpart.
Subpixel text rendering. There is a GL extension present on all hardware except early Intel gen6 that can make subpixel text rendering much faster.
Could you provide the details?
Isn't this similar to what we did way back in the WR1 days, which had a more ubershader-like approach?
I kind of think that maybe two basic "solid color" and "image" might be the way to go, with solid colors and simple gradients easily collapsible into one shader. Both shaders would support alpha masks. My work with Pathfinder seems to suggest that below a minimum fragment shader size the overhead of the ROPs dominates the FS overhead, so it's not worth microoptimizing too much here.
I do tend to agree with @kvark that below 50 draw calls we aren't likely to see a lot of benefit. Factoring into a smaller number of shaders might make it easier for me to understand WR again ;) But I'm not sure it'll move the needle on performance.
On the flip side, there was a time not so long ago when 50 draw calls was the absolute maximum we could afford on a lot of mobile hardware (whatever the content of the draw calls). I wouldn't be surprised if those days are mostly behind us, but it wasn't that long ago.
Also, it'd be useful to see this in terms of the worst case scenario. A compositor based architecture is typically good at having roughly the same behavior with simple and complex web pages, but with WebRender the best and worst cases tend to be more extreme. If 50 draw calls is not reason enough to rework the shaders, we can still revisit this solution when we deal with the long tail of pages that create many times as many. That or some other solution, but I just wanted to put into perspective that I am less worried about the typical page than the ones that give us a hard time.
One way to address it is to use the document API aggressively. Make sure the UI chrome is in a different scene
It would be really good to move forward with something like this.
Sure, I don't think anyone would complain about fewer draw calls long-term. I'm just saying that optimizing clips is probably more important in the short term.
I think I may have given the wrong impression with talk of the brush shader. Although it could be implemented as an uber-shader, that's not my intention. The main reason is to provide an incremental way to move primitives over to an approach that supports drawing primitives with segments.
I envisage the brush primitive having a small number of shaders, similar to what @pcwalton mentioned above, but that's really an implementation detail (ubershader might make sense on some GPUs for example). The most important part is an incremental way to start moving primitives over to supporting segments. In answer to @nical the only real goals of segmenting are (a) move more pixels into the opaque pass and (b) allow some of the clip mask generation to be faster (e.g. a corner segment only needs to evaluate one ellipse SDF, instead of the four it does now).
In terms of the document API, everyone seems to be on the same page with using that for the chrome etc - this seems like a good thing to work on in parallel to the rest, since it's basically orthogonal to all the above.
@nical In terms of the vertex formats - yes, we should be able to make our brush vertex format significantly smaller.
@kvark The GL extension that makes subpixel rendering more efficient is https://www.khronos.org/opengl/wiki/Blending#Dual_Source_Blending. With this, we no longer need to break batches based on the color of the text. Additionally, it might even make sense in that case for the text shader to be unified with the brush image shader.
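To illustrate why dual source blending removes the batch break, here is a CPU-side Rust emulation of the blend math. It assumes the existing approach keeps the text color in the blend constant (pipeline state that has to change per color), and that with dual source blending the fragment shader writes two outputs per fragment: a per-channel color and a per-channel blend weight. The function names and the premultiplied color convention are assumptions for illustration, not the WR shader code.

```rust
type Rgb = [f32; 3];

/// Assumed existing approach: the shader outputs only the subpixel mask and
/// the text color lives in the blend constant, which is pipeline state.
/// Per channel: dst' = constant * mask + dst * (1 - mask)
fn blend_constant_color(dst: Rgb, mask: Rgb, constant: Rgb) -> Rgb {
    let mut out = [0.0f32; 3];
    for i in 0..3 {
        out[i] = constant[i] * mask[i] + dst[i] * (1.0 - mask[i]);
    }
    out
}

/// With dual source blending the fragment emits two values, so the color
/// can come from per-instance data. `color` is assumed premultiplied by `alpha`.
fn dual_source_outputs(color: Rgb, alpha: f32, mask: Rgb) -> (Rgb, Rgb) {
    let mut src0 = [0.0f32; 3];
    let mut src1 = [0.0f32; 3];
    for i in 0..3 {
        src0[i] = color[i] * mask[i];
        src1[i] = alpha * mask[i];
    }
    (src0, src1)
}

/// Emulates a blend function of (ONE, ONE_MINUS_SRC1_COLOR):
/// dst' = src0 + dst * (1 - src1), evaluated per channel.
fn blend_dual_source(dst: Rgb, src0: Rgb, src1: Rgb) -> Rgb {
    let mut out = [0.0f32; 3];
    for i in 0..3 {
        out[i] = src0[i] + dst[i] * (1.0 - src1[i]);
    }
    out
}
```

For opaque text both paths produce the same pixel, but in the dual source path nothing about the text color is blend state, so glyphs of different colors can share a draw call.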
These issues below are all in some way related to this work:
Each of these issues is either (a) an individual component of this work or (b) fixed by the changes proposed here or (c) easier to fix once this work is complete.
Update:
The initial brush structures and shaders have landed. Built on top of these are some significant performance optimizations for (common) box shadows. Additionally, we've started the work to improve the performance of clips (using primitive clip masks and caching the clip chains at the start of the frame).
Looking at the performance of rendering some simple sites, including about:blank, there are some obvious issues with how we are selecting which items get assigned to render targets. For example, on about:blank Gecko is submitting a 1920x1446 stacking context with opacity(0) that is still being allocated and blended onto the main scene.
I could add a hack to handle this case, but doing the correct fix will sort this, and a heap of other known issues. Specifically, what I'm planning next is:
Picture and PrimitiveRun structs during the frame flattening. During the render task creation, this tree is what gets walked, instead of the stacking context tree. Once this is complete, we get the following (either as part of this task, or simple follow ups):
Addendum: The work above will also be the foundation that allows us to accurately work out where we can use sub-pixel AA on intermediate targets.
Note to self: the build step of Picture is a good place to apply optimizations. A couple of examples of simple optimizations that could have significant performance wins:
Note to self: We seem to be invoking split composites and plane splitting in a lot of places where it doesn't seem necessary. This could probably explain what looks like bad batching in the profiler on a lot of sites...
This is basically done now. We still have a couple of primitives that don't run through brush types yet, but we're working towards them incrementally (handling the remaining legacy image shader, border clip sources).