Webrender: MotionMark bouncing circles is worse the second time it's run

Created on 29 Nov 2017 · 6 Comments · Source: servo/webrender

On our AMD Windows 10 machine the score goes from around 400 to around 80 on the second run. @kvark thinks we might be fragmenting the GPU cache. I suggested it might be useful to expose the size of the GPU cache in the profiler HUD, to see whether that is the persistent state causing the change here.

performance bug


jrmuizel

👍1

Most helpful comment

We confirmed that the GPU cache texture quickly grows from 1024x512 to 1024x2048, and that performance still degrades over time after that. This supports the fragmentation hypothesis but does not confirm it just yet.

kvark on 29 Nov 2017

👍3

All 6 comments

Related - #2110, which was originally filed to address the issue, but it doesn't capture the extent in full.

I believe what's happening here is that the GPU cache gets fragmented. Since we upload whole rows, we end up transferring too much data, hitting ANGLE's soft spots. Steps to take:

confirm the hypothesis by looking at the GPU cache structure. If we don't have a way to inspect it, we should add one
add statistics to tell us how much data is uploaded, and what ratio of it is the valuable payload
consider implementing a GPU cache de-fragmentation step
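
The second step could be sketched roughly as below. This is only an illustration, not WebRender code: `UploadStats`, `record_row` and the constants are hypothetical names, and the sizes assume the figures mentioned in this thread (a 1024-block-wide cache texture with 16-byte blocks, uploaded one whole row at a time).

```rust
/// Assumed size of one GPU cache block, in bytes (figure quoted in this thread).
pub const BLOCK_SIZE_BYTES: usize = 16;
/// Assumed width of the GPU cache texture, in blocks.
pub const ROW_WIDTH_BLOCKS: usize = 1024;

/// Hypothetical per-frame counters for GPU cache uploads.
#[derive(Debug, Default, Clone, Copy)]
pub struct UploadStats {
    /// Number of whole rows sent to the GPU this frame.
    pub rows_uploaded: usize,
    /// Number of blocks within those rows that had actually changed.
    pub dirty_blocks: usize,
}

impl UploadStats {
    /// Record that one row was uploaded, of which `dirty_blocks_in_row` were dirty.
    pub fn record_row(&mut self, dirty_blocks_in_row: usize) {
        self.rows_uploaded += 1;
        self.dirty_blocks += dirty_blocks_in_row;
    }

    /// Total bytes transferred, since every upload covers a full row.
    pub fn bytes_uploaded(&self) -> usize {
        self.rows_uploaded * ROW_WIDTH_BLOCKS * BLOCK_SIZE_BYTES
    }

    /// Bytes that actually needed to be transferred.
    pub fn payload_bytes(&self) -> usize {
        self.dirty_blocks * BLOCK_SIZE_BYTES
    }

    /// Fraction of the transferred bytes that was useful payload (0.0 if nothing was uploaded).
    pub fn payload_ratio(&self) -> f64 {
        if self.rows_uploaded == 0 {
            return 0.0;
        }
        self.payload_bytes() as f64 / self.bytes_uploaded() as f64
    }
}
```

Under these assumptions, a frame that uploads two rows containing four dirty blocks in total transfers 32768 bytes for 64 bytes of payload, a ratio of about 0.2%. Showing that ratio next to the raw byte count would make it easy to see when most of the upload is wasted.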

kvark on 29 Nov 2017

👍3

I took a look at the GPU cache usage in MotionMark ramping complexity using the new shiny bars of #2139.
I can see that 773 rows are allocated, and about 1/4 to 1/3 of them are updated each frame at peak usage. This is way more than we need, so it's no wonder it's inefficient. The share of uploaded blocks that actually changed is so small it's hard to tell, but it is certainly below 5%.

kvark on 3 Dec 2017

The GPU cache uploads an entire row at a time, as soon as any block within that row is dirty. In the worst case, this can mean a 16 kB upload for a row when only a single 16-byte block has changed (!). This is one of those things that was meant to be a "come back later and optimize when we have useful data" :)

The problem with just issuing an update for every block-run that has changed is that I was seeing the CPU time get swamped by the number of API calls to update small blocks. That may no longer be the case (we should check), but if it is, we'll need to do something a bit smarter.

A couple of ideas we could try:

We could try just reducing the width of the GPU cache texture. This will limit the max size of a text run, but we do split text runs for this case now, and even a 256 wide texture would be big enough for the vast majority of text runs to not be split. This is probably the simplest / hackiest solution, but might be worth experimenting with.

Have some threshold for the number of dirty blocks in a row. For example, we could say, if more than a quarter of the blocks are dirty, just upload the entire row. Otherwise, issue uploads for each dirty block run. We'd have to work out a good threshold, but something like this might be quite reasonable.
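
The threshold idea could look roughly like the sketch below. It is not WebRender code: `RowUpload`, `dirty_runs`, `plan_row_upload` and the `max_dirty_fraction` parameter are hypothetical names, and a row is modelled simply as a slice of per-block dirty flags. The value 0.25 used in the note afterwards is the "quarter" example from above, not a tuned number.

```rust
use std::ops::Range;

/// What to upload for a single row of the GPU cache (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum RowUpload {
    /// No block in the row is dirty.
    Nothing,
    /// Enough blocks are dirty that one full-row upload is assumed to be cheaper.
    WholeRow,
    /// Upload each contiguous run of dirty blocks separately.
    Runs(Vec<Range<usize>>),
}

/// Collect contiguous runs of dirty blocks as half-open index ranges.
pub fn dirty_runs(dirty: &[bool]) -> Vec<Range<usize>> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &is_dirty) in dirty.iter().enumerate() {
        match (is_dirty, start) {
            // A new run begins at this block.
            (true, None) => start = Some(i),
            // The current run ended just before this block.
            (false, Some(s)) => {
                runs.push(s..i);
                start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the row.
    if let Some(s) = start {
        runs.push(s..dirty.len());
    }
    runs
}

/// Upload the whole row if more than `max_dirty_fraction` of its blocks are
/// dirty; otherwise issue one upload per dirty run.
pub fn plan_row_upload(dirty: &[bool], max_dirty_fraction: f64) -> RowUpload {
    let dirty_count = dirty.iter().filter(|&&d| d).count();
    if dirty_count == 0 {
        return RowUpload::Nothing;
    }
    if dirty_count as f64 > max_dirty_fraction * dirty.len() as f64 {
        return RowUpload::WholeRow;
    }
    RowUpload::Runs(dirty_runs(dirty))
}
```

For an 8-block row with blocks 1 and 2 dirty, `plan_row_upload(&row, 0.25)` yields a single run `1..3`; marking a third block dirty pushes the row over the quarter mark and the plan becomes `WholeRow`. A real threshold would probably also need to account for the number of runs, since the per-call CPU overhead mentioned above scales with API calls rather than with bytes.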