On our AMD Windows 10 machine the score goes from around 400 to 80. @kvark thinks we might be fragmenting the GPU cache. I suggested it might be useful to expose the size of the GPU cache in the profiler HUD, to see whether that is the persistent state causing the change here.
Related - #2110, which was originally filed to address the issue, but it doesn't capture the extent in full.
I believe what's happening here is that the GPU cache gets fragmented. Since we are uploading whole rows, we end up transferring too much data, hitting ANGLE's soft spots. Steps to take:
We confirmed that the GPU cache texture quickly grows from 1024x512 to 1024x2048, and then the performance still degrades with time. This supports the fragmentation hypothesis but doesn't confirm it just yet.
I took a look at the GPU cache usage in MM ramping complexity using the new shiny bars of #2139
I can see that 773 rows are allocated, and about 1/4 or 1/3 of them are updated each frame in peak usage. This is waaay more than we need, and no wonder it's inefficient. The actual uploaded block count is so small it's hard to tell, certainly below 5%.
The GPU cache uploads an entire row at a time, as soon as any block within that row is dirty. In the worst case this means a 16 kB upload for a row when only a single 16-byte block has changed (!). This is one of those things that was meant to be a "come back later and optimize when we have useful data" :)
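To put a number on that worst case, here is a rough sketch of the arithmetic. It assumes a row is 1024 blocks wide (matching the 1024-wide texture mentioned above) and that each block is 16 bytes; the constant and function names are made up for illustration and are not WebRender's.

```rust
/// Assumed layout (not taken from the WebRender source): 1024 blocks per row,
/// 16 bytes per block, matching the numbers quoted in this thread.
const ROW_WIDTH_BLOCKS: usize = 1024;
const BLOCK_SIZE_BYTES: usize = 16;

/// Bytes transferred when every row that contains a dirty block is uploaded whole.
fn row_upload_bytes(dirty_rows: usize) -> usize {
    dirty_rows * ROW_WIDTH_BLOCKS * BLOCK_SIZE_BYTES
}

/// Bytes that actually changed.
fn dirty_bytes(dirty_blocks: usize) -> usize {
    dirty_blocks * BLOCK_SIZE_BYTES
}

/// How many times more data is uploaded than was modified.
fn amplification(dirty_rows: usize, dirty_blocks: usize) -> f64 {
    row_upload_bytes(dirty_rows) as f64 / dirty_bytes(dirty_blocks) as f64
}
```

With one dirty block in one row, `row_upload_bytes(1)` is 16384 bytes against 16 bytes of real change, so `amplification(1, 1)` comes out at 1024x.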
The problem with just issuing an update for every block-run that has changed is that I was seeing the CPU time get swamped by the number of API calls to update small blocks. That may no longer be the case (we should check), but if it is, we'll need to do something a bit smarter.
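For illustration, one possible middle ground between "whole row" and "one call per block-run" is to merge nearby dirty blocks into a single upload whenever the gap between them is small. This is only a sketch, not necessarily what we'd end up doing: `Run`, `coalesce_runs`, and `max_gap` are hypothetical names, and the right gap threshold would have to come from measurement.

```rust
/// A half-open range of blocks within one row: [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    start: usize,
    end: usize,
}

/// Merge the dirty block indices of a single row (sorted ascending) into upload runs.
/// A dirty block joins the previous run if at most `max_gap` clean blocks separate them,
/// trading a few extra uploaded blocks for fewer update calls.
fn coalesce_runs(dirty: &[usize], max_gap: usize) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &block in dirty {
        let mut merged = false;
        if let Some(last) = runs.last_mut() {
            if block <= last.end + max_gap {
                // Extend the previous run to cover this block (and the gap before it).
                last.end = last.end.max(block + 1);
                merged = true;
            }
        }
        if !merged {
            runs.push(Run { start: block, end: block + 1 });
        }
    }
    runs
}

/// Total number of blocks that would be uploaded for the given runs.
fn uploaded_blocks(runs: &[Run]) -> usize {
    runs.iter().map(|r| r.end - r.start).sum()
}
```

With dirty blocks `[0, 1, 2, 10, 500]`, a `max_gap` of 0 yields three runs covering 5 blocks, while a `max_gap` of 8 yields two runs covering 12 blocks. Either is far below the 1024 blocks of a full-row upload; whether the call count is still a CPU problem is exactly what we'd need to check.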
A couple ideas we could try:
@kvark Is this still relevant?
Baaam!