Dxvk: Shader input copies are mean to compilers

Created on 24 Jul 2018 · 12 Comments · Source: doitsujin/dxvk

DXVK really likes to generate this pattern:

layout(location=0) in vec3 a1;
layout(location=1) in vec2 a2;
vec4 shader_in[32];
void vs_main() {
    // Do stuff
}

void main() {
    shader_in[0].xyz = a1;
    shader_in[1].xy = a2;
    vs_main();
}

This works, but unfortunately it's kind of a pain for the compiler to chew through. It's not so bad for vertex shaders, but for tessellation shaders it's especially bad. I'm seeing this pattern in several places:

layout(location=0) in vec3[3] a1;
layout(location=1) in vec2[3] a2;
vec4 shader_in[3][32];
void tcs_main() {
    vec3 a = shader_in[gl_InvocationID][0].xyz;
    vec2 b = shader_in[gl_InvocationID][1].xy;
    // Do stuff
}

void main() {
    shader_in[0][0].xyz = a1[0];
    shader_in[1][0].xyz = a1[1];
    shader_in[2][0].xyz = a1[2];
    shader_in[0][1].xy = a2[0];
    shader_in[1][1].xy = a2[1];
    shader_in[2][1].xy = a2[2];
    tcs_main();
}

Right now, our compiler is doing a fairly literal translation, which is rather problematic when you have a tessellation shader with 8 inputs each of which has 9 vertices; that's 4.5 KB of input data that gets loaded and then stuffed into a temporary array. That array then gets spilled out to scratch space because it's 4.5 KB, and the shader is both slow and a mess to read/debug. This happens on even really simple shaders that just copy their inputs into the outputs.

What we'd like to have is vec3 a = a1[gl_InvocationID]. Unfortunately, turning what we get (which came from something like that) back into something sensible requires the compiler to figure out quite a bit of information:

  1. Only the x, y, and z components of shader_in[*][0] are ever read
  2. shader_in[*][0] is basically a vec3
  3. Because it's basically a vec3, the write-masks don't matter
  4. Since the write-masks don't matter, the assignment to shader_in[*][0] of a1 copies the entire array
  5. Since the assignment is a copy, we can treat a read from shader_in[x][0] as a read of a1[x]

That train of thought is easy for me to write down and you to read but making our compiler figure it all out is turning out to be rather painful. :-/ It's especially frustrating because DXVK clearly has enough information to declare the original inputs with their proper size. Is there some way DXVK could generate a bit nicer code? I know DX lets you do some crazy indirecting on inputs but maybe we can only make the copy if it's really needed?
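
For reference, here is a rough sketch of the kind of output being asked for: a pass-through tessellation control shader that reads its per-vertex inputs directly instead of going through shader_in. It is an illustration, not what DXVK emits; the output names o1/o2 and the fixed tessellation levels are assumptions added only so the shader is complete.

```glsl
#version 450
// Sketch only: names o1/o2 and the tessellation levels are illustrative.
layout(vertices = 3) out;

layout(location = 0) in vec3 a1[];
layout(location = 1) in vec2 a2[];

layout(location = 0) out vec3 o1[];
layout(location = 1) out vec2 o2[];

void main() {
    // Direct per-vertex reads: no temporary array, so nothing for the
    // compiler to untangle or spill to scratch.
    o1[gl_InvocationID] = a1[gl_InvocationID];
    o2[gl_InvocationID] = a2[gl_InvocationID];

    // Constant levels, written once per patch.
    if (gl_InvocationID == 0) {
        gl_TessLevelOuter[0] = 1.0;
        gl_TessLevelOuter[1] = 1.0;
        gl_TessLevelOuter[2] = 1.0;
        gl_TessLevelInner[0] = 1.0;
    }
}
```

With inputs declared at their real size, each read maps straight onto an input load, and steps 1 through 5 above are no longer something the compiler has to prove.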

All 12 comments

I'm aware that this is a problem for tess/geo shaders, but it would be incredibly hard to do, and in some cases impossible. Inputs can be dynamically indexed, system values are also arbitrarily stuffed into the input register space, a single input register can consist of multiple input variables, and some parts of a register might not even be defined.
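
To make those cases concrete, here is a hedged GLSL sketch of a vertex shader where the flat array is hard to avoid. Everything in it is made up for illustration (the variable names, the register layout, and the push-constant index standing in for a dynamically computed register number); it is not taken from a real DXBC shader or from DXVK's output.

```glsl
#version 450
// Illustrative only: one input register (location 1) is shared by two
// separate variables plus a system value, and the register file is then
// indexed dynamically.
layout(location = 0) in vec4 position;             // register 0, .xyzw
layout(location = 1, component = 0) in vec2 uv;    // register 1, .xy
layout(location = 1, component = 2) in float fog;  // register 1, .z

// Hypothetical source of a register index that is not known at compile time.
layout(push_constant) uniform Push { uint reg; } pc;

layout(location = 0) out vec4 color;

vec4 shader_in[2];

void main() {
    shader_in[0] = position;
    // Register 1 is assembled from two inputs and a system value whose
    // integer bits are stored in .w.
    shader_in[1] = vec4(uv, fog, intBitsToFloat(gl_VertexIndex));

    // A dynamically indexed read cannot be rewritten as a read of a
    // single named input.
    color = shader_in[pc.reg & 1u];
    gl_Position = position;
}
```

Once a read like the last one exists, the array has to be materialized somewhere, which is why dropping the copy only looks feasible when no dynamic indexing is present.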

Even declaring inputs with the proper size is somewhat questionable because there's no guarantee that the shader interfaces match between stages, but it was necessary in order to not exceed geometry shader output component limits. Games just seem to be sensible enough to use compatible interfaces in practice.

I'll try to figure out something, but tbh I don't expect to come up with a solution any time soon. While the dynamic indexing case could be sensibly worked around by just not declaring and using the array if things don't get dynamically indexed, system values partially covering input registers are a massive pain to deal with.

Tessellation shader interfaces are even worse, where output registers can be dynamically indexed in the TCS, so DXVK just gives up figuring anything out and declares the outputs as a vec4[32]. This is also causing issues in practice but I just don't know how to generate reasonable code in such cases.

I think I've got something of a plan to try and fix this up in our compiler. Let's see how far I can get. If I can write code to chew through it and dump out something reasonable, then it should fix the issue for both anv and radv.

Hm okay. I mean, asking DXVK not to emit silly code is reasonable in my opinion, but in this case it's just not a particularly easy thing to do.

I've got mesa patches in the works that seem to clean this up fairly nicely. If and when they land, they should fix any issues for both Intel and RADV. The series can be found here:

https://patchwork.freedesktop.org/series/47295/

Hi @jekstrand,
did the patches for radv also land in mesa, or just for anv?

@xxmitsu, No, radv does not yet enable the optimizations.

How much of an impact does this actually have on your hardware?

I discovered some patterns in DXBC shaders when accessing input arrays, which might make it possible to optimize the global input array away, and also generate more sensible code for the tessellation control -> tessellation evaluation interface, although I'm not 100% confident that this is actually the case since those access patterns are not documented.

A quick test supporting only vertex and fragment shaders did not show any performance improvement on RADV or Nvidia, but as you noted, Tessellation and Geometry shaders are more problematic.

I just ran the Batman: Arkham City (the game which inspired these optimizations) benchmark with and without them, and it's a difference of around 15% on the benchmark average. It seemed like most of the benchmark wasn't too badly affected, but there was this one scene that just crawled.

It's worth noting that how badly this affects performance is going to be highly compiler-dependent. In particular, how the compiler handles large indirectly accessed temporary arrays. The Intel compiler is admittedly pretty terrible at this so maybe the radeon LLVM back-end just does a better job.

@jekstrand, Does RADV need heavy patching for these optimizations to work?

If not, I'd be happy to try on the same benchmark to see how similar the benefits would be.

@John-Gee I am told that the reason this is an issue on Intel is that the hardware doesn't handle dynamic indexing well. If AMD hardware handles it well, then the driver optimization is unnecessary.

Timothy landed patches to enable the optimizations in RADV this week. No idea what the actual perf numbers are at the moment.

I've just tried a little benchmark of this with a 280X and a Haswell i7, before these 4 patches and after, but I saw no difference. The test was done with an empty cache, which I assume is where this would shine.

So maybe Richard is correct.

To be thorough: I was not able to get the DXVK_HUD displayed though so not sure if DXVK worked, but it did display the usual stuff in the terminal.
To get D3D11 I allowed it in 2 inis, Launcher.ini and BaseEngine.ini, then turned on the options in the game launcher, and set everything to high.
All in a 32b prefix since it requires dotnet which is annoying on a 64b one.

Thank you!
