Guide: https://software.intel.com/sites/default/files/managed/3a/93/6th-gen-graphics-api-dev-guide-1.1.pdf
Some stuff seems to be relevant to us:
Use discard (or other kill pixel operations) where output will not contribute to the
final color in the render target.
Wow! Doing this can be really bad on a lot of GPUs: it turns off early-Z and Hi-Z on at least NVIDIA and PowerVR, and I suppose AMD as well, although I need to check. I am really surprised that Intel chips don't suffer from doing things like this.
The first two items don't mesh very well with how we store primitive data in a big float texture. Hopefully it won't matter much since that's all in the vertex shader and we don't seem to be spending too much time in there. But it would be worth reconsidering this approach for some specific things like if/when we introduce tessellated geometry with lots of vertices for paths.
Wow! Doing this can be really bad on a lot of GPUs: it turns off early-Z and Hi-Z on at least NVIDIA and PowerVR, and I suppose AMD as well, although I need to check. I am really surprised that Intel chips don't suffer from doing things like this.
Yeah, I actually noticed that Intel really suffers when early Z is off. So I'm not sure this advice should be taken at face value…
Trying to answer some of your questions here:
1) This one is straightforward. We can’t hide sampler latency with other math ops if we have to wait on a sample to finish before we can continue.
2) I’m trying to pinpoint the motivation for this. There are a bunch of general reasons why SOA is better than AOS, but I’m trying to find a reason why it’s specifically better when it comes to vertex buffer input.
3) Yeah, I talked with two of my teammates, and we agreed that the advice written in the guide seems “dubious at best”. The guide also has a section on Early-Z Rejection which just recommends a depth prepass. We’re thinking it must be talking about non-depth/stencil situations, or even a test-only/write-only situation. So a depth-only prepass followed by a color-only pass with discard should probably see an improvement over one combined pass. The guide talks about avoiding PMA stalls in 4.4: “Avoid spatial overlap of geometry within a single draw when stencil operations are enabled. Pre-sort geometry to minimize overlap, commonly seen when performing functions such as foliage rendering.” It’s interesting that they don’t directly mention discard. Let me follow up on this one ...
4) I think the advice is more like “make use of it when applicable because it’s there” and because it can be up to 2x faster.
5) Similar to SOA, there are a lot of reasons power-of-two dimensions are convenient for everyone. I would guess the main ones for us are that they work well with our tiling, are divisible by the length of a cache line, and pack well for resource allocation. No wasted padding and such.
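To make points 2 and 5 a bit more concrete, here is a rough Rust sketch of the two vertex layouts and of the cache-line divisibility argument. Everything in it is illustrative: the names `VertexAos`, `VerticesSoa`, `aos_to_soa` and `row_is_cache_line_aligned` are made up for this example, the two-attribute vertex is hypothetical, and the 64-byte cache line used in the usage note is an assumption rather than a number taken from the guide.

```rust
/// Array-of-structures (AOS): one interleaved record per vertex.
/// (Hypothetical two-attribute vertex, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq)]
struct VertexAos {
    position: [f32; 2],
    uv: [f32; 2],
}

/// Structure-of-arrays (SOA): one tightly packed stream per attribute.
#[derive(Debug, Default, PartialEq)]
struct VerticesSoa {
    positions: Vec<[f32; 2]>,
    uvs: Vec<[f32; 2]>,
}

/// Split interleaved vertices into one stream per attribute,
/// preserving vertex order in every stream.
fn aos_to_soa(vertices: &[VertexAos]) -> VerticesSoa {
    let mut soa = VerticesSoa::default();
    for v in vertices {
        soa.positions.push(v.position);
        soa.uvs.push(v.uv);
    }
    soa
}

/// True when one row of texels is a whole number of cache lines.
/// `cache_line_bytes` is a caller-supplied assumption (must be non-zero),
/// not something queried from the hardware.
fn row_is_cache_line_aligned(width_texels: u32, bytes_per_texel: u32, cache_line_bytes: u32) -> bool {
    (width_texels * bytes_per_texel) % cache_line_bytes == 0
}
```

With an assumed 64-byte cache line and 4 bytes per texel, a 256-texel row is 1024 bytes, which is exactly 16 cache lines, while a 300-texel row is 1200 bytes and leaves a partial line at the end. This only illustrates the divisibility argument; it says nothing about how the driver actually tiles or pads textures.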
@kvark Do we need this open? We could perhaps open specific issues if there's work to be done off this?
Alright: