Webrender: Tips from Intel GPU guide

Created on 20 Jul 2017 · 6 Comments · Source: servo/webrender

Guide: https://software.intel.com/sites/default/files/managed/3a/93/6th-gen-graphics-api-dev-guide-1.1.pdf

Some stuff seems to be relevant to us:

  1. Avoid read hazards between sample instructions. For example, the UV coordinate of the next sample instruction is dependent upon the results of the previous sample operation.
  2. Define input geometry in structure of arrays (SOA) instead of array of structures
    (AOS) layouts for vertex buffers, for example, by providing multiple streams (one for
    each attribute) versus a single stream containing all attributes.
  3. Use discard (or other kill pixel operations) where output will not contribute to the
    final color in the render target.
  4. Use medium precision for OpenGL ES contexts, for improved performance.
  5. Textures with power-of-two dimensions will have better performance in general.
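
A minimal sketch of what item 2 means in practice, assuming a hypothetical vertex with position, UV and color attributes (these types are illustrative and are not WebRender's actual vertex format). The AoS form interleaves all attributes in one stream; the SoA form keeps one tightly packed stream per attribute, each of which would be uploaded to its own vertex buffer.

```rust
/// Array of structures (AoS): one interleaved stream holding every attribute.
/// Hypothetical layout, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Vertex {
    position: [f32; 2],
    uv: [f32; 2],
    color: [u8; 4],
}

/// Structure of arrays (SoA): one tightly packed stream per attribute.
#[derive(Debug, Default, PartialEq)]
struct VertexStreams {
    positions: Vec<[f32; 2]>,
    uvs: Vec<[f32; 2]>,
    colors: Vec<[u8; 4]>,
}

/// Split an interleaved vertex array into per-attribute streams.
fn to_streams(vertices: &[Vertex]) -> VertexStreams {
    let mut streams = VertexStreams::default();
    for v in vertices {
        streams.positions.push(v.position);
        streams.uvs.push(v.uv);
        streams.colors.push(v.color);
    }
    streams
}
```

Calling `to_streams(&vertices)` on an interleaved slice yields three vectors of equal length that can be bound as separate attribute streams.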

All 6 comments

> Use discard (or other kill pixel operations) where output will not contribute to the
> final color in the render target.

Wow! Doing this can be super bad on a lot of GPUs (it will turn off early-Z and Hi-Z on at least nvidia and powervr and I suppose AMD as well although I need to check). I am really surprised that intel chips don't suffer from doing things like this.

The first two items don't mesh very well with how we store primitive data in a big float texture. Hopefully it won't matter much since that's all in the vertex shader and we don't seem to be spending too much time in there. But it would be worth reconsidering this approach for some specific things like if/when we introduce tessellated geometry with lots of vertices for paths.

> Wow! Doing this can be super bad on a lot of GPUs (it will turn off early-Z and Hi-Z on at least nvidia and powervr and I suppose AMD as well although I need to check). I am really surprised that intel chips don't suffer from doing things like this.

Yeah, I actually noticed that Intel really suffers when early Z is off. So I'm not sure this advice should be taken at face value…

Trying to answer some of your questions here:
1) This one is straightforward. We can’t hide sampler latency with other math ops if we have to wait on a sample to finish before we can continue.
2) I’m trying to pinpoint the motivation for this. There are a bunch of general reasons why SOA is better than AOS, but I’m trying to find a reason why it’s specifically better when it comes to vertex buffer input.
3) Yeah, I talked with two of my teammates, and we agreed that the advice written in the guide seems “dubious at best”. The guide also has a section on Early-Z Rejection which just recommends a depth prepass. We’re thinking it must be talking about non-depth/stencil situations, or even a test-only/write-only situation. So a depth-only prepass followed by a color-only pass with discard should probably see an improvement over one combined pass. The guide talks about avoiding PMA stall in 4.4: “Avoid spatial overlap of geometry within a single draw when stencil operations are enabled. Pre-sort geometry to minimize overlap, commonly seen when performing functions such as foliage rendering.” It’s interesting that they don’t directly mention discard. Let me follow up on this one...
4) I think the advice is more like “make use of it when applicable because it’s there” and because it can be up to 2x faster.
5) Similar to SOA, there are a lot of reasons power-of-two dimensions are convenient for everyone. I would guess the main ones for us are that they work well with our tiling, are divisible by the length of a cache line, and pack well for resource allocation. No wasted padding and such.
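
To make the "no wasted padding" point in 5) concrete, here is a rough sketch that computes per-row padding when rows are rounded up to an alignment boundary. The 64-byte alignment and the row-based layout are assumptions chosen for illustration; actual hardware tiling is more involved, and the function names are hypothetical.

```rust
/// Bytes in one texture row after rounding up to `alignment` bytes.
/// `alignment` must be a power of two (e.g. an assumed 64-byte cache line).
fn padded_row_bytes(width_px: u32, bytes_per_pixel: u32, alignment: u32) -> u32 {
    assert!(alignment.is_power_of_two());
    let row = width_px * bytes_per_pixel;
    // Round up to the next multiple of `alignment`.
    (row + alignment - 1) & !(alignment - 1)
}

/// Padding bytes wasted per row for the given width.
fn wasted_row_bytes(width_px: u32, bytes_per_pixel: u32, alignment: u32) -> u32 {
    padded_row_bytes(width_px, bytes_per_pixel, alignment) - width_px * bytes_per_pixel
}
```

With 4 bytes per pixel and 64-byte alignment, a 256-pixel row needs no padding, while a 250-pixel row is padded from 1000 to 1024 bytes. The standard `u32::next_power_of_two` can be used to round a dimension up when allocating.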

@kvark Do we need this open? We could perhaps open specific issues if there's work to be done off this?

Alright:

  1. Let's keep this in mind when investigating shader performance bottlenecks. I don't think there is anything outstanding right now that can be addressed on this front, but yes, we do make quite a few dependent fetches, especially in vertex shaders.
  2. remains unclear, but I don't think it matters too much for us: we don't use that much vertex data anyway, mostly fetching it from textures
  3. also unclear, but is important
  4. clear, moved into #2535
  5. our atlas and GPU cache textures are power of 2, so we are good
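
For reference, the "dependent fetches" in item 1 can be pictured with a CPU-side analogy. This is only a sketch: the `table` slice stands in for a data texture, and the function names are hypothetical rather than anything in WebRender's shaders.

```rust
/// Dependent chain: each address comes from the previous fetch result,
/// so fetch N+1 cannot be issued until fetch N has returned.
fn fetch_dependent(table: &[usize], start: usize, steps: usize) -> usize {
    let mut index = start;
    for _ in 0..steps {
        index = table[index];
    }
    index
}

/// Independent fetches: all addresses are known up front, so the
/// lookups can be issued back to back and overlapped with other work.
fn fetch_independent(table: &[usize], indices: &[usize]) -> Vec<usize> {
    indices.iter().map(|&i| table[i]).collect()
}
```

In the first function the latency of every lookup is serialized; in the second, nothing prevents the lookups from being in flight at the same time, which is the situation the guide recommends aiming for.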