Gpuweb: Proposal: synchronize unordered access views at pass boundaries

Created on 5 Jun 2018 · 5 Comments · Source: gpuweb/gpuweb

TL;DR: WebGPU should preserve the order of all side effects (transfer read/writes, pixel read/writes) except for Unordered Access Views, which are only synchronized at the render/compute pass boundaries.

Introduction

Graphics hardware provides certain guarantees about the order of operations. We don't observe the actual execution order of the shaders, since they are largely executed in parallel, but we can observe their side effects, such as output colors written to the texture targets.

In a graphics pass, the only side effects that are not ordered are writes through Unordered Access Views (UAV for short; the term comes from DirectX, while GL land calls them shader storage buffer objects, SSBO). All the other side effects are ordered according to the primitive submission order (draw call order -> instance order -> primitive order):

  • RTV and DSV during rasterization (pixel, depth, and stencil writes)

    • note: AMD_rasterization_order allows removing the ordering guarantees from raster operations in exchange for a performance gain of about 5%. We should be able to figure out internally whether doing so introduces any (additional) data races, without exposing it to the user.

  • stream output (i.e. transform feedback)
  • Ordered Access Views (OAV)
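
The ordering rule above can be pictured as a lexicographic key. The following is only a minimal sketch, not how any driver or hardware implements it: the names `Write`, `resolve_ordered`, and `resolve_unordered` are hypothetical, and the model assumes blend-less writes to a single pixel. It shows that an ordered target ends up with the write that is last in submission order no matter when the shader invocations finish, while a UAV-like target simply keeps whichever write arrived last.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    """One side effect on a single location (hypothetical model)."""
    draw: int
    instance: int
    primitive: int
    value: str


def submission_key(w):
    # draw call order -> instance order -> primitive order
    return (w.draw, w.instance, w.primitive)


def resolve_ordered(writes):
    """Ordered target: the result is the last write in submission order."""
    return max(writes, key=submission_key).value


def resolve_unordered(writes):
    """UAV-like target: the result is whichever write happened to land last."""
    return writes[-1].value


# Writes listed in the order the invocations happened to finish.
arrival = [
    Write(draw=1, instance=0, primitive=0, value="c"),
    Write(draw=0, instance=1, primitive=3, value="b"),
    Write(draw=0, instance=0, primitive=5, value="a"),
]

ordered_result = resolve_ordered(arrival)
unordered_result = resolve_unordered(arrival)
```

Shuffling `arrival` changes `unordered_result` but never `ordered_result`, which is the property the list above describes.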

In a compute pass, UAV is the only way to get something out, so there is nothing left to be ordered. In transfer operations, read-write and write-write hazards are possible, and pipeline barriers are required to serialize those.

UAV

The mechanics of UAVs in D3D12 and Vulkan are such that the user is expected to place memory barriers if they want to serialize the side effects. In Metal, UAVs are forcibly serialized at dispatch granularity in compute passes, and at render pass granularity otherwise.

Data races are at the core of UAVs, and that is what allows the hardware to read and write them efficiently (no need to synchronize or serialize access). Therefore, for performance and efficiency reasons, we don't believe that it's worth trying to enforce synchronization at a finer level than the draw calls. The cost of draw call-level synchronization is also expected to be unacceptably high for render operations, since a tiling GPU would have to flush the whole tile before proceeding after such a barrier. For this reason, Vulkan supports only a very limited set of pipeline barriers inside render passes.

Proposal

Document UAVs as a special kind of resource view that has a wide synchronization scope - the render/compute pass boundary. Any dependent operations within this scope are then considered non-portable, although it would be hard (if possible at all) for an implementation to detect those and warn appropriately.

Note that the proposal is based on the constraint that, during a pass, each resource must be in either a single writable state or a combination of readable states. This automatically prevents a situation where the user would want to write to a UAV and then re-bind it as an SRV/CBV within a pass.
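
To make the constraint concrete, here is a small sketch of how an implementation might validate it. Everything in it is an assumption for illustration: the usage names (`"storage"` standing in for UAV, `"sampled"` for SRV, `"uniform"` for CBV) and the `PassUsageTracker` class are hypothetical and not part of any proposed API.

```python
# Hypothetical usage names; "storage" plays the role of a UAV.
READ_ONLY = {"vertex", "uniform", "sampled"}
WRITABLE = {"storage", "render_target"}


def is_valid_pass_usage(usages):
    """A resource may have one writable usage, or any mix of readable ones."""
    usages = set(usages)
    if usages & WRITABLE:
        return len(usages) == 1
    return True


class PassUsageTracker:
    """Accumulates the usages of each resource within one pass."""

    def __init__(self):
        self._usages = {}

    def use(self, resource, usage):
        combined = self._usages.get(resource, set()) | {usage}
        if not is_valid_pass_usage(combined):
            raise ValueError(
                f"{resource}: {sorted(combined)} mixes a writable usage "
                "with another usage inside one pass"
            )
        self._usages[resource] = combined


tracker = PassUsageTracker()
tracker.use("particles", "storage")      # write through a UAV
try:
    tracker.use("particles", "uniform")  # re-bind as a CBV in the same pass
    rebind_rejected = False
except ValueError:
    rebind_rejected = True
```

Under this rule the write-then-read pattern is rejected at recording time, so the user has to end the pass first, which is exactly where the proposal places the synchronization.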

For transfer operations, the API knows precisely which resources and ranges are affected, since those are explicitly provided by the user for copy/blit calls. Therefore, an implementation can figure out the possible hazards and insert appropriate barriers automatically. It doesn't have to be smart at first: it can start conservatively and optimize later by removing the barriers it considers unnecessary. For this reason, grouping operations into a "transfer/copy pass" does not appear to add much value, and we think the group should reconsider having those passes in the API.
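
A rough sketch of the automatic approach described above, under simplifying assumptions: resources are identified by name, ranges are one-dimensional byte ranges, and a single `"barrier"` token stands in for whatever the backend would actually emit. The types `Range` and `Copy` and the function `insert_barriers` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    resource: str
    offset: int
    size: int

    def overlaps(self, other):
        return (
            self.resource == other.resource
            and self.offset < other.offset + other.size
            and other.offset < self.offset + self.size
        )


@dataclass(frozen=True)
class Copy:
    src: Range
    dst: Range


def needs_barrier(copy, pending):
    """True if `copy` conflicts with any copy issued since the last barrier."""
    for prev in pending:
        if (
            copy.src.overlaps(prev.dst)     # read after write
            or copy.dst.overlaps(prev.dst)  # write after write
            or copy.dst.overlaps(prev.src)  # write after read
        ):
            return True
    return False


def insert_barriers(copies):
    """Return the command list with a barrier before each conflicting copy."""
    commands, pending = [], []
    for copy in copies:
        if needs_barrier(copy, pending):
            commands.append("barrier")
            pending = []  # everything before the barrier is now serialized
        commands.append(copy)
        pending.append(copy)
    return commands


a = Copy(Range("A", 0, 64), Range("B", 0, 64))
c = Copy(Range("A", 64, 64), Range("B", 64, 64))  # disjoint from `a`
b = Copy(Range("B", 32, 16), Range("C", 0, 16))   # reads what `a` wrote

commands = insert_barriers([a, c, b])
```

Here `a` and `c` touch disjoint ranges and run without a barrier, while `b` reads bytes written by `a` and gets one in front of it. A real implementation would also track image subresources and could be smarter about which pending copies a barrier retires.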

Issues

Why not insert automatic barriers between compute dispatches like in Metal?

Mainly because it's not consistent with render passes. If compute UAV side effects are synchronized at the dispatch boundary, users will look for ways to avoid that synchronization point in cases where their use of a UAV is guaranteed to be portable at a logical level that is not visible to the WebGPU implementation. One such way is building mega-shaders that do many operations at once, which is counter to what they'd do in Vulkan/D3D12 and not productive (working around the API instead of benefiting from it). If UAVs are synchronized at the pass boundaries, users always have the option to end a pass and start a new one when they need to depend on previous writes.

In Metal, automatic barriers made more sense because there is no constraint on resource usage being static across a pass. If we do this in WebGPU, then we'd need to reconsider the static usage constraints, and it would hurt optimal performance on Vulkan and D3D12 backends.

Labels: investigation, proposal, question

All 5 comments

This is very similar to our point of view. The only difference is that we think maybe we should provide a barrier operation in compute shaders instead of stopping/starting a pass, but it is conceptually very similar.

The part about not having a separate copy pass is a slightly orthogonal proposal (even if we agree with it.)

While we might eventually need barrier operations for compute passes, I think we can get away with excluding them for the moment, so that we can wait until later to discuss adding them and what they might look like.

> stream output (i.e. transform feedback)

Transform feedback is obsolete and can be easily emulated with UAVs and atomics. Moreover, it suffers from serious limitations such as:

  1. The number of components you can output
  2. The layout of the components you output
  3. Alignment of the components you output
  4. You cannot output to the same buffer you're reading from, even if it's a different sub-range that you have bound

@devshgraphicsprogramming right, we aren't even considering adding transform feedback at this point. I just brought it up to complete the picture.

On another note, you may be confusing synchronisation operations with memory barrier operations, as well as their scopes.

> This is very similar to our point of view. The only difference is that we think maybe we should provide a barrier operation in compute shaders instead of stopping/starting a pass, but it is conceptually very similar.

An execution dependency does not ensure a memory dependency, although a memory dependency needs an execution dependency in order to be ensured.
