Gpuweb: Multi-Queue Investigation

Created on 11 Sep 2020 · 5 Comments · Source: gpuweb/gpuweb

Intro

(edits: list of queues specified at creation)

Using multiple queues in a low-level API is a good way to make sure the compute units are always busy with useful work. The most popular use case is "async compute", where, in addition to the main queue, there are 1-2 compute-only queues crunching through data, some of which may be needed on the main queue.

Some links:

https://docs.microsoft.com/en-us/windows/win32/direct3d12/user-mode-heap-synchronization
https://gpuopen.com/learn/concurrent-execution-asynchronous-queues/
https://gpuopen.com/presentations/2019/Vulkanised2019_06_optimising_aaa_vulkan_title_on_desktop.pdf

Using multiple queues is not mandatory to get the job done; it's purely an optimization that allows more efficient use of the hardware. However, it's important for WebGPU to get this right, as it may affect the synchronization design in general. We want to at least be sure that multi-queue support can be added without changing the API.

Therefore, this investigation is focused on the synchronization aspect, and not the surrounding logic of queue discovery and utilization. Related to #478.

Vulkan

The available queues are discovered via the physical device, and need to be requested at logical device creation. Vulkan exposes multiple families of queues, each having a set of capabilities (like the ability to do compute, graphics, or transfer operations), and one or more logical queues.

If a resource can be used by a queue family, it can be used by any of the queues in this family, without any more explicit synchronization than just regular semaphores.

As for using a resource on different queue families, Vulkan has the sharing mode, which has to be specified at resource creation.

Exclusive:
Only one queue family can access that resource at any given time.

In order to use it on a different queue family, a "transfer" operation needs to be encoded in command streams on both queues:

- The old queue needs a "release" pipeline barrier, but only if the contents of the resource need to be preserved. If the resource is cleared right away on the new queue, this barrier can be omitted.
- The new queue needs an "acquire" pipeline barrier.
- Submissions for these commands have to be synchronized by a semaphore.
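
To make the three steps above concrete, here is a minimal toy model in Python. It is not Vulkan code: the class ExclusiveResource, its fields, and the method names are all hypothetical and exist only to illustrate the rules (the release is optional but skipping it loses the contents, the acquire is required, and the two must be ordered by a semaphore).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExclusiveResource:
    """Hypothetical stand-in for a resource created with exclusive sharing."""
    owner: int                          # queue family that currently owns it
    contents_valid: bool = True         # are the previous contents still defined?
    released_to: Optional[int] = None   # family named by a pending "release"

    def release(self, queue_family: int, dst_family: int) -> None:
        # "Release" barrier recorded on the old queue.
        if queue_family != self.owner:
            raise ValueError("only the owning queue family can release")
        self.released_to = dst_family

    def acquire(self, dst_family: int, waited_on_semaphore: bool) -> None:
        # "Acquire" barrier recorded on the new queue.
        if self.released_to is not None and not waited_on_semaphore:
            raise ValueError("acquire must be ordered after the release by a semaphore")
        if self.released_to != dst_family:
            # No matching release: in this model the old contents are lost,
            # which is fine if the new queue clears the resource right away.
            self.contents_valid = False
        self.owner = dst_family
        self.released_to = None


# Full transfer: contents are preserved.
texture = ExclusiveResource(owner=0)
texture.release(queue_family=0, dst_family=1)
texture.acquire(dst_family=1, waited_on_semaphore=True)

# Release omitted: ownership moves, but the contents become undefined.
scratch = ExclusiveResource(owner=0)
scratch.acquire(dst_family=1, waited_on_semaphore=True)
```

In this sketch, texture ends up owned by family 1 with valid contents, while scratch ends up owned by family 1 with undefined contents.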

Concurrent:
Any queue family can access the resource. A resource has to specify, at creation, the list of queue families that will be able to access it.

In addition to making the "transfer" semantics implicit, it also unlocks a case where a resource is used (for reading) simultaneously on multiple queue families.

Having the concurrent sharing mode comes with performance implications: drivers have to disable color compression for textures, for example.

D3D12

A device can spawn queues, as many as needed. Each resource can be either mutably accessed on a single queue, or simultaneously accessed for reading on multiple queues, at a given time. Queues can be synchronized with each other with fences (which are analogous to Vulkan semaphores, but more powerful). This, so far, looks like the "concurrent" mode of Vulkan.
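
The "one writer on a single queue, or many readers on multiple queues" rule can be sketched as a small check. This is a toy Python model, not D3D12 code: the function access_allowed and the queue names are made up for illustration.

```python
def access_allowed(active, queue, write):
    """Return True if `queue` may start using a resource right now.

    active: list of (queue_name, is_write) pairs for accesses in flight.
    Hypothetical helper; it only encodes the single-writer / multi-reader rule.
    """
    others = [(q, w) for q, w in active if q != queue]
    if write:
        # A writer needs the resource to itself.
        return not others
    # A reader may overlap with other readers, but not with a writer.
    return all(not w for _, w in others)


# Two queues reading at once is fine; mixing in a writer is not.
both_read = access_allowed([("graphics", False)], "compute", write=False)
read_during_write = access_allowed([("graphics", True)], "compute", write=False)
```

Here both_read is True and read_during_write is False.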

Copy "engines" (which is D3D12's second name for queues) are defined as a separate "class". So resource states COPY_DEST and COPY_SOURCE aren't observed by all queues, but instead considered separate by the copy and non-copy queues. We can see it as a need to do the "ownership transition" (like with Vulkan's exclusive sharing mode). However, in D3D12 it's not necessary to do a "release" transition, given the implicit state decay rules (if I understand correctly), thus it's simpler to implement (but not optional, like in Vulkan).

Metal

(I know least about this one, section is to be edited!)

In Metal-1, it was possible to create many queues, but there was no way to synchronize access between them. Different queues were meant to do work that is totally independent.

In later Metal (MTLEvent is available from macOS 10.14 and iOS 12), MTLEvent was added, and it can synchronize between queues of the same device (just like VkSemaphore or ID3D12Fence).

I wasn't able to find concrete information on whether it's valid to use the same resource on multiple queues simultaneously, and under which conditions.

Labels: investigation, multi-queue

kvark on 11 Sep 2020

Most helpful comment

Here are some details about how multi-queue in Metal works:

From an API perspective, there are no internal layouts of Metal resources. If one queue wants to use a texture as a copy source and another one wants to sample from it in a shader, they are free to do that at the same time. Reads are reads, and writes are writes; from an API perspective, that's as far as the distinction goes.

Metal does automatic hazard tracking for an entire device, and it considers submissions from all queues when performing this tracking. However, before this hazard tracking occurs, queue submits travel through an internal worker thread, and there's one thread for each MTLCommandQueue.

For example, if you have a single-threaded application and you submit to two distinct queues, and the submissions are mutually hazardous, Metal will guarantee that one will execute before the other, but not guarantee which one executes first. In this example, the two submissions go to two worker threads, which race with each other, but the work items will be serialized in the kernel, which will realize that the two submissions are mutually hazardous, and will enforce barriers between whichever one it happened to receive first and whichever one it happened to receive second.

In the same example, if the two submissions are not hazardous, they are free to execute on the GPU concurrently. Indeed, even if the two submissions occur on the same MTLCommandQueue, if they are not hazardous, they are free to execute on the GPU concurrently.
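
As a rough illustration of the behaviour described above, here is a toy Python model of the serialization step. It is only a sketch, not how Metal is implemented: Submission, hazardous, and dependencies are hypothetical names, and a hazard is assumed to mean that one submission writes a resource the other reads or writes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    name: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def hazardous(a, b):
    # Assumed hazard rule: a write on one side touching a read or write on the other.
    return bool(a.writes & (b.reads | b.writes) or b.writes & a.reads)


def dependencies(arrival_order):
    """Map each submission to the earlier-arrived submissions it must wait for."""
    deps = {}
    for i, sub in enumerate(arrival_order):
        deps[sub.name] = [
            earlier.name for earlier in arrival_order[:i] if hazardous(earlier, sub)
        ]
    return deps


a = Submission("A", writes=frozenset({"texture"}))
b = Submission("B", reads=frozenset({"texture"}))
c = Submission("C", reads=frozenset({"buffer"}))

# The two worker threads race, so either arrival order is possible.
a_first = dependencies([a, b, c])
b_first = dependencies([b, a, c])
```

With A arriving first, B waits on A; with B arriving first, A waits on B. C is not hazardous with either and never waits, so it is free to run concurrently.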

One way that authors can enforce ordering between their submissions is to add a scheduledHandler to their command buffers. This will be called after the kernel "sees" the submission and tracks its resources' usage. Authors can then use this callback to commit hazardous work on another queue that will be guaranteed to execute after the first submission.

Another way authors can enforce ordering between their submissions is to use untracked resources (this includes resources in untracked heaps) and MTLEvents (_not_ MTLFences). Untracked resources opt out of the hazard-tracking machinery described above. MTLEvents are more powerful than the automatic resource tracking in that you can make the device wait on something that your program hasn't even gotten around to start to think about signaling yet. Therefore, with great power comes great responsibility: you can deadlock the device pretty easily (though we'll gracefully time out and mark the command buffer as having an error).
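
The deadlock risk can be seen in a toy Python simulation. This is a sketch under assumptions, not Metal code: the function run and the tuple layout are hypothetical, there is a single event whose value only grows, and a wait is assumed to be satisfied once the event's value is greater than or equal to the awaited value.

```python
def run(command_buffers):
    """Simulate command buffers that wait on and signal one shared event.

    command_buffers: list of (name, wait_value, signal_value); None means
    "no wait" or "no signal". Returns (completed names in order, stuck names).
    """
    event_value = 0
    pending = list(command_buffers)
    completed = []
    progressed = True
    while pending and progressed:
        progressed = False
        for cb in list(pending):
            name, wait_value, signal_value = cb
            if wait_value is None or event_value >= wait_value:
                if signal_value is not None:
                    event_value = max(event_value, signal_value)
                completed.append(name)
                pending.remove(cb)
                progressed = True
    # Anything left can never run; in this model it would eventually time out.
    return completed, [name for name, _, _ in pending]


# Waiting on a value that is signaled later is fine: the consumer just waits.
ok = run([("consumer", 1, None), ("producer", None, 1)])

# Two buffers that each wait for the other's signal never make progress.
stuck = run([("a", 1, 2), ("b", 2, 1)])
```

In the first case the producer runs, then the consumer. In the second, nothing completes and both command buffers are reported as stuck.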

(Aside: The last way authors can enforce ordering between their submissions is to use a single queue. You don't get async compute by using multiple queues in Metal; you get async compute automatically by default. The major reason why multiple queues exist in Metal is because MTLCommandBuffer.commit() isn't threadsafe in regard to a single queue. If there were only one queue, an application that wants to record command buffers on multiple threads would have to serialize its commit() calls itself. With multiple queues, each CPU thread can get its own queue, and the commit() calls will be safe.)

litherum on 13 Oct 2020

👍3

All 5 comments

> Having the concurrent sharing mode comes with performance implications: drivers have to disable color compression for textures, for example.

Just to add to that, here's the code that requires the image to be exclusive to do "fast clear" using color compression in radv.

> Each resource can be either mutably accessed on a single queue, or simultaneously accessed for reading on multiple queues, at a given time.

That's true for buffers, but requires the D3D12_RESOURCE_FLAG_ALLOW_SIMULTANEOUS_ACCESS for textures (and isn't allowed for multisampled or depth/stencil textures).

Kangz on 11 Sep 2020

Question from reading the proposals. They both tie a command buffer to a specific queue, but iiuc (at least in Vulkan) they only have to be tied to a queue family. Is there value in generalizing to families? E.g. applications could decide late which queue in a family to use?

kainino0x on 11 Sep 2020

More explanation on multi-queue to help facilitate discussion:
https://github.com/gpuweb/gpuweb/wiki/The-Multi-Explainer#multi-queue

kainino0x on 1 Oct 2020


I didn't have time to finish figuring this out, but here's a thought from chat:

It sounds like, without untracked resources, Metal queues bear no relation to Vulkan and D3D12 queues and are purely a CPU-multithreading primitive?

Also trying to figure out whether untracked resources are identical to tracked resources if you're using only one queue[, or if you need to synchronize untracked resources even on the same queue.]

kainino0x on 14 Oct 2020
