Wgpu: Every 676 frames, Queue::write_buffer takes ~25ms

Created on 2 Mar 2021 · 23 Comments · Source: gfx-rs/wgpu

Description
Every 676 frames (varies by vsync / workload, but is exact within one set of settings), one Queue::write_buffer or Device::create_buffer call takes ~25ms. The submit call on the next frame also takes longer than usual, generally about 16ms.

See also: https://github.com/gfx-rs/wgpu-rs/issues/363, since it appears to be a similar issue with a previous allocator.

Repro steps
I don't have a minimal example and the code that I am experiencing this in is not public, but here's the general overview for reproduction.

My application allocates most of its memory at startup and only rarely creates new buffers. There are about 10 write_buffer calls per frame, each with a reasonably small buffer; on the spike frame, a random one of these calls takes 25ms. The application allocates a large amount of memory, which could be causing this: one or two 128MiB vertex buffers, suballocated internally to reduce vertex buffer swapping. There are not many individual buffers, so the buffer count shouldn't be the cause of the issue.
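For readers unfamiliar with suballocation, the idea of carving ranges out of one large buffer can be sketched as below. This is a minimal illustration and not the reporter's code (which is not public): `SubAllocator` is a hypothetical name, it only tracks byte offsets, and it never frees anything.

```rust
/// Hypothetical bump suballocator over one large GPU buffer.
/// It only tracks byte offsets; the actual wgpu buffer is not modelled here.
pub struct SubAllocator {
    capacity: u64,
    cursor: u64,
}

impl SubAllocator {
    pub fn new(capacity: u64) -> Self {
        Self { capacity, cursor: 0 }
    }

    /// Reserve `size` bytes aligned to `align` (must be a power of two).
    /// Returns the byte offset of the reserved range, or `None` if it does not fit.
    pub fn alloc(&mut self, size: u64, align: u64) -> Option<u64> {
        assert!(align.is_power_of_two());
        // Round the cursor up to the next multiple of `align`.
        let start = (self.cursor + align - 1) & !(align - 1);
        let end = start.checked_add(size)?;
        if end > self.capacity {
            return None;
        }
        self.cursor = end;
        Some(start)
    }
}
```

The returned offset would then be used as the destination offset when writing into the large buffer, so that many meshes share one vertex buffer binding.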

To find the spikes, I have been using tracing and tracing-tracy; this integrates with wgpu's tracing setup, so it can show some more detail.

Expected vs observed behavior
I would expect the frame time to be consistent, with no random or regular spikes. Instead, once every n frames, a single Queue::write_buffer call takes 25ms, and the submit for the frame after that (not the submit immediately after the write, but the second submit after the write) takes ~10ms.

The n here varies based on whether vsync is enabled and on the general frame time, but is very consistent. The number of frames between spikes is exactly the same every time, while the wall-clock time between them may vary. Running a more minimal render and increasing the framerate as much as possible makes this more visible.
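One way to check the "same number of frames between spikes" observation from recorded frame times is sketched below. The function names and the threshold are assumptions for illustration; the reporter used Tracy for this rather than code like this.

```rust
/// Return the indices of frames whose duration exceeds `threshold_ms`.
pub fn spike_frames(frame_times_ms: &[f64], threshold_ms: f64) -> Vec<usize> {
    frame_times_ms
        .iter()
        .enumerate()
        .filter(|&(_, &t)| t > threshold_ms)
        .map(|(i, _)| i)
        .collect()
}

/// If every gap between consecutive spikes is identical, return that gap.
/// Returns `None` for fewer than two spikes or for irregular gaps.
pub fn constant_spike_interval(spikes: &[usize]) -> Option<usize> {
    let mut gaps = spikes.windows(2).map(|w| w[1] - w[0]);
    let first = gaps.next()?;
    if gaps.all(|g| g == first) {
        Some(first)
    } else {
        None
    }
}
```

A constant interval measured in frames, rather than in seconds, points at something driven by per-frame work (such as a fixed amount of data uploaded each frame) rather than by a timer.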

Extra materials
These images use the Tracy profiler, watching tracing's output at the trace level for all crates. I added a patch to gpu-alloc to add more tracing information - most of the time is spent in the backend allocate memory function, which I believe is provided in gfx-hal in this case.

[Image: Tracy inspection of a two-frame spike]
[Image: comparison to nearby normal frames]

Platform
GPU: GTX 970
OS: Manjaro Linux
Backend used: Vulkan
wgpu version: 0.7.0

Labels: performance, help wanted, bug

Aeledfyr

❤1

Most helpful comment

As a user of Rend3, I'm one level up from this problem, but I definitely see it, and it has an impact. I'm writing a viewer for a virtual world, and load gigabytes of content into the GPU. One thread is just a refresh loop. Another thread is loading content. Frame rates look like this:

00217 frames over 01.00s. Min: 03.16ms; Average: 04.63ms; 95%: 06.15ms; 99%: 09.88ms; Max: 105.19ms; StdDev: 06.91ms
00189 frames over 01.00s. Min: 03.42ms; Average: 05.30ms; 95%: 06.06ms; 99%: 105.38ms; Max: 105.38ms; StdDev: 10.41ms
00231 frames over 01.00s. Min: 03.38ms; Average: 04.33ms; 95%: 05.06ms; 99%: 06.32ms; Max: 108.59ms; StdDev: 06.90ms
00152 frames over 01.01s. Min: 03.41ms; Average: 06.66ms; 95%: 18.86ms; 99%: 119.01ms; Max: 119.01ms; StdDev: 12.81ms
00135 frames over 01.00s. Min: 03.39ms; Average: 07.43ms; 95%: 18.30ms; 99%: 107.28ms; Max: 107.28ms; StdDev: 10.04ms
00235 frames over 01.00s. Min: 03.38ms; Average: 04.27ms; 95%: 04.94ms; 99%: 07.67ms; Max: 107.22ms; StdDev: 06.76ms
00187 frames over 01.00s. Min: 03.04ms; Average: 05.36ms; 95%: 06.80ms; 99%: 111.21ms; Max: 111.21ms; StdDev: 10.92ms
00226 frames over 01.00s. Min: 03.33ms; Average: 04.43ms; 95%: 05.55ms; 99%: 11.14ms; Max: 108.14ms; StdDev: 06.97ms
00235 frames over 01.00s. Min: 03.35ms; Average: 04.27ms; 95%: 05.42ms; 99%: 07.64ms; Max: 110.49ms; StdDev: 06.97ms
00225 frames over 01.00s. Min: 03.43ms; Average: 04.46ms; 95%: 05.37ms; 99%: 12.77ms; Max: 111.15ms; StdDev: 07.19ms
00195 frames over 01.00s. Min: 03.18ms; Average: 05.14ms; 95%: 05.54ms; 99%: 113.05ms; Max: 113.05ms; StdDev: 10.73ms
00217 frames over 01.00s. Min: 03.29ms; Average: 04.62ms; 95%: 05.21ms; 99%: 17.83ms; Max: 104.99ms; StdDev: 06.91ms

Notice the stalls. Average around 5ms, 95% of frames around 5ms, max around 100ms! Those huge stalls are a big drag on the user experience.
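A report in the format above could be produced by something like the following. This is a sketch, not the commenter's code: `FrameStats` and `frame_stats` are hypothetical names, percentiles use the nearest-rank method, and the standard deviation is the population form; the original tool may compute these differently.

```rust
/// Summary of one reporting window of frame times, in milliseconds.
#[derive(Debug, Clone, PartialEq)]
pub struct FrameStats {
    pub min: f64,
    pub average: f64,
    pub p95: f64,
    pub p99: f64,
    pub max: f64,
    pub std_dev: f64,
}

/// Compute summary statistics; returns `None` for an empty window.
pub fn frame_stats(frame_times_ms: &[f64]) -> Option<FrameStats> {
    if frame_times_ms.is_empty() {
        return None;
    }
    let mut sorted = frame_times_ms.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let average = sorted.iter().sum::<f64>() / n as f64;
    let variance = sorted.iter().map(|t| (t - average).powi(2)).sum::<f64>() / n as f64;
    // Nearest-rank percentile: rank = ceil(p / 100 * n), done in integer math.
    let pct = |p: usize| sorted[((p * n + 99) / 100).clamp(1, n) - 1];
    Some(FrameStats {
        min: sorted[0],
        average,
        p95: pct(95),
        p99: pct(99),
        max: sorted[n - 1],
        std_dev: variance.sqrt(),
    })
}
```

With 99 frames at 5ms and a single 105ms frame, the average is only 6ms and both percentiles stay at 5ms, while the maximum and the standard deviation expose the stall. That is the same pattern as in the numbers above.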

(Plenty of CPU time available; 6 cores and under 25% total CPU utilization.)

John-Nagle on 30 Mar 2021

👍2

All 23 comments

I have also tested this on a Linux machine with an AMD gpu with similar results (the allocations take 9ms rather than 25ms), so this is most likely not a driver/gpu issue.

I have also created an example case from the wgpu-rs cube example: https://github.com/Aeledfyr/wgpu-example. The spikes are rarer, but they consistently occur about 20-30s apart on my machine. (I disabled vsync to make the spikes visible; otherwise finding them would be a pain.)

The modifications to the example are:
Allocate a large (128MiB) buffer marked as VERTEX and COPY_DST.
In the render function, call queue.write_buffer to overwrite the vertex and index buffers 10 times.
The spikes do not appear to occur if the large allocation was not performed.

Aeledfyr on 2 Mar 2021

Thank you for filing this beautiful issue!
You found that allocate_memory is the problem. Good news is - we don't expect this to be called at all in a use case where the uploads are done regularly, unless the amount of uploads exceeded some threshold. So we'll need to debug (gpu-alloc in particular) and see why exactly we ended up allocating new memory in these spikes.

kvark on 2 Mar 2021

❤2

I may have this problem, but am not sure yet. I'm getting brief stalls, as long as 160ms, from a Rust program atop Rend3 atop wgpu. One thread is running the refresh loop, which does little else. Another thread is loading content, allocating GPU memory, and adding textures, materials, and objects via Rend3. All this is in Rust. On a complex scene, the normal frame rate is around 200 FPS, but every 1-2 seconds there's a stutter, with one frame taking far too long. This only happens during content loading; once all content is loaded, there is no more stuttering. Loading larger vertex buffers (64K vertices) seems to make it worse.

So the symptom is the same, but I have not done any profiling to confirm the cause.

(6 CPUs, 12 hardware threads, AMD Ryzen 5, NVidia 3070 8GB, 32 GB of RAM, Ubuntu 20.04 LTS)

John-Nagle on 22 Mar 2021

I also reproduce this with Vulkan and WebGL backends.

VincentFTS on 22 Mar 2021

So we now have only two allocate_memory calls, but the second is still annoying. It is due to the current Linear Allocator algorithm.
I suggested allocating two chunks directly, but that was rejected by @zakarumych.
@kvark do you have any suggestions?

VincentFTS on 30 Mar 2021

I left that issue open so I don't forget to think about it.
Maybe one memory object could be treated as a chunk pair, reusing the first half, if it's free, when the second one is exhausted.
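The chunk-pair idea can be sketched as follows. This is one reading of the proposal, not gpu-alloc's implementation: `ChunkPair`, `Half`, and `Block` are hypothetical names, alignment is ignored, and only offsets are tracked.

```rust
/// One half of a memory object: a bump cursor plus a count of live blocks.
#[derive(Default)]
struct Half {
    cursor: u64,
    live: u32,
}

/// A single memory object of `2 * half_size` bytes, used as two linear halves.
pub struct ChunkPair {
    half_size: u64,
    halves: [Half; 2],
    active: usize,
}

/// Handle returned by `alloc`; `half` tells `dealloc` which counter to decrement.
#[derive(Debug, PartialEq)]
pub struct Block {
    pub offset: u64,
    pub size: u64,
    half: usize,
}

impl ChunkPair {
    pub fn new(half_size: u64) -> Self {
        Self { half_size, halves: [Half::default(), Half::default()], active: 0 }
    }

    /// Bump-allocate from the active half. When it is exhausted, reuse whichever
    /// half has no live blocks; `None` means new device memory would be needed.
    pub fn alloc(&mut self, size: u64) -> Option<Block> {
        if size > self.half_size {
            return None;
        }
        if self.halves[self.active].cursor + size > self.half_size {
            let other = 1 - self.active;
            if self.halves[self.active].live == 0 {
                self.halves[self.active].cursor = 0;
            } else if self.halves[other].live == 0 {
                self.halves[other].cursor = 0;
                self.active = other;
            } else {
                return None;
            }
        }
        let half = self.active;
        let h = &mut self.halves[half];
        let offset = half as u64 * self.half_size + h.cursor;
        h.cursor += size;
        h.live += 1;
        Some(Block { offset, size, half })
    }

    /// Release a block; a half becomes reusable once all of its blocks are released.
    pub fn dealloc(&mut self, block: Block) {
        self.halves[block.half].live -= 1;
    }
}
```

As long as blocks are released within roughly one "half" of allocation traffic, the allocator keeps ping-ponging between the two halves and never has to ask the driver for memory again.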

zakarumych on 30 Mar 2021

Do you think it would be too complex to keep track of deallocated regions inside a chunk so they can be reused directly?

VincentFTS on 30 Mar 2021

Maybe I could keep a sorted list of free regions, find a suitable region and cut it on allocation, and merge on deallocation.
It would fragment easily, but the user promises to deallocate all blocks shortly, so fragmentation should not be an issue.
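A sorted free list with cut-on-allocate and merge-on-deallocate could look roughly like this. It is a sketch of the general technique under simplifying assumptions (first-fit, no alignment, offsets only); `FreeList` is a hypothetical type and is not the FreeListAllocator in gpu-alloc.

```rust
/// Free regions as `(offset, size)`, kept sorted by offset and never adjacent.
pub struct FreeList {
    regions: Vec<(u64, u64)>,
}

impl FreeList {
    /// Start with one free region covering the whole chunk.
    pub fn new(size: u64) -> Self {
        Self { regions: vec![(0, size)] }
    }

    /// First-fit: cut `size` bytes from the front of the first region that fits.
    pub fn alloc(&mut self, size: u64) -> Option<u64> {
        let i = self.regions.iter().position(|&(_, len)| len >= size)?;
        let (offset, len) = self.regions[i];
        if len == size {
            self.regions.remove(i);
        } else {
            self.regions[i] = (offset + size, len - size);
        }
        Some(offset)
    }

    /// Return a block to the list and merge it with adjacent free neighbours.
    pub fn dealloc(&mut self, offset: u64, size: u64) {
        let i = self.regions.partition_point(|&(o, _)| o < offset);
        self.regions.insert(i, (offset, size));
        // Merge with the following region if the two touch.
        if i + 1 < self.regions.len()
            && self.regions[i].0 + self.regions[i].1 == self.regions[i + 1].0
        {
            self.regions[i].1 += self.regions[i + 1].1;
            self.regions.remove(i + 1);
        }
        // Merge with the preceding region if the two touch.
        if i > 0 && self.regions[i - 1].0 + self.regions[i - 1].1 == self.regions[i].0 {
            self.regions[i - 1].1 += self.regions[i].1;
            self.regions.remove(i);
        }
    }

    /// Current free regions, for inspection.
    pub fn regions(&self) -> &[(u64, u64)] {
        &self.regions
    }
}
```

Once every block has been returned, the list collapses back to a single region covering the whole chunk, which is why short-lived allocations keep fragmentation in check.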

zakarumych on 30 Mar 2021

user promises to deallocate all blocks shortly

What do you mean?
Is it a requirement?

VincentFTS on 30 Mar 2021

There's the gpu_alloc::UsageFlags::TRANSIENT flag. It can be set as a hint that this allocation is short-lived.
wgpu uses it for particular types of allocations, for example staging buffers for uploads, which is exactly the case of Queue::write_buffer.
gpu-alloc uses LinearAllocator only if the allocation request contains this flag.
For long-lived allocations another allocator is used, which avoids fragmentation and can reuse individual allocated blocks, but has a bit of memory overhead.

zakarumych on 30 Mar 2021

Good, so there's no need to track freed regions; your proposal of splitting into a chunk pair seems great!

VincentFTS on 30 Mar 2021

I added an experimental allocation strategy that can be enabled with the "freelist" feature in version 0.4.2.
Currently it replaces LinearAllocator with FreeListAllocator, which can reuse individual memory regions and merge them.
Without adding anything to the config, it'll just keep at least 2*linear_chunk of memory preallocated.
And if memory consumption is low, only one chunk of size linear_chunk will be allocated.

zakarumych on 30 Mar 2021

Regarding loading gigabytes of data to the GPU: that requires either allocator configuration to keep more memory preallocated, or some sophisticated guessing about what memory could be required again soon.

zakarumych on 30 Mar 2021

@kvark will wgpu use this freelist feature?

VincentFTS on 31 Mar 2021

I'm still expecting this to be fully abstracted away by gpu-alloc. If we were to start manually keeping memory chunks, we'd basically start re-implementing gpu-alloc internally.

kvark on 31 Mar 2021

If I understand well what @zakarumych said, it's just a feature to add, nothing more.

VincentFTS on 31 Mar 2021

Oh, ok. We'd use whatever gpu-alloc provides, of course.

kvark on 31 Mar 2021

@VincentFTS as this is a feature, you can enable it in your crate, without changes in wgpu. Just add gpu-alloc to your dependencies with feature enabled.
Once we confirm that FreeListAllocator works fine, I'll just make it on by default and add fields to the config to control when it shall be used.
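In Cargo terms, the suggestion above would amount to a dependency entry along these lines in the application's own Cargo.toml. This is a sketch: the version is the one mentioned earlier in the thread, and the feature only affects wgpu if Cargo resolves this entry to the same gpu-alloc version that wgpu itself depends on, so that the features are unified.

```toml
[dependencies]
# Assumption: must resolve to the same gpu-alloc version wgpu uses,
# otherwise two copies are built and wgpu's copy keeps its default features.
gpu-alloc = { version = "0.4.2", features = ["freelist"] }
```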

zakarumych on 31 Mar 2021

@zakarumych sorry for this late answer …
I tried to activate the freelist feature, and I get a segmentation fault.
thread 'main' panicked at 'attempt to subtract with overflow' in gpu-alloc/src/freelist.rs:140:29

VincentFTS on 26 Apr 2021

Try to use latest commit on git, see if it helps

zakarumych on 26 Apr 2021

I already use it

VincentFTS on 26 Apr 2021

Marked this for 0.8 release. If the changes are in gpu-alloc, they'd naturally be picked up because we'll require gpu-alloc to be published.

kvark on 26 Apr 2021

🎉1
