Wgpu-rs: Every 103rd frame takes ~150ms to render.

Created on 11 Jun 2020 · 19 Comments · Source: gfx-rs/wgpu-rs

OS: Arch Linux
GPU: NVIDIA GTX 960
Driver: NVIDIA proprietary drivers 440.82-18

Every ~1 second my application stutters, presumably due to creating many buffers per frame which have to be deallocated at some point.

I can reproduce this in the skybox example by adding:

// Create 400 throwaway vertex buffers per frame to trigger the stutter.
let mut buffers = vec![];
for _ in 0..400 {
    buffers.push(device.create_buffer_with_data(&data, wgpu::BufferUsage::VERTEX));
}

In my application, though, I'm only creating ~20 buffers per frame. Maybe it has to do with the size of the buffers; however, I can't create large buffers for testing purposes without actually using them, due to https://github.com/gfx-rs/wgpu-rs/issues/362

Maybe the answer is to just not create buffers each frame, but I feel like I'm not making many, so it shouldn't be too bad?

bug

All 19 comments

On our side, this could be related to #261: if we aggressively deallocate memory, this would stutter, and we shouldn't be doing that.

On your side, ideally, no resources are created per frame. Are you doing this to update/upload data? Please consider write_buffer and write_texture instead, which are much better for memory allocation, since they use an internal linear allocator.

Oh interesting, I hadn't seen write_buffer/write_texture before.
I was uploading my textures to a buffer first and then doing copy_buffer_to_texture.

Yes I am doing this to update/upload data that needs to change each frame.

Say for uniforms which clearly need new data for each frame.
Should I create a pool of say 100 of them? (allowing up to 100 things to be drawn in a frame)
Should each buffer in the pool be the size of the largest uniform? Or should there be a unique pool for each uniform size?
Do I need to worry about the buffer still being in use by the time the next frame is drawn?

Maybe we need an example to demonstrate how to go about rendering a variable number of objects without creating buffers every frame.

write_buffer and write_texture are very new, in both here and the WebGPU spec itself.
Our examples now use them for everything. Shadow example has multiple objects.

Should I create a pool of say 100 of them? (allowing up to 100 things to be drawn in a frame)

You can create a pool of buffers. Buffers can be larger than the uniform structures in your shader. You don't have to have a separate pool per type.

Or you can use a single buffer with varying offsets. In this case, the dynamic_offset should be true on the buffer binding.

Do I need to worry about the buffer still being in use by the time the next frame is drawn?

No. Think about write_buffer as if it creates a temporary buffer internally and issues a copy operation on the queue.

Ah!
If I'm regenerating the data in my uniforms each frame, and I create one large buffer to hold all the uniforms, then it would be very cheap to deallocate/reallocate that buffer should it run out of space.
That does mean my drawing logic will need to return all the uniforms and then upload them to the single buffer at the end. Seems doable though.
Thanks!

Regardless of how you upload the data, you need to make sure that the data for a render pass is all prepared. No transfers happen in the middle of a render pass.
So if the render pass has N objects requiring N uniform buffers, you need to somehow get them there. I think it's simplest when you know the upper bound on the number of objects: you'd then allocate a single buffer and use it every frame. That is, as you are encoding the render pass, you keep track of the current offset into this buffer, and just issue write_buffer into it for each new piece of data.
Things become more complicated when you don't have any idea on the number of objects.
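
The offset bookkeeping described above can be sketched without any wgpu calls. This is only an illustration under stated assumptions: UniformCursor, reserve and reset are hypothetical names, and the 256-byte alignment is assumed from WebGPU's default minimum uniform buffer offset alignment (check your device's actual limit). Each offset returned by reserve would be the place to write_buffer one object's uniforms, and the value to pass as the dynamic offset when binding.

```rust
/// Assumed alignment for dynamic uniform offsets (256 bytes is the WebGPU
/// default limit; this constant is an assumption, not read from a device).
const UNIFORM_ALIGNMENT: u64 = 256;

/// Rounds `size` up to the next multiple of `align` (`align` must be non-zero).
fn align_up(size: u64, align: u64) -> u64 {
    (size + align - 1) / align * align
}

/// Hypothetical helper that hands out aligned offsets into one large buffer.
struct UniformCursor {
    capacity: u64,
    offset: u64,
}

impl UniformCursor {
    fn new(capacity: u64) -> Self {
        Self { capacity, offset: 0 }
    }

    /// Reserves `size` bytes and returns the offset to write them at,
    /// or `None` if the remaining space is too small.
    fn reserve(&mut self, size: u64) -> Option<u64> {
        let start = self.offset;
        let end = start.checked_add(align_up(size, UNIFORM_ALIGNMENT))?;
        if end > self.capacity {
            return None;
        }
        self.offset = end;
        Some(start)
    }

    /// Call at the start of each frame to reuse the buffer from offset 0.
    fn reset(&mut self) {
        self.offset = 0;
    }
}
```

A `None` from reserve corresponds to the "don't know the number of objects" case: the caller has to fall back to a larger buffer.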

I changed my renderer to reuse a single uniform buffer (and recreate it when it's no longer large enough):
https://github.com/rukai/canon_collision/commit/d7457f5c7d302b2ea32e1c6dbe4b517328326ef8
I also commented out the wgpu_glyph usage, which was causing some allocations per frame.
I then verified via the INFO logging that no allocations are happening per frame.
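
For readers following along, one possible "recreate when too small" policy looks like the sketch below. It is not taken from the linked commit; grown_capacity is a hypothetical helper, and rounding up to a power of two is just one assumed growth strategy that keeps reallocations rare.

```rust
/// Hypothetical helper: returns the capacity the shared uniform buffer should
/// have, given its current capacity and the bytes required this frame.
/// The buffer is kept as-is when it is large enough; otherwise the new
/// capacity is `required` rounded up to the next power of two.
fn grown_capacity(current: u64, required: u64) -> u64 {
    if required <= current {
        current
    } else {
        required.next_power_of_two()
    }
}
```

The caller would only create a new buffer when the returned value differs from the current capacity.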

However this has not resolved the stuttering :/

So I've done some more investigation.
Every 103rd frame, queue.write_buffer takes ~90ms and queue.submit takes ~70ms (these high values vary a lot).
Normally the entire render process takes 500us.
flamegraph wasn't any help here, presumably because it doesn't pick up the spike.

Additionally, the problem does not occur on Windows.

Every 103rd frame, queue.write_buffer takes ~90ms and queue.submit takes ~70ms (these high values vary a lot)

We need to know more about what's happening here. Could it be that we are allocating/freeing a chunk from the linear allocator?

out.txt
Here is a trace log, does this help?
I added a println of "write_buffer" after any write_buffer call that takes longer than 16ms.
I also added a println of "submit" after any submit call that takes longer than 16ms.
So searching for those may help.

Please ignore the application logic debug output; it is occurring in another thread.
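
The kind of instrumentation described above can be sketched with std::time::Instant alone. The names timed, warn_if_slow and FRAME_BUDGET are hypothetical, and the 16ms threshold is the one mentioned in the comment; the closure would wrap a call such as queue.write_buffer or queue.submit.

```rust
use std::time::{Duration, Instant};

/// Threshold above which a call is reported (16ms, as in the comment above).
const FRAME_BUDGET: Duration = Duration::from_millis(16);

/// Runs `f`, returning its result together with how long it took.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Runs `f` and prints `label` with the elapsed time only when the call
/// exceeded `FRAME_BUDGET`, so slow frames are easy to find in a log.
fn warn_if_slow<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let (value, elapsed) = timed(f);
    if elapsed > FRAME_BUDGET {
        println!("{} took {:?}", label, elapsed);
    }
    value
}
```

Wrapping only the suspect calls keeps the log small enough to search for the label.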

Thank you! I'm not seeing anything suspicious in the logs. At this point, we need to inspect your app with a sampling profiler, and then zoom into that section where write_buffer takes longer than a few ms. Is there a repro setup for us? Or would you be able to try any sampling profiler yourself?

You can try compiling it following these instructions: https://github.com/rukai/canon_collision/blob/master/compiling.md (set up the deps, skipping the gtk setup, then follow "Compile and run the game")
or get a build produced by the CI: https://canoncollision.com/builds (I don't know if that's profileable).
If you don't have an Xbox controller, you will want to run with the command line arg -fToriel.cbor to force it to start a game and start rendering things.

Otherwise, let me know what "sampling profiler" you would recommend for Linux and I'll give it a go when I have time.

perf is the classy choice on Linux. Callgrind may help too, although I have less experience with it. I'll try to find out what happens there if I'm able to reproduce this.

Ah, so it turns out the flamegraph cargo tool uses perf internally.
I had tried it earlier but it didn't provide any useful information.

Now that I actually know what the problem functions are (queue.submit and queue.write_buffer),
I zoomed into them:
image

image
Still doesn't seem to reveal anything super useful though :/

I resorted to putting Instant::elapsed() everywhere and tracked down the queue.submit issue to NonReferencedResources::clean

This heaps.free is taking all the time.

    unsafe fn clean(
        &mut self,
        device: &B::Device,
        heaps_mutex: &Mutex<Heaps<B>>,
        descriptor_allocator_mutex: &Mutex<DescriptorAllocator<B>>,
    ) {
        let start = std::time::Instant::now();
        if !self.buffers.is_empty() {
            let mut heaps = heaps_mutex.lock();
            for (raw, memory) in self.buffers.drain(..) {
                log::trace!("Buffer {:?} is destroyed with memory {:?}", raw, memory);
                device.destroy_buffer(raw);
                println!("NonReferencedResources2: {:?}", start.elapsed());
                heaps.free(device, memory);
                println!("NonReferencedResources3: {:?}", start.elapsed());
            }
        }

  12: wgpu_core::device::life::NonReferencedResources<B>::clean
             at /home/rukai/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/libstd/macros.rs:13
  13: wgpu_core::device::life::LifetimeTracker<B>::cleanup
             at /home/rukai2/Projects/Crates/wgpu/wgpu/wgpu-core/src/device/life.rs:330
  14: wgpu_core::device::Device<B>::maintain
             at /home/rukai2/Projects/Crates/wgpu/wgpu/wgpu-core/src/device/mod.rs:306
  15: wgpu_core::device::queue::<impl wgpu_core::hub::Global<G>>::queue_submit
             at /home/rukai2/Projects/Crates/wgpu/wgpu/wgpu-core/src/device/queue.rs:537
  16: wgpu::backend::direct::<impl wgpu::Context for wgpu_core::hub::Global<wgpu_core::hub::IdentityManagerFactory>>::queue_submit
             at /home/rukai/.cargo/git/checkouts/wgpu-rs-40ea39809c03c5d8/16054f2/src/backend/direct.rs:18
  17: wgpu::Queue::submit
             at /home/rukai/.cargo/git/checkouts/wgpu-rs-40ea39809c03c5d8/16054f2/src/lib.rs:1850
  18: canon_collision::wgpu::WgpuGraphics::render
             at canon_collision/src/wgpu/mod.rs:775

Here's a stack trace; line numbers will be a little off because of all the printlns I put everywhere.

Thank you @rukai that is extremely helpful!
It looks to be the same as #261: gfx-memory is too eager to deallocate memory. It's somewhat surprising that de-allocation can take that much time though.
I think what's going on is that the driver has a GC-like pass over memory. It doesn't de-allocate the instant we ask it to. But once in 103 frames, it decides to stop the world and deallocate everything.

I removed the urgent label as we have a much better idea about what's going on, and it's sorta a known issue. Is this a blocker for you?

Well, it's just a personal project so it doesn't matter much in that sense.
I would consider the stuttering as rendering the game unplayable on my computer.
But it's still usable enough for testing, so I can still work on some things.

I've been researching how to fix the linear allocator, so you can assign this and #261 to me.
