Occasionally (every few hundred frames on average) I get VUID-vkDestroyFramebuffer-framebuffer-00892 validation errors. The timing seems irregular, so I guess something is racing under the hood.
My render loop is quite straighforward (removed error handling):
let frame = swap_chain.lock().await.get_next_frame().unwrap()
...
let mut encoder = device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
...
let mut rpass = geometry_pass.begin(&mut encoder, &frame.output.view);
...
let mut rpass = lighting_pass.begin(&mut encoder, &frame.output.view);
...
queue.submit(Some(encoder.finish()));
I played around a little and adding:
std::thread::sleep(std::time::Duration::from_millis(10));
drop(frame);
at the end prevents the validation errors.
Would you have a repro case?
I can try to record a trace. Does that help?
Yes, an API trace would be wonderful, plus your adapter info for the reference (we should include it in the trace).
I attached the vulkaninfo output and the trace with current master (had to compress it twice, because github doesn't like .xz and just using zip makes it 20x larger).
lspci says: Advanced Micro Devices, Inc. [AMD/ATI] Navi 14 [Radeon RX 5500/5500M / Pro 5500M] (rev c5)
trace.zip
vulkaninfo.txt
@FlorianUekermann thank you! What wgpu-rs (or wgpu) revision was that from?
wgpu-rs master at time of posting. I updated right before recording. I can't check right now.
I'm having trouble replaying this API trace due to API mismatches. Could you specify the exact revision (of either wgpu-rs or wgpu) this was taken from?
After a bit more wrangling, I'm able to replay this API trace. It's panicking on old.is_none() in the HUB registry for buffers. No validation errors before that.
I'll play around with some traces on the weekend. Is it possible that a race during recording has a different outcome when being replayed?
I don't currently see how this would be possible with regards to the buffer IDs. We record Action::DestroyBuffer right before we free the corresponding ID in the hub, and until we do that nothing can technically allocate this ID for a new buffer.
This is still fairly fresh tech though, so I might be missing something
This still happens on latest wgpu-rs master.
I fixed my panic on shutdown and tried again on wgpu-rs master, which uses wgpu b35209398823e95d71cce718fd98fa54ce43271d right now. Unfortunately I can't play back the traces with that revision:
$ RUST_LOG=info cargo run --features winit -- ../../../trace/
[0.009707 INFO]()(no module): Loading trace '"../../../trace/"'
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Parser(Eof, Position { col: 1, line: 4780556 })', player/src/bin/play.rs:42:70
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Am I using the player wrong? Its not really a problem right now, so I can try again in a few weeks.
Btw, when shutting down I can also produce the occasional segfault and a few other validation errors. Maybe it is worth mentioning that I am rendering in a separate task, not the winit event loop.
Edit: Another observation - The validation error happens a lot more frequently when I am not recording a trace.
The trace can't be replayed because it wasn't finished gracefully, since the app has panicked. All you need to finish it is to edit the file by appending "]" to it.
I wonder if this is just https://github.com/gfx-rs/gfx/issues/3184
All you need to finish it is to edit the file by appending "]" to it.
Ah thanks, now it works. I've tried with a few traces now and I don't get the validation error when playing the trace, even though I get multiple on the original run...
I've been playing around a little more and noticed that I get very noticeable stalls (~1s) right before the validation error prints. introducing a sleep before acquiring a new frame gets rid of the validation error as well as the stalls.
I wonder if this is just gfx-rs/gfx#3184
I hadn't seen that. Seems likely.
There aren't really any examples that are heavy enough to get frames in flight for long. I'll see if I can patch one to reproduce this issue.
Filed #861 to make this less painful
I modified the hello-triangle example to have trivial reproducer for this. If you think a stress test example is generally useful I can send a pull request:
https://github.com/FlorianUekermann/wgpu-rs
I get the error every few ms with: RUST_LOG=info cargo run --features subscriber --example hello-triangle -- --stress
Btw, it took me longer than I care to admit to figure out how to run the example with logs. Maybe that should be documented in a very obvious place (README?).
Btw, it took me longer than I care to admit to figure out how to run the example with logs. Maybe that should be documented in a very obvious place (README?).
Sorry about that! I agree, we should definitely put it into README. @cwfitzgerald could you make a small PR with this to wgpu-rs, please?
I'm running into this validation error as well. I see the validation error go away after applying one of the following changes:
queue.submit as mentioned in https://github.com/gfx-rs/wgpu/issues/785#issue-655407567 (10ms was too short for me, 30ms worked though). EDIT: Adding the sleep before queue.submit also works for me. The location of the sleep doesn't seem to matter too much. Also it might be relevant that device.poll takes about 30ms for me as well, or maybe that's just a coincidence.device.poll(wgpu::Maintain::Wait) before queue.submit.I also tried calling device.poll _after_ queue.submit, which did get rid of the validation error mentioned in this issue, but triggered a different one instead: UNASSIGNED-CoreValidation-DrawState-QueueForwardProgress. The result was the same when calling device.poll in a loop in a separate thread. I could make a separate issue for that?
Most helpful comment
I'm running into this validation error as well. I see the validation error go away after applying one of the following changes:
queue.submitas mentioned in https://github.com/gfx-rs/wgpu/issues/785#issue-655407567 (10ms was too short for me, 30ms worked though). EDIT: Adding the sleep beforequeue.submitalso works for me. The location of the sleep doesn't seem to matter too much. Also it might be relevant thatdevice.polltakes about 30ms for me as well, or maybe that's just a coincidence.device.poll(wgpu::Maintain::Wait)beforequeue.submit.I also tried calling
device.poll_after_queue.submit, which did get rid of the validation error mentioned in this issue, but triggered a different one instead:UNASSIGNED-CoreValidation-DrawState-QueueForwardProgress. The result was the same when callingdevice.pollin a loop in a separate thread. I could make a separate issue for that?