Related PRs:
https://github.com/gfx-rs/wgpu-rs/pull/155
https://github.com/gfx-rs/wgpu-rs/pull/168
For context, I was looking at updating imgui-wgpu to work with the current master.
Now that set_index_buffer (and similar methods) take the a &'a Buffer instead of &Buffer, it has a ripple effect of "infecting" the surrounding scope with this lifetime.
While attempting to iterate and add draw calls, you end up with either "multiple mutable borrow" errors on wherever you're storing the Buffer or similar lifetime errors like "_data from self flows into rpass here_"
I would think that by calling set_index_buffer (and similar methods) that it would increment the ref-count on the buffer so that this lifetime restriction isn't needed.
This is indeed a feature that makes it more difficult to use. What we gain from it is very lightweight render pass recording. Explanation follows.
wgpu is designed to fully support multi-threading. We do this by having whole storages of different objects locked upon access. So generally, touching anything has a CPU cost. If we had to access each of the refcounts of the resources used by render commands, we'd be paying that cost during the render pass recording, which is hot code. Now with the recent changes, recording a pass is very lightweight, no locks involved.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic. Generally, unless you are creating many resources during recording, it's easy to work around. If you are doing that, you aren't on a good path performance wise, and should consider creating one big buffer instead per frame.
Another way to possibly address this is to have wgpu-rs ensuring the lifetimes of the objects by other means, like keeping a refcount (note: talking about wgpu-rs here, not wgpu, which already keeps a refcount, but we'd need to lock an object storage to access it). This is far fetched and not something I'm excited about :)
head's up to @mitchmindtree, who is on 0.4 and will face this issue soon. It would be good to know how much this would affect their case.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls with the intent of clearing it right before the next recording. As soon as you reference the buffer in the vec (or where ever) from set_index_buffer, the vec lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.
Judging by #155 and #168 I don't imagine it should affect us a great deal - most of the abstractions in nannou take a &mut CommandEncoder and encode the whole render pass within the scope of a single function, e.g.
UI render pass https://github.com/nannou-org/nannou/blob/master/src/ui.rs#L451Draw API render pass https://github.com/nannou-org/nannou/blob/master/src/draw/backend/wgpu/mod.rs#L179TextureReshaper render pass https://github.com/nannou-org/nannou/blob/master/src/wgpu/texture/reshaper/mod.rs#L117These are just a few examples - generally all of these are submitted on a single CommandBuffer once per frame. Most nannou users might not use all these abstractions in a single sketch/app though.
Anyway, I hope I'm not speaking too soon as I haven't tried updating yet. There are some other things I'd like to address in nannou first, but I'll report back once I get around to it.
Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
I think the issue might be a bit more severe. I put the buffers into a
Vecthat would persist between calls with the intent of clearing it right before the next recording. As soon as you reference the buffer in the vec (or where ever) fromset_index_buffer, the vec lifetime becomes "linked" to theRenderPass<'render>lifetime in a mutable borrow. This prevents accessing it again.
Yes. So the good news is - this is a not the best pattern to follow as a use case: creating buffers as you are recording a pass. Would it be possible for you to refactor the code in a way that first figures out how much space is needed for, say, all indices in a pass, creating a single buffer, and then using it through the pass?
I also hit this issue in the game engine I've been building. It has a relatively complex "Render Graph" style api and I spent a solid ~15 hours over the last week refactoring everything to account for the new lifetime requirements. In general I solved the problem the same way @kvark outlined. Although I really wish I had found this thread before coming to the same conclusion via trial and error :smile: .
My final solution was:
My takeaways from this (honestly very painful) experience:
That being said, I understand why these lifetime changes were made _and_ I think the wgpu-rs team made the right call here. I certainly don't want this to read as a complaint. I _deeply_ appreciate the work @kvark and the wgpu team has done here. I just want to add my experience as a data point.
Thank you for feedback @cart !
Just wanted to add that this is all being evaluated. We aren't completely sure if these lifetimes are a good idea. It's certainly the easiest for wgpu to work with, but I totally agree that it could cause headaches for the users... and it does.
The good thing here is that wgpu-rs is just a Rust idiomatic wrapper around wgpu, which is a C API and it doesn't have explicit lifetimes (although, same lifetimes are required implicitly). So what we could do is having others pass variants, e.g. ArcRenderPass and ArcComputePass, which would work similarly but receive Arc<> in their parameters and store the references inside, e.g.:
struct ArcRenderPass<'a> {
id: wgc::id::RenderPassId,
_parent: &'a mut CommandEncoder,
used_buffers: Vec<Arc<Buffer>>,
}
impl ArcRenderPass<'_> {
fn set_vertex_buffer(&mut self, slot: u32, buffer: &Arc<Buffer>, offset: BufferOffset) {
self.used_buffers.push(Arc::clone(buffer));
unsafe {
wgn::wgpu_render_pass_set_vertex_buffer(
self.id.as_mut().unwrap(),
slot,
buffer.id,
offset,
)
};
}
}
These passes could be used interchangeably with the current ones and trade the life time restrictions to a bit of run-time overhead for the Arc. We could go further and try to encapsulate the thing that keeps track of the resources, which you can only append to. There is a lot of ways to be fancy and lazy here :)
Ooh I think I like the "multiple pass variants" idea because it gives people the choice of "cognitive load vs runtime cost". The downsides I can see are:
On the other hand, the "zero cost abstraction" we have currently feels more in line with the Rust mindset and I'm sure many people would prefer it. I'm also in the weird position where I'm over the "migration hump" and now I really want a zero cost abstraction. Its hard for me to be objective here :smile:
I think this could be solved with either:
If I had to pick one today, I think I would go for (2). Rather than complicating the api surface / being forced to support that forever and document it clearly, just see if additional docs and examples for the "zero cost lifetimes" solves the problem well enough for users. If this continues to be a problem you can always add the variant(s). Removing apis is harder on users than adding features, so I think it makes sense to bias toward a smaller api.
The other interesting aspect is that in this use case, we don't care about the Buffer lifetime in terms of it's memory location. We only care that the Drop impl does not run. The difference is subtle but it opens up some other possibilities. For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc? Are there guards in WGPU core/native that protect against this? If so, does it make sense to have guards in this case, but not in the case of a buffer being dropped before the pass is finished recording?
For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc?
Generally, we have all the objects refcounted, and you don't lose the device just because you drop it. The only exception really is render/compute pass recording, where we only want to work with ID and not go into the objects themselves (until the recording is finished) to bump the refcounts.
Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).
This would appear the simplest option to me. It can probably even be done without breaking changes:
pub enum BufferOwnedOrRef<'a> {
Owned(Buffer),
Ref(&'a Buffer),
}
impl<'a> From<Buffer> for BufferOwnedOrRef<'a> {
fn from(b: Buffer) -> Self {
BufferOwnedOrRef::Owned(b)
}
}
impl<'a> From<&'a Buffer> for BufferOwnedOrRef<'a> {
fn from(b: &'a Buffer) -> Self {
BufferOwnedOrRef::Ref(b)
}
}
pub fn set_vertex_buffer<'a, B: Into<BufferOwnedOrRef<'a>>(
&mut self,
slot: u32,
buffer: B,
offset: BufferAddress,
size: BufferAddress
)
@dhardy yes, we could. I hesitate, however, because I see the value in not promoting the code path where the user creates resources in the middle of a render pass. It's an anti-pattern. The only reason that could make this path appealing today is because updating GPU data is hard.
Here is what needs to happen (ideally) when you are creating a new vertex buffer with data:
Now, imagine you already have a buffer that is big enough(!). That would spare you (3) but otherwise follow the same steps. Therefore, there is no reason for us to make it easy to create new buffers, even if you are replacing all the contents of something. It's always more efficient to use an existing one.
The only caveat is - what if you need a bigger buffer? Let's see if this becomes a blocker.
For the data uploads, the group is still talking about the ways to do it. Hopefully, soon...
FYI, you can emulate the ArcRenderPass API using arenas in user space, and it should basically be just as efficient as the equivalent WGPU API (unless core implementation details change a lot to increment internal reference counts before the RenderPass is dropped).
struct ArcRenderPass<'a> {
arena: &'a TypedArena<Arc<Buffer>>,
render_pass: RenderPass<'a>
}
impl<'a> ArcRenderPass<'a> {
fn set_vertex_buffer(&mut self, slot: u32, buffer: Arc<Buffer>, offset: BufferOffset) {
let buffer = self.arena.alloc(buffer);
self.render_pass.set_vertex_buffer(slot, buffer, offset);
}
}
fn blah<'a>(encoder: &'a mut CommandEncoder) {
let arena = TypedArena::new();
let arc_render_pass = ArcRenderPass {
arena,
render_pass: encoder.begin_render_pass(..),
};
// ... Do stuff; you can pass around &mut ArcRenderPass and call set_vertex_buffer on owned `Arc`s.
}
@pythonesque it would be wonderful if we had that used by one of the examples. Would you mind doing a PR for this? We'd then be able to point users to working code instead of this snippet.
Just to provide another data point, I hit this issue as well.
Consider that I want my user to be able to simply call an API to render high-level objects without worrying about details of which buffers to use. There are 2 options:
1) Define a buffer size upfront. If they exceed it at any point when issuing a high-level render call, internally end the current render pass, and then set up the same render pass again so we can reuse the same buffer.
2) Instead of ending the render pass, we create buffers dynamically within the same render-pass to avoid setting up identical states.
I'm not sure which is more performant. With option (1), it seems good but we are blocked until the GPU has finished its job, effectively losing parallelism (unless I force the user to go full async and/or use double-buffering). With option (2), we're infinitely-buffered but pays for the cost of allocations.
Ultimately, with the current lifetime constraints, option (2) is not possible. So we're forced to go for option (1).
As a side point, it is a little clunky using the current buffer mapping API to go with option (1). I referred to #9 and saw this advice from @kvark:
That's why the current rough and effective way to update data is to go through create_buffer_mapped.
which seemed to contradict the approach to its core.
@DefinitelyNotRobot I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
it seems good but we are blocked until the GPU has finished its job, effectively losing parallelism
Also, this part:
As a side point, it is a little clunky using the current buffer mapping API to go with option (1).
This issue #9 is actually no longer a problem. The upstream WebGPU API went in this direction, and it's a part of wgpu-0.6.
Did you consider using the TypedArena<Arc<wgpu::Buffer>> and stuff like https://github.com/gfx-rs/wgpu-rs/issues/188#issuecomment-631143941 suggests?
I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:
Sorry about that. I was trying to illustrate the 2 designs that I could go with my API + their pros/cons and meant to say that option (2) was not even considerable because of RenderPass's lifetime constraints.
Did you consider using the
TypedArena<Arc<wgpu::Buffer>>and stuff like #188 (comment) suggests?
I was hesitant because that would mean an allocation for TypedArena every render call but on second thought, it seems I could haul it out and store it in my renderer object instead. I'll give that a try.
Most helpful comment
I also hit this issue in the game engine I've been building. It has a relatively complex "Render Graph" style api and I spent a solid ~15 hours over the last week refactoring everything to account for the new lifetime requirements. In general I solved the problem the same way @kvark outlined. Although I really wish I had found this thread before coming to the same conclusion via trial and error :smile: .
My final solution was:
My takeaways from this (honestly very painful) experience:
That being said, I understand why these lifetime changes were made _and_ I think the wgpu-rs team made the right call here. I certainly don't want this to read as a complaint. I _deeply_ appreciate the work @kvark and the wgpu team has done here. I just want to add my experience as a data point.