Wgpu-rs: The recent lifetime changes to `RenderPass` make it difficult to use.

Created on 4 Mar 2020 · 16 Comments · Source: gfx-rs/wgpu-rs

For context, I was looking at updating imgui-wgpu to work with the current master.

Now that set_index_buffer (and similar methods) take a &'a Buffer instead of a &Buffer, it has a ripple effect of "infecting" the surrounding scope with this lifetime.

While attempting to iterate and add draw calls, you end up with either "multiple mutable borrow" errors on wherever you're storing the Buffer, or similar lifetime errors like "_data from self flows into rpass here_".

I would think that by calling set_index_buffer (and similar methods) that it would increment the ref-count on the buffer so that this lifetime restriction isn't needed.
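
For readers who have not hit this yet, here is a minimal sketch of the borrow pattern being described. `Buffer` and `RenderPass` are stand-in types written for this illustration, not the real wgpu-rs ones; only the `&'a Buffer` shape of the method is taken from the description above.

```rust
use std::marker::PhantomData;

// Stand-ins for the wgpu-rs types; only the lifetime shape matters here.
struct Buffer {
    id: u32,
}

struct RenderPass<'a> {
    // IDs recorded so far; the marker ties the pass to the borrowed buffers.
    recorded: Vec<u32>,
    _buffers: PhantomData<&'a Buffer>,
}

impl<'a> RenderPass<'a> {
    fn new() -> Self {
        RenderPass { recorded: Vec::new(), _buffers: PhantomData }
    }

    // Mirrors the new signature: the buffer must outlive the whole pass.
    fn set_index_buffer(&mut self, buffer: &'a Buffer) {
        self.recorded.push(buffer.id);
    }
}

fn record(draw_count: u32) -> Vec<u32> {
    let mut buffers: Vec<Buffer> = Vec::new();

    // This compiles: every buffer exists before the pass borrows the Vec.
    for id in 0..draw_count {
        buffers.push(Buffer { id });
    }
    let mut rpass = RenderPass::new();
    for buffer in &buffers {
        rpass.set_index_buffer(buffer);
    }

    // This would NOT compile: `buffers` is still borrowed by `rpass`,
    // which is used again below.
    // buffers.push(Buffer { id: 99 });
    // rpass.set_index_buffer(buffers.last().unwrap());

    rpass.recorded
}
```

Uncommenting the two lines near the end reproduces the kind of borrow error mentioned above, because the pass keeps the `Vec` borrowed for as long as it is alive.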

bug

aloucks

👍1

Most helpful comment

I also hit this issue in the game engine I've been building. It has a relatively complex "Render Graph" style api and I spent a solid ~15 hours over the last week refactoring everything to account for the new lifetime requirements. In general I solved the problem the same way @kvark outlined. Although I really wish I had found this thread before coming to the same conclusion via trial and error :smile: .

My final solution was:

- Separate the "gpu resources" container out from the renderer container. I'm not sure this was absolutely necessary, but it helped me reason about the lifetimes. I also think it's a better design.
- Do all resource creation / write operations before starting a render pass. Previously I allowed my "render pass" / "draw target" abstractions to create resources during the execution of a render pass. I broke this up into a "setup" phase and a "draw" phase. I removed all mutability from the "draw" phase.
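
A rough sketch of how the two points above might look in code. Every name here (`GpuResources`, `create_buffer`, `draw`, the string keys) is hypothetical and the types are stand-ins, so treat it as an illustration of the setup/draw split rather than the engine's actual design.

```rust
use std::collections::HashMap;

struct Buffer {
    id: u32,
}

// Hypothetical container that owns every GPU resource, kept apart from the renderer.
#[derive(Default)]
struct GpuResources {
    buffers: HashMap<&'static str, Buffer>,
    next_id: u32,
}

impl GpuResources {
    // Setup phase: the only place that needs `&mut self`.
    fn create_buffer(&mut self, name: &'static str) {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(name, Buffer { id });
    }
}

struct RenderPass<'a> {
    bound: Vec<&'a Buffer>,
}

impl<'a> RenderPass<'a> {
    fn set_vertex_buffer(&mut self, buffer: &'a Buffer) {
        self.bound.push(buffer);
    }
}

// Draw phase: resources are only shared immutably, so the borrows can
// live as long as the pass does.
fn draw<'a>(resources: &'a GpuResources, pass: &mut RenderPass<'a>, names: &[&'static str]) {
    for name in names {
        if let Some(buffer) = resources.buffers.get(name) {
            pass.set_vertex_buffer(buffer);
        }
    }
}

fn render_frame() -> Vec<u32> {
    // Setup: all creation happens before the pass exists.
    let mut resources = GpuResources::default();
    resources.create_buffer("mesh");
    resources.create_buffer("ui");

    // Draw: no mutation of `resources` from here on.
    let mut pass = RenderPass { bound: Vec::new() };
    draw(&resources, &mut pass, &["mesh", "ui"]);
    let ids: Vec<u32> = pass.bound.iter().map(|b| b.id).collect();
    ids
}
```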

My takeaways from this (honestly very painful) experience:

- I like to think I have a strong understanding of Rust lifetimes and I consider myself "proficient" at handling most lifetime issues. In general I find the compiler's lifetime error messages extremely helpful, but this class of lifetime error was almost impossible to debug. I wasn't able to solve the problem via the hints from the compiler as I normally do, but rather via a vague notion of the "intent" behind the new lifetimes and what that could mean for my program.
- Breaking changes to a dependency's lifetimes can have massive effects on a project's architecture. This is especially problematic when there are multiple lifetime changes in the dependency's API because you don't see _all_ lifetime errors when a compile fails. Instead you need to solve the first problem the compiler hits (sometimes via re-architecting), then the second, etc. until all lifetime issues have been accounted for. I was forced to re-architect multiple times because at each attempt, I had an incomplete picture of the lifetime constraints I was solving for. The fact that the compiler can't give me a "full picture" of the lifetime constraints I need to solve actually makes me really uneasy about updating Rust dependencies and accounting for API changes. Maybe the Rust compiler could be improved somehow to make this type of problem solving easier? But I honestly have no clue what that would look like. This "constraint solving" problem would also exist if I was starting my project from scratch on the current wgpu-rs master branch. I honestly just hope this is a niche problem that I never see on this scale again.

That being said, I understand why these lifetime changes were made _and_ I think the wgpu-rs team made the right call here. I certainly don't want this to read as a complaint. I _deeply_ appreciate the work @kvark and the wgpu team have done here. I just want to add my experience as a data point.

cart on 9 Mar 2020

👍5

All 16 comments

This is indeed a feature that makes it more difficult to use. What we gain from it is very lightweight render pass recording. Explanation follows.

wgpu is designed to fully support multi-threading. We do this by having whole storages of different objects locked upon access. So generally, touching anything has a CPU cost. If we had to access each of the refcounts of the resources used by render commands, we'd be paying that cost during the render pass recording, which is hot code. Now with the recent changes, recording a pass is very lightweight, no locks involved.

Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic. Generally, unless you are creating many resources during recording, it's easy to work around. If you are doing that, you aren't on a good path performance-wise, and should consider creating one big buffer per frame instead.

Another way to possibly address this is to have wgpu-rs ensure the lifetimes of the objects by other means, like keeping a refcount (note: talking about wgpu-rs here, not wgpu, which already keeps a refcount, but we'd need to lock an object storage to access it). This is far-fetched and not something I'm excited about :)

Heads up to @mitchmindtree, who is on 0.4 and will face this issue soon. It would be good to know how much this would affect their case.

kvark on 4 Mar 2020

> Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.

I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls with the intent of clearing it right before the next recording. As soon as you reference the buffer in the vec (or wherever) from set_index_buffer, the vec lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.

aloucks on 4 Mar 2020

Judging by #155 and #168 I don't imagine it should affect us a great deal - most of the abstractions in nannou take a &mut CommandEncoder and encode the whole render pass within the scope of a single function, e.g.

- The UI render pass: https://github.com/nannou-org/nannou/blob/master/src/ui.rs#L451
- The Draw API render pass: https://github.com/nannou-org/nannou/blob/master/src/draw/backend/wgpu/mod.rs#L179
- The TextureReshaper render pass: https://github.com/nannou-org/nannou/blob/master/src/wgpu/texture/reshaper/mod.rs#L117

These are just a few examples - generally all of these are submitted on a single CommandBuffer once per frame. Most nannou users might not use all these abstractions in a single sketch/app though.

Anyway, I hope I'm not speaking too soon as I haven't tried updating yet. There are some other things I'd like to address in nannou first, but I'll report back once I get around to it.

mitchmindtree on 4 Mar 2020

👍1

> > Most of the dependent resources are meant to outlive the pass anyway. Only a few, like the index buffers you create dynamically, become problematic.
>
> I think the issue might be a bit more severe. I put the buffers into a Vec that would persist between calls with the intent of clearing it right before the next recording. As soon as you reference the buffer in the vec (or wherever) from set_index_buffer, the vec lifetime becomes "linked" to the RenderPass<'render> lifetime in a mutable borrow. This prevents accessing it again.

Yes. So the good news is that creating buffers as you are recording a pass is not the best pattern to follow anyway. Would it be possible for you to refactor the code so that it first figures out how much space is needed for, say, all indices in a pass, creates a single buffer, and then uses it throughout the pass?
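
The refactoring suggested here could be sketched roughly as follows. The types are stand-ins (a `Vec<u16>` plays the role of GPU memory) and `pack_indices` is a hypothetical helper, so this only illustrates the "measure first, allocate once, then record" ordering.

```rust
use std::ops::Range;

// Stand-in for a GPU index buffer; a real one would be created through the device.
struct Buffer {
    data: Vec<u16>,
}

// Step 1: pack every mesh's indices into one allocation and remember the ranges.
fn pack_indices(meshes: &[Vec<u16>]) -> (Buffer, Vec<Range<usize>>) {
    let total: usize = meshes.iter().map(|m| m.len()).sum();
    let mut data = Vec::with_capacity(total);
    let mut ranges = Vec::with_capacity(meshes.len());
    for mesh in meshes {
        let start = data.len();
        data.extend_from_slice(mesh);
        ranges.push(start..data.len());
    }
    (Buffer { data }, ranges)
}

struct RenderPass<'a> {
    index_buffer: Option<&'a Buffer>,
    draws: Vec<Range<usize>>,
}

impl<'a> RenderPass<'a> {
    fn set_index_buffer(&mut self, buffer: &'a Buffer) {
        self.index_buffer = Some(buffer);
    }

    fn draw_indexed(&mut self, range: Range<usize>) {
        self.draws.push(range);
    }
}

// Step 2: record the pass against the single buffer, one draw per range.
fn record(meshes: &[Vec<u16>]) -> (usize, Vec<Range<usize>>) {
    let (buffer, ranges) = pack_indices(meshes);
    let mut pass = RenderPass { index_buffer: None, draws: Vec::new() };
    pass.set_index_buffer(&buffer);
    for range in ranges {
        pass.draw_indexed(range);
    }
    let bound_len = pass.index_buffer.map_or(0, |b| b.data.len());
    (bound_len, pass.draws)
}
```

Because the buffer is complete before the pass starts, nothing needs to be mutated while the pass holds its borrow.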

kvark on 4 Mar 2020

Thank you for the feedback, @cart!
Just wanted to add that this is all being evaluated. We aren't completely sure if these lifetimes are a good idea. It's certainly the easiest for wgpu to work with, but I totally agree that it could cause headaches for the users... and it does.
The good thing here is that wgpu-rs is just a Rust idiomatic wrapper around wgpu, which is a C API and doesn't have explicit lifetimes (although the same lifetimes are required implicitly). So what we could do is have other pass variants, e.g. ArcRenderPass and ArcComputePass, which would work similarly but receive Arc<> in their parameters and store the references inside, e.g.:

use std::sync::Arc;

struct ArcRenderPass<'a> {
    id: wgc::id::RenderPassId,
    _parent: &'a mut CommandEncoder,
    // Clones held here keep every bound buffer alive until the pass is dropped.
    used_buffers: Vec<Arc<Buffer>>,
}

impl ArcRenderPass<'_> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: &Arc<Buffer>, offset: BufferOffset) {
        // Bump the refcount instead of tying the buffer's lifetime to the pass.
        self.used_buffers.push(Arc::clone(buffer));
        unsafe {
            wgn::wgpu_render_pass_set_vertex_buffer(
                self.id.as_mut().unwrap(),
                slot,
                buffer.id,
                offset,
            )
        };
    }
}

These passes could be used interchangeably with the current ones and trade the lifetime restrictions for a bit of run-time overhead for the Arc. We could go further and try to encapsulate the thing that keeps track of the resources, which you can only append to. There are a lot of ways to be fancy and lazy here :)
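
To make the trade-off concrete, here is a self-contained toy version of the idea that compiles without wgpu. The `Buffer` type and the `recorded` field are stand-ins invented for this sketch; the point is only that the pass holds `Arc` clones rather than borrows, so the caller's storage stays free to mutate.

```rust
use std::sync::Arc;

// Stand-in buffer type for the sketch.
struct Buffer {
    id: u32,
}

// Toy pass: no lifetime parameter, because it owns Arc clones instead of borrowing.
struct ArcRenderPass {
    used_buffers: Vec<Arc<Buffer>>,
    recorded: Vec<u32>,
}

impl ArcRenderPass {
    fn set_vertex_buffer(&mut self, buffer: &Arc<Buffer>) {
        // The run-time cost: one atomic refcount increment per call.
        self.used_buffers.push(Arc::clone(buffer));
        self.recorded.push(buffer.id);
    }
}

fn demo() -> (Vec<u32>, usize) {
    let mut pass = ArcRenderPass { used_buffers: Vec::new(), recorded: Vec::new() };
    let mut scratch: Vec<Arc<Buffer>> = Vec::new();
    for id in 0..3 {
        // Mutating `scratch` while recording is fine: the pass never borrows it.
        scratch.push(Arc::new(Buffer { id }));
        pass.set_vertex_buffer(scratch.last().unwrap());
    }
    // Even after the caller lets go, the pass still keeps the buffers alive.
    scratch.clear();
    let still_alive = pass.used_buffers.len();
    (pass.recorded, still_alive)
}
```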

kvark on 9 Mar 2020

Ooh I think I like the "multiple pass variants" idea because it gives people the choice of "cognitive load vs runtime cost". The downsides I can see are:

- larger api surface
- it raises questions like "which pass variant do you put into tutorials" / "what variant should you steer people to by default"
- educating users about why there are two and when they should choose one over the other. this lifetime problem is hard to wrap your head around until you have a full understanding of the system.

On the other hand, the "zero cost abstraction" we have currently feels more in line with the Rust mindset and I'm sure many people would prefer it. I'm also in the weird position where I'm over the "migration hump" and now I really want a zero cost abstraction. It's hard for me to be objective here :smile:

I think this could be solved with either:

1) multiple pass variants and docs that make it clear to newbies _and_ experts what path they should take.
2) documentation that explains what the "zero-cost" lifetimes mean for programs, plus examples that illustrate updating resources across frames within a shared "gpu resource collection".
3) a documentation note somewhere that if the pass lifetimes are too difficult, users can always break glass and use the "wgpu" C api directly.
4) some combination of (1), (2), and (3)

If I had to pick one today, I think I would go for (2). Rather than complicating the api surface / being forced to support that forever and document it clearly, just see if additional docs and examples for the "zero cost lifetimes" solves the problem well enough for users. If this continues to be a problem you can always add the variant(s). Removing apis is harder on users than adding features, so I think it makes sense to bias toward a smaller api.

cart on 9 Mar 2020

👍1

The other interesting aspect is that in this use case, we don't care about the Buffer lifetime in terms of its memory location. We only care that the Drop impl does not run. The difference is subtle but it opens up some other possibilities. For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.
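
A minimal sketch of what such a deferred deletion list might look like, assuming a channel is used to carry the IDs. `Device`, `Buffer`, `graveyard` and `collect_garbage` are all hypothetical names for this illustration and do not describe how wgpu-rs is implemented.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical buffer handle: dropping it only queues the ID for later deletion.
struct Buffer {
    id: u32,
    graveyard: Sender<u32>,
}

impl Drop for Buffer {
    fn drop(&mut self) {
        // Ignore the error if the device (the receiving end) is already gone.
        let _ = self.graveyard.send(self.id);
    }
}

struct Device {
    sender: Sender<u32>,
    pending: Receiver<u32>,
    next_id: u32,
}

impl Device {
    fn new() -> Self {
        let (sender, pending) = channel();
        Device { sender, pending, next_id: 0 }
    }

    fn create_buffer(&mut self) -> Buffer {
        let id = self.next_id;
        self.next_id += 1;
        Buffer { id, graveyard: self.sender.clone() }
    }

    // Called at a safe point, e.g. once the recorded pass has been submitted.
    // Returns the IDs whose underlying resources can now really be freed.
    fn collect_garbage(&mut self) -> Vec<u32> {
        self.pending.try_iter().collect()
    }
}

fn demo() -> (Vec<u32>, Vec<u32>) {
    let mut device = Device::new();
    let mut recorded = Vec::new();
    {
        let temp = device.create_buffer();
        recorded.push(temp.id); // the pass keeps only the ID
    } // `temp` is dropped here, but nothing is deleted yet
    let deleted = device.collect_garbage();
    (recorded, deleted)
}
```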

While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc? Are there guards in WGPU core/native that protect against this? If so, does it make sense to have guards in this case, but not in the case of a buffer being dropped before the pass is finished recording?

aloucks on 11 Mar 2020

> For example, we could relax the lifetime constraints and then alter the Drop impl to send the ID to a deferred deletion list rather than delete immediately.

Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).

> While on the topic of lifetimes and safety, what happens if a Device is dropped before a Queue, Buffer, BindGroup, etc?

Generally, we have all the objects refcounted, and you don't lose the device just because you drop it. The only exception really is render/compute pass recording, where we only want to work with ID and not go into the objects themselves (until the recording is finished) to bump the refcounts.

kvark on 11 Mar 2020

> Yep, we could do something like that as well. It would also involve a different signature for render pass functions though (since you'd be lifting the lifetime restriction we have today).

This would appear the simplest option to me. It can probably even be done without breaking changes:

// Accepts either an owned buffer or a borrowed one.
pub enum BufferOwnedOrRef<'a> {
    Owned(Buffer),
    Ref(&'a Buffer),
}

impl<'a> From<Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: Buffer) -> Self {
        BufferOwnedOrRef::Owned(b)
    }
}

impl<'a> From<&'a Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: &'a Buffer) -> Self {
        BufferOwnedOrRef::Ref(b)
    }
}

// Signature only; the body is omitted.
pub fn set_vertex_buffer<'a, B: Into<BufferOwnedOrRef<'a>>>(
    &mut self,
    slot: u32,
    buffer: B,
    offset: BufferAddress,
    size: BufferAddress,
)
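
As a rough, compilable illustration of how a pass could consume that enum, here is a toy version with stand-in types. The `kept_alive` and `recorded` fields and the simplified one-argument `set_vertex_buffer` are assumptions made for the sketch, not part of the proposal above.

```rust
// Stand-in buffer type for the sketch.
struct Buffer {
    id: u32,
}

enum BufferOwnedOrRef<'a> {
    Owned(Buffer),
    Ref(&'a Buffer),
}

impl<'a> BufferOwnedOrRef<'a> {
    fn id(&self) -> u32 {
        match self {
            BufferOwnedOrRef::Owned(b) => b.id,
            BufferOwnedOrRef::Ref(b) => b.id,
        }
    }
}

impl<'a> From<Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: Buffer) -> Self {
        BufferOwnedOrRef::Owned(b)
    }
}

impl<'a> From<&'a Buffer> for BufferOwnedOrRef<'a> {
    fn from(b: &'a Buffer) -> Self {
        BufferOwnedOrRef::Ref(b)
    }
}

struct RenderPass<'a> {
    // Owned buffers live here until the pass is dropped; borrowed ones are just references.
    kept_alive: Vec<BufferOwnedOrRef<'a>>,
    recorded: Vec<u32>,
}

impl<'a> RenderPass<'a> {
    fn set_vertex_buffer<B: Into<BufferOwnedOrRef<'a>>>(&mut self, buffer: B) {
        let buffer = buffer.into();
        self.recorded.push(buffer.id());
        self.kept_alive.push(buffer);
    }
}

fn demo() -> (Vec<u32>, usize) {
    let long_lived = Buffer { id: 1 };
    let mut pass = RenderPass { kept_alive: Vec::new(), recorded: Vec::new() };
    pass.set_vertex_buffer(&long_lived); // existing call style keeps working
    pass.set_vertex_buffer(Buffer { id: 2 }); // a temporary is moved into the pass
    let kept = pass.kept_alive.len();
    (pass.recorded, kept)
}
```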

dhardy on 24 Apr 2020

@dhardy yes, we could. I hesitate, however, because I see the value in not promoting the code path where the user creates resources in the middle of a render pass. It's an anti-pattern. The only reason that could make this path appealing today is because updating GPU data is hard.

Here is what needs to happen (ideally) when you are creating a new vertex buffer with data:

1) a new chunk of staging/IPC memory is linearly allocated for the data
2) the data is filled in or copied over to that staging chunk
3) a piece of GPU memory is allocated for the data
4) a copy operation is encoded and enqueued, it copies from staging to the GPU memory

Now, imagine you already have a buffer that is big enough(!). That would spare you (3) but otherwise follow the same steps. Therefore, there is no reason for us to make it easy to create new buffers, even if you are replacing all the contents of something. It's always more efficient to use an existing one.

The only caveat is - what if you need a bigger buffer? Let's see if this becomes a blocker.
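
One common way to handle the "bigger buffer" caveat is a reuse-or-grow wrapper that is updated before the pass begins. The sketch below uses stand-in types (a `Vec<u8>` plays the role of GPU memory) and hypothetical names (`DynamicBuffer`, `upload`); the power-of-two growth policy is an assumption of this example, not something proposed in the thread.

```rust
// Stand-in for a GPU buffer; `capacity` plays the role of the allocated size.
struct Buffer {
    capacity: usize,
    contents: Vec<u8>,
}

impl Buffer {
    fn with_capacity(capacity: usize) -> Self {
        Buffer { capacity, contents: Vec::with_capacity(capacity) }
    }
}

// Hypothetical wrapper that reuses its buffer and reallocates only when it must.
struct DynamicBuffer {
    buffer: Buffer,
    allocations: u32, // how many times memory had to be (re)allocated
}

impl DynamicBuffer {
    fn new(capacity: usize) -> Self {
        DynamicBuffer { buffer: Buffer::with_capacity(capacity), allocations: 1 }
    }

    // Needs `&mut self`, so it has to run before a pass borrows the buffer.
    fn upload(&mut self, data: &[u8]) {
        if data.len() > self.buffer.capacity {
            // Too small: allocate a bigger buffer, rounded up to a power of two.
            let capacity = data.len().next_power_of_two();
            self.buffer = Buffer::with_capacity(capacity);
            self.allocations += 1;
        }
        // Common path: overwrite the existing allocation.
        self.buffer.contents.clear();
        self.buffer.contents.extend_from_slice(data);
    }
}

fn demo() -> (u32, usize) {
    let mut vertices = DynamicBuffer::new(16);
    vertices.upload(&[0; 8]); // fits, reuse
    vertices.upload(&[0; 16]); // fits, reuse
    vertices.upload(&[0; 20]); // too big, reallocate with capacity 32
    vertices.upload(&[0; 24]); // fits again, reuse
    (vertices.allocations, vertices.buffer.capacity)
}
```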

For the data uploads, the group is still talking about the ways to do it. Hopefully, soon...

kvark on 27 Apr 2020

FYI, you can emulate the ArcRenderPass API using arenas in user space, and it should basically be just as efficient as the equivalent WGPU API (unless core implementation details change a lot to increment internal reference counts before the RenderPass is dropped).

struct ArcRenderPass<'a> {
    arena: &'a TypedArena<Arc<Buffer>>,
    render_pass: RenderPass<'a>,
}

impl<'a> ArcRenderPass<'a> {
    fn set_vertex_buffer(&mut self, slot: u32, buffer: Arc<Buffer>, offset: BufferOffset) {
        // The arena takes ownership of the Arc, so the returned reference lives for 'a.
        let buffer = self.arena.alloc(buffer);
        self.render_pass.set_vertex_buffer(slot, buffer, offset);
    }
}

fn blah<'a>(encoder: &'a mut CommandEncoder) {
    // Declared before the pass so that it is dropped after the pass.
    let arena = TypedArena::new();
    let mut arc_render_pass = ArcRenderPass {
        arena: &arena,
        render_pass: encoder.begin_render_pass(..),
    };
    // ... Do stuff; you can pass around &mut ArcRenderPass and call set_vertex_buffer on owned `Arc`s.
}

pythonesque on 20 May 2020

@pythonesque it would be wonderful if we had that used by one of the examples. Would you mind doing a PR for this? We'd then be able to point users to working code instead of this snippet.

kvark on 20 May 2020

Just to provide another data point, I hit this issue as well.

Consider that I want my users to be able to simply call an API to render high-level objects without worrying about which buffers to use. There are 2 options:

1) Define a buffer size upfront. If they exceed it at any point when issuing a high-level render call, internally end the current render pass, and then set up the same render pass again so we can reuse the same buffer.
2) Instead of ending the render pass, create buffers dynamically within the same render pass to avoid setting up identical state.

I'm not sure which is more performant. Option (1) seems good, but we are blocked until the GPU has finished its job, effectively losing parallelism (unless I force the user to go full async and/or use double-buffering). With option (2), we're infinitely buffered but pay the cost of allocations.

Ultimately, with the current lifetime constraints, option (2) is not possible. So we're forced to go for option (1).
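
For illustration, the bookkeeping behind option (1) might look like the sketch below: given the size of each draw's data and a fixed buffer capacity, decide where the current pass has to end and an identical one begin. `split_into_passes` is a hypothetical helper, and the sketch deliberately ignores the synchronization question raised above (when the buffer can safely be overwritten).

```rust
// Returns, for each render pass, the sizes of the draws recorded in it.
fn split_into_passes(draw_sizes: &[usize], capacity: usize) -> Vec<Vec<usize>> {
    let mut passes = vec![Vec::new()];
    let mut used = 0;
    for &size in draw_sizes {
        assert!(size <= capacity, "a single draw must fit in the buffer");
        if used + size > capacity {
            // Buffer is full: end the current pass and begin an identical one.
            passes.push(Vec::new());
            used = 0;
        }
        used += size;
        passes.last_mut().unwrap().push(size);
    }
    passes
}
```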

As a side point, it is a little clunky using the current buffer mapping API to go with option (1). I referred to #9 and saw this advice from @kvark:

> That's why the current rough and effective way to update data is to go through create_buffer_mapped.

which seemed to contradict the approach to its core.

DefinitelyNotRobot on 3 Oct 2020

@DefinitelyNotRobot I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:

> it seems good but we are blocked until the GPU has finished its job, effectively losing parallelism

Also, this part:

> As a side point, it is a little clunky using the current buffer mapping API to go with option (1).

Issue #9 is actually no longer a problem. The upstream WebGPU API went in this direction, and it's a part of wgpu-0.6.

Did you consider using the TypedArena<Arc<wgpu::Buffer>> and stuff like https://github.com/gfx-rs/wgpu-rs/issues/188#issuecomment-631143941 suggests?

kvark on 4 Oct 2020

> I don't think I understand your thoughts clearly. For example, this part seems to be unrelated to the issue at hand:

Sorry about that. I was trying to illustrate the 2 designs I could go with for my API, along with their pros and cons, and meant to say that option (2) was not even viable because of RenderPass's lifetime constraints.

> Did you consider using the TypedArena<Arc<wgpu::Buffer>> and stuff like #188 (comment) suggests?

I was hesitant because that would mean an allocation for TypedArena on every render call, but on second thought, it seems I could haul it out and store it in my renderer object instead. I'll give that a try.

DefinitelyNotRobot on 4 Oct 2020
