Gfx: [RFC] the case of OutOfMemory errors

Created on 16 Jan 2019 · 6 Comments · Source: gfx-rs/gfx

This is a logical follow-up to #2568, which discusses DeviceLost.

A lot of Vulkan methods are allowed to return OOM for either host or device memory. I'll try to argue against exposing those 1:1 in gfx-hal.

Let's start with host OOM. Since Rust's standard library also uses host memory, and it panics on OOM instead of returning an error, I believe it makes sense for gfx-hal to follow the same behavior. That will hold true until there is some kind of no_std support for gfx-hal, which we don't even have on the roadmap.

Now to device OOM. I'd argue that we should have this error returned only when the methods are trying to allocate anything in device memory, such as allocate_memory(), pool creation, and a few more.
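
To make the proposal concrete, here is a minimal sketch of how a backend could translate raw result codes under this policy. The names `RawResult`, `OutOfDeviceMemory` and `map_alloc_result` are hypothetical and chosen for illustration only; they are not the actual gfx-hal definitions.

```rust
// Hypothetical types for illustration; not the actual gfx-hal definitions.

/// Raw result codes as a Vulkan-like backend might report them.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RawResult {
    Success,
    OutOfHostMemory,
    OutOfDeviceMemory,
}

/// Error surfaced only by methods that allocate device memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct OutOfDeviceMemory;

/// Translate a raw result for an allocating method:
/// host OOM panics, device OOM is returned to the caller.
pub fn map_alloc_result(raw: RawResult) -> Result<(), OutOfDeviceMemory> {
    match raw {
        RawResult::Success => Ok(()),
        RawResult::OutOfHostMemory => panic!("out of host memory"),
        RawResult::OutOfDeviceMemory => Err(OutOfDeviceMemory),
    }
}
```

With something like this, only the allocating entry points would carry a `Result` for OOM in their signature, and every other method would route host OOM into the panic branch.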


All 6 comments

I don't have any objections regarding a panic! for host OOM. Seems sensible to me if there's no way to handle out of memory errors in the standard library anyway.

As for device OOM errors, if I recall correctly, your argument for not returning them was that they are difficult (or practically impossible) to handle correctly, because every function call could theoretically fail with this error, and we should use panics instead for such failures. I looked at the Vulkan specification to see which functions return this error, and I think there are about 60, which is indeed a lot.

If we limited device OOM errors to only functions that allocate large amounts of memory, can we be sure that the allocation actually happens in that function, and is not deferred until vkQueueSubmit for example?

@aleksijuvani thanks for the comments!
We could use some common sense here. If a function is relatively heavy (e.g. submit) and already returns a Result, then we might as well allow it to return device OOM. If a function is seemingly unrelated and only returns OOM, e.g. create_pipeline_cache, then I think we can just handle it internally, i.e. detect OOM in the Vulkan backend and create an empty pipeline cache if that happens, maybe warn! the user, etc.
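
A rough sketch of what "handle it internally" could look like. Everything here is hypothetical: `raw_create_cache` stands in for the raw backend call, the `device_has_memory` flag only simulates the failure, the empty cache is a plain host-side placeholder, and `eprintln!` is used in place of the `log` crate's `warn!` to keep the snippet self-contained.

```rust
// Hypothetical backend sketch; names are illustrative, not gfx-hal's API.

#[derive(Debug, PartialEq, Eq)]
pub enum CacheError {
    OutOfDeviceMemory,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct PipelineCache {
    pub data: Vec<u8>,
}

/// Stand-in for the raw backend call, which can fail with device OOM.
/// `device_has_memory` simulates whether the device allocation succeeds.
fn raw_create_cache(
    initial: &[u8],
    device_has_memory: bool,
) -> Result<PipelineCache, CacheError> {
    if device_has_memory {
        Ok(PipelineCache { data: initial.to_vec() })
    } else {
        Err(CacheError::OutOfDeviceMemory)
    }
}

/// Infallible wrapper: on device OOM, warn and fall back to an empty cache.
pub fn create_pipeline_cache(initial: &[u8], device_has_memory: bool) -> PipelineCache {
    match raw_create_cache(initial, device_has_memory) {
        Ok(cache) => cache,
        Err(CacheError::OutOfDeviceMemory) => {
            eprintln!("warning: device OOM while creating pipeline cache; using an empty one");
            PipelineCache::default()
        }
    }
}
```

The caller never sees the error, which is exactly the trade-off under discussion: the signature gets simpler, but the user loses the chance to react.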

Handling device OOM internally seems a bit off, because any subsequent call will return OOM as well. Clearly the user can handle this by reducing device memory usage, cleaning up caches, etc.
Host OOM can happen in the same function calls, so we won't reduce signature complexity if we leave only device OOM.

While I can't disagree that panicking on those errors can be handy for fast prototyping, we can't put them back without a breaking change, so we have to settle on something before 1.0.
And for people who want to handle those errors, panicking would mean that they can't use gfx-hal at all. In the bright future, users will want to (and be able to) handle host OOM too.

tl;dr: we can remove them for now, but I think we would have to bring them back later, so why bother?

> If a function is seemingly unrelated and only returns OOM, e.g. create_pipeline_cache, then I think we can just handle it internally, i.e. detect OOM in the Vulkan backend and create an empty pipeline cache if that happens, maybe warn! the user, etc.

This doesn't make sense to me. With regard to OOM, functions should either panic! or return a Result, not try to handle it internally.

Thinking about this a bit more, I think it's a bad idea to look at the functions in isolation to decide if they should return a device OOM error.

Let's say that I have a larger function that

  1. allocates some large buffer on the device and
  2. waits on some fence

You have to take into account that if the device memory is close to full, you won't necessarily fail on the large allocation, but waiting on the fence (since it is allowed to allocate device memory) could be the straw that breaks the camel's back.

If the fence wait function is changed so that it no longer returns device OOM errors, the user can no longer handle this gracefully, e.g. retry with a smaller buffer, or clean up some caches or what have you.
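
For illustration, the kind of graceful handling I have in mind might look like the sketch below. `Device` is a toy model with a fixed memory budget, and `allocate_with_fallback` is a hypothetical helper; neither is a gfx-hal type. The point is only that this pattern requires the error to reach the caller.

```rust
// Hypothetical sketch; `Device` is a toy model, not a gfx-hal type.

#[derive(Debug, PartialEq, Eq)]
pub struct OutOfDeviceMemory;

/// Toy device that tracks a fixed budget of free device memory in bytes.
pub struct Device {
    pub free: u64,
}

impl Device {
    /// Fails with device OOM when the request exceeds the free budget.
    pub fn allocate(&mut self, size: u64) -> Result<u64, OutOfDeviceMemory> {
        if size > self.free {
            return Err(OutOfDeviceMemory);
        }
        self.free -= size;
        Ok(size)
    }
}

/// Halve the request on device OOM until it fits,
/// giving up once the next attempt would drop below `min_size`.
pub fn allocate_with_fallback(
    device: &mut Device,
    mut size: u64,
    min_size: u64,
) -> Result<u64, OutOfDeviceMemory> {
    loop {
        match device.allocate(size) {
            Ok(allocated) => return Ok(allocated),
            Err(OutOfDeviceMemory) if size / 2 >= min_size => size /= 2,
            Err(err) => return Err(err),
        }
    }
}
```

If any call in the sequence swallows or panics on device OOM, a retry loop like this has nothing to match on.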

I suppose you could crash later anyway when rendering a frame, so maybe it's impossible to handle.
