It is common for multiple users of the same machine to share a single GPU. For example, several users may each train their own neural net on the same GPU, and as long as the aggregate memory usage does not exceed the GPU's memory, this should work.
Some thoughts and questions.
What's the best way to support this in the API? For example, something like
@ray.remote(num_gpus=0.5)
def f():
...
It's more or less clear what to do if the user requests less than one GPU, but what if they request 1.5 GPUs? For that to be meaningful, the specification would probably have to be more precise (for example, 3/4 of each of 2 GPUs, or 1 full GPU plus 1/2 of another GPU). Does this come up in practice? Or should we just not support this case?
Presumably it'd be the responsibility of the user to not actually use more than 1/2 of a GPU if that's what the task requested. It probably wouldn't be easy for Ray to enforce this. If the task does use more, then at worst this should cause that task to fail (and possibly any other task that was using the same GPU). Ideally the failure would take the form of a Python exception.
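One cooperative way for the user to stay within their share is to cap the framework's memory use inside the task. A minimal sketch, assuming the task happens to use TensorFlow 1.x (the 0.5 fraction is just illustrative):

import tensorflow as tf

def make_session(memory_fraction=0.5):
    # Ask TensorFlow to claim only a fraction of the GPU's memory rather than
    # its default behavior of grabbing nearly all of it.
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=memory_fraction)
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

Note that this limits memory, not compute, so two co-located tasks can still slow each other down.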
The following workaround could work (at least for simple workloads).
Say there are 4 GPUs and each task requires 1/2 GPU. Then we could start Ray with --num-gpus=8, and then if a task is given GPU ID X, it can instead use device X mod 4. The task would have to set CUDA_VISIBLE_DEVICES appropriately.
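A rough sketch of that workaround (NUM_PHYSICAL_GPUS and the task body are placeholders):

import os
import ray

NUM_PHYSICAL_GPUS = 4  # actual devices on the machine (placeholder)

# Tell Ray there are twice as many GPUs as physically exist, so that two
# tasks can be scheduled per device.
ray.init(num_gpus=2 * NUM_PHYSICAL_GPUS)

@ray.remote(num_gpus=1)
def f():
    # Ray hands out a "virtual" GPU ID; map it back to a physical device.
    virtual_id = ray.get_gpu_ids()[0]
    os.environ["CUDA_VISIBLE_DEVICES"] = str(virtual_id % NUM_PHYSICAL_GPUS)
    ...  # the actual GPU work goes here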
Follow up about this workaround: Ray seems to set CUDA_VISIBLE_DEVICES (to the "virtual" GPU ID). So if you're trying to infer the number of physical GPUs to do this workaround, make sure to first clear that environment variable.
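So the start of a task in the workaround above would look roughly like this (torch is used here purely as a convenient way to count devices; any GPU-enumeration method would do):

import os

# Ray has already set CUDA_VISIBLE_DEVICES to the virtual ID, so clear it
# before counting the physical devices on this machine.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

import torch
num_physical_gpus = torch.cuda.device_count()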
This is a pretty elegant workaround, but considering it's been over a year since this issue was originally opened, perhaps there is a better approach by now?
Anyone got a nice hack that uses custom resources?
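For context, the obvious shape of such a hack would be something like the following sketch; the custom resource only reserves capacity, it doesn't tell the task which physical device to use, so the modulo trick is still needed (the name gpu_slot is made up):

import ray

# Advertise 8 "half-GPU" slots on a machine with 4 physical GPUs.
ray.init(num_gpus=4, resources={"gpu_slot": 8})

@ray.remote(resources={"gpu_slot": 1})
def f():
    # A custom resource only limits concurrency; it does not assign a device,
    # so the task still has to pick a GPU and set CUDA_VISIBLE_DEVICES itself.
    ...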
@rueberger Any chance you got fed up and made a custom-resources hack? :) This isn't very tractable for multiple nodes.
What goes wrong on multiple nodes?
This has already been implemented in our rewrite of the local scheduler, but it isn't on by default yet.
The problem with the standard workaround is
But sounds like the new feature is on its way, thanks!
This is fixed in the Xray backend.
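For anyone landing here later: with fractional GPU support enabled, the API proposed at the top works directly. A minimal usage sketch (assuming a machine with 4 GPUs):

import ray

ray.init(num_gpus=4)

@ray.remote(num_gpus=0.5)
def f():
    # Ray assigns a GPU and sets CUDA_VISIBLE_DEVICES for the task; two such
    # tasks can share one physical device.
    return ray.get_gpu_ids()

print(ray.get([f.remote() for _ in range(8)]))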