Ray: Have ray check for GPU usage by other users

Created on 24 Aug 2019 · 7 comments · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Ray installed from (source or binary): pip install ray
  • Ray version: 0.8.0.dev3
  • Python version: 3.7.3
  • Exact command to reproduce: Launch a job on a cluster whose GPUs are already in use by other users

Describe the problem

Currently Ray keeps track of the GPUs (and other resources) that _it_ is using. However, to my understanding, and based on the experiments I have run, it does not check whether _other_ users are using the GPUs before scheduling work onto them. As a consequence, it is possible to start a GPU application and have it terminate immediately with a "CUDA out of memory" error when the GPU is already under heavy use.

One workaround would be to write Ray remote functions defensively, monitoring (e.g. via GPUtil) the usage of the GPUs they have been assigned by Ray and only allocating memory once enough is available. However, it could take a long time for the other applications using those GPUs to finish, and meanwhile there might be other nodes in the cluster whose GPUs are idle and could run the job right away.
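A minimal sketch of this defensive pattern (not part of Ray; the memory threshold and polling interval are illustrative, and it assumes the IDs returned by ray.get_gpu_ids() match the physical indices that GPUtil reports):

import time

import GPUtil
import ray

@ray.remote(num_gpus=1)
def train_defensively(min_free_mb=4000, poll_seconds=10):
    # GPUs that Ray assigned to this task (assumed to match GPUtil's indices).
    assigned = {int(i) for i in ray.get_gpu_ids()}
    while True:
        gpus = [g for g in GPUtil.getGPUs() if g.id in assigned]
        if gpus and all(g.memoryFree >= min_free_mb for g in gpus):
            break  # the other users' workloads have freed enough memory
        time.sleep(poll_seconds)
    # ... safe to allocate GPU memory and start training here ...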

Is there a way to have Ray check current GPU usage before assigning GPUs to remote functions or actors? If not, this would be a very useful feature.

Source code / logs

enhancement stale


All 7 comments

I do agree this is incredibly useful - are you using Ray primarily via RLlib or Tune?

How beneficial would a utility be that determines the available (unused) GPUs at the beginning of execution, and then draws only from that subset for the rest of the execution?

This is the easiest bandaid I can think of. Otherwise, if you need this to be checked continuously during execution (i.e., if other users have varying workloads), then we should revisit the way we handle GPU scheduling to support it.
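A rough sketch of such a startup-time utility, built outside of Ray with GPUtil (the load and memory thresholds are illustrative assumptions): detect GPUs that are currently idle, then restrict Ray to that subset by setting CUDA_VISIBLE_DEVICES before ray.init().

import os

import GPUtil
import ray

# GPUs with essentially no load and no memory in use by other users.
free_gpu_ids = GPUtil.getAvailable(order="first", limit=8,
                                   maxLoad=0.05, maxMemory=0.05)
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in free_gpu_ids)

# Ray only sees (and schedules onto) the GPUs left visible above.
ray.init(num_gpus=len(free_gpu_ids))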

I'm primarily interested in Tune.

I think it would definitely be an improvement to determine GPU usage at the beginning of execution. However, I work in an environment where GPU usage varies a lot, so it is quite possible that many more GPUs would be in use at the beginning of execution than shortly afterwards. In that case it would be ideal if Ray could detect the extra free GPUs and make use of them.

Personally, I have a workaround (outside of Ray) that is sufficient for me at the moment. However, I thought I would suggest this since I think Ray/Tune is a great framework, and I can see something like this being of interest to other users like myself.


I'm primarily interested in Tune, too.
I'm particularly interested in the scenario where multiple users cooperate to use Tune.

A flexible framework should be able to deal with the following scenario without pain:

  • There are 8 GPUs in a machine. At first, Alice and Bob share the machine. Alice mainly uses [0, 1, 2, 3] while Bob mainly uses [4, 5, 6, 7].

  • While Alice is on vacation, Bob uses all of the GPUs for extreme hyper-parameter search which may take weeks.

  • Alice returns because an excellent idea strikes her and she thinks it may win her the best paper prize. She asks for 6 GPUs to make up for Bob's greediness.

  • Two days later, experiments show Alice's idea is worthless. Alice plans another trip out of disappointment. Thus Bob greedily takes all GPUs again.

By "without pain", I mean Bob doesn't need to kill his process throughout the whole story.

Here is a solution I came up with, which Ray could implement with a few modifications:

  • One can specify which GPUs to use when Ray is initialized (currently we can only specify num_gpus). For example:
ray.init(gpu_ids=[0, 1, 2, 3])
  • One can easily change which GPUs Ray uses via the web UI provided by Ray. For example:
ray.init(gpu_ids=[0, 1, 2, 3], include_webui=True)

In the web UI, we can add GPUs or mark some GPUs as unavailable.

This way, Bob initializes Ray with all GPUs. When Alice returns, Bob marks 6 GPUs as unavailable. After the work running on those GPUs exits, Alice can carry out her best-paper plan. When Alice leaves, Bob adds those GPUs back to Ray and carries on with his hyper-parameter search.

To achieve this, Ray only needs to maintain the IDs of the GPUs available to it, which I think can be implemented with a few modifications.

The modifications can be made backward-compatible, too. Just add a parameter to the init function:

def init(..., gpu_ids=None):
    gpu_ids = gpu_ids or available_gpus[:num_gpus]
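Not a real Ray API, but a hypothetical wrapper sketching the semantics proposed above (the helper name init_with_gpu_ids and the GPUtil-based fallback are my own assumptions):

import os

import GPUtil
import ray

def init_with_gpu_ids(gpu_ids=None, num_gpus=None, **kwargs):
    # Fall back to the existing num_gpus behaviour when gpu_ids is omitted.
    if gpu_ids is None:
        available = [g.id for g in GPUtil.getGPUs()]
        gpu_ids = available[:num_gpus] if num_gpus else available
    # Expose exactly these GPUs to Ray by restricting device visibility.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return ray.init(num_gpus=len(gpu_ids), **kwargs)

Ray would still need the web-UI part (marking GPUs unavailable at runtime) to cover the full scenario; a wrapper like this only handles the initialization side.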

@richardliaw Do you have any suggestions on how to avoid CUDA out-of-memory errors when using ray.tune?

Smaller batch size, maybe? Are you using fractional GPUs?

Smaller batch size, maybe? Are you using fractional GPUs?

@richardliaw No, I use multiple GPUs for a trial: @ray.remote(num_gpus=8, max_calls=1) def train_model().
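For reference, a small sketch of what "fractional GPUs" means in Ray (the 0.25 share and the batch_size parameter are illustrative placeholders, not values from this thread): several such tasks can be packed onto one physical GPU, so each task must keep its memory footprint within its share or it will hit CUDA out-of-memory errors.

import ray

@ray.remote(num_gpus=0.25, max_calls=1)
def train_model(batch_size=8):
    # Ray schedules up to four of these tasks onto one GPU; the process still
    # sees the whole device, so the model must keep its memory use small.
    ...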

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within the next 14 days, the issue will be closed!

  • If you'd like to keep the issue open, just leave any comment, and the stale label will be removed!
  • If you'd like to get more attention to the issue, please tag one of Ray's contributors.

You can always ask for help on our discussion forum or Ray's public slack channel.
