Ray: [rllib] Custom model cannot use GPU for driver when running PPO algorithm

Created on 2 Jan 2020 · 16 comments · Source: ray-project/ray

What is the problem?

When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].
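Note the device list at the end of the error: TensorFlow registered only CPU, XLA_CPU, and XLA_GPU devices, with no plain GPU:0. When tracing colocation failures like this, enabling device-placement logging can help (an optional aid, not part of the original report):

```python
# Optional debugging aid (not part of the original report): ask
# TensorFlow to log the device each op is placed on, which makes
# colocation errors like the one above easier to trace.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
```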

Ray version and other system information (Python version, TensorFlow version, OS):

Ray 0.8.0
Python 3.6.6
tensorflow-gpu 2.0.0
Fedora 28

Does the problem occur on the latest wheels?

Yes, although it gives a different error, and an additional combination of parameters now fails: custom_keras_model.py with num_gpus set to 1, which does not fail on Ray 0.8.0. The error on the latest wheel is the following:

File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__ "GPUs were assigned to this worker by Ray, but " RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.

Summary

| Ray version | script | num_gpus | works? |
| ------------- |-------------| -------- | ------ |
| 0.8.0 | custom_keras_model.py | 0 | Yes |
| 0.8.0 | custom_keras_model.py | 1 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 0 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 1 | No |
| latest wheel | custom_keras_model.py | 0 | Yes |
| latest wheel | custom_keras_model.py | 1 | No |
| latest wheel | custom_keras_rnn_model.py | 0 | Yes |
| latest wheel | custom_keras_rnn_model.py | 1 | No |
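In every row, num_gpus refers to the value injected into the example's trainer config. A minimal, self-contained sketch of the pattern being toggled is below; it uses RLlib's built-in LSTM wrapper rather than the examples' custom Keras models, so it is an approximation of the failing scripts, not the scripts themselves:

```python
# Minimal sketch of the num_gpus toggle exercised in the table above.
# This uses RLlib's built-in LSTM wrapper instead of the examples'
# custom Keras models so the snippet is self-contained; treat it as an
# approximation, not the exact failing script.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 1},
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,                # the column toggled in the table
        "model": {"use_lstm": True},  # recurrent model, as in the RNN example
    },
)
```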

Reproduction

Please note that this only reproduces the last row in the table. In order to test custom_keras_model.py, you also need to modify the algorithm used at the top of the file.

python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0

# Install [rllib] dependencies

pip3 install ray[rllib]==0.8.0
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

Labels: P1, bug, rllib

All 16 comments

Did you install the cuDNN driver so that TF-gpu can use the available GPU? It appears that although you've installed TF-gpu, you don't have the drivers needed to enable hardware acceleration. Check out TensorFlow's install guide under Software Requirements.

> Did you install the cuDNN driver so that TF-gpu can use the available GPU? It appears that although you've installed TF-gpu, you don't have the drivers needed to enable hardware acceleration. Check out TensorFlow's install guide under Software Requirements.

Yes, and it works with other configurations, e.g. PPO with a non-custom model on a GPU; I can see the GPU being utilized in nvidia-smi.

Does the reproduction script work correctly for you?

Yes, I can reproduce your results with rllib==0.8.0 running the custom_keras_rnn_model.py script. It looks like the GPU is recognized as an XLA_GPU and not a standard GPU. After looking around the web, it appears to be an incompatibility between TF 2.0 and the underlying cuDNN/CUDA drivers. I have CUDA 10.1 and cuDNN 7.6.2.24, which does not appear to be a supported combination in this list (see the bottom of the page for TF2-gpu). You may have to go down to CUDA 10.0 and cuDNN 7.4.
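One way to confirm that symptom (an aside; the commenter's actual check isn't shown) is to list the local devices and inspect their device_type:

```python
# Lists the devices TensorFlow has registered. With a mismatched
# CUDA/cuDNN pairing, the GPU typically appears only as XLA_GPU with no
# plain GPU entry, which matches the symptom described above.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)
```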

Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!

Glad that worked! I may have to do the same in the near future.

This issue isn't resolved for me with the following two setups:

Setup 1

Ray 0.8.2
Python 3.6.10
tensorflow-gpu 2.1.0
CUDA 10.1
cuDNN 7.6.2

Setup 2

Ray 0.8.2
Python 3.6.10
tensorflow-gpu 2.0.0
CUDA 10.0
cuDNN 7.4.2

I'm getting this error:

  File "python/ray/_raylet.pyx", line 437, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 450, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 430, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 86, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 447, in __init__
    super().__init__(config, logger_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 172, in __init__
    self._setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 591, in _setup
    self._init(self.config, self.env_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 112, in _init
    self.optimizer = make_policy_optimizer(self.workers, config)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/ppo/ppo.py", line 95, in choose_policy_optimizer
    shuffle_sequences=config["shuffle_sequences"])
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 120, in __init__
    self.per_device_batch_size, policy.copy))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 91, in __init__
    len(input_placeholders)))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 291, in _setup_device
    graph_obj = self.build_graph(device_input_slices)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/dynamic_tf_policy.py", line 256, in copy
    TFPolicy._initialize_loss(instance, loss, loss_inputs)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/tf_policy.py", line 231, in _initialize_loss
    self._sess.run(tf.global_variables_initializer())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Constraining by assigned device should not cause an error. Original root's assigned device name: /job:localhost/replica:0/task:0/device:GPU:0 node's assigned device name "/job:localhost/replica:0/task:0/device:GPU:1. Error: Cannot merge devices with incompatible ids: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:GPU:1'

Related: https://github.com/ray-project/ray/issues/7747

I'm not sure this actually worked. The substitution line added "num_gpus" at line 158, but it should have been added at line 159. If you add it at line 158, GPUs are not used at all.

If you use this reproduction script, you still get the errors reported in https://github.com/ray-project/ray/issues/7819:

#!/usr/bin/env bash

python3 -m venv .env
source .env/bin/activate
pip install --upgrade pip setuptools wheel
pip install tensorflow-gpu==2.1.0
# Install [rllib] dependencies
pip install ray==0.8.2
pip install ray[rllib]==0.8.2
pip install pandas
sed -i '159i \            "num_gpus": 2,' .env/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

sudo apt-get update && sudo apt-get install -y --no-install-recommends --allow-unauthenticated \
     cuda-command-line-tools-10-1 \
     cuda-cudart-dev-10-1 \
     cuda-cufft-dev-10-1 \
     cuda-curand-dev-10-1 \
     cuda-cusolver-dev-10-1 \
     cuda-cusparse-dev-10-1 \
     libcudnn7=7.6.2.24-1+cuda10.1 \
     libnccl2=2.4.7-1+cuda10.1 \
     libnccl-dev=2.4.7-1+cuda10.1 \
     && apt-get remove -y \
     libcublas10=10.1.0.105-1 \
     libcublas-dev=10.1.0.105-1

python .env/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

Hi @felixs8696, sorry for the late reply. I am getting similar errors when running a custom RNN model and specifying more than one GPU in the training script.
Setup:
Python: 3.6.9
Ray 0.8.0.dev6
Tensorflow-gpu 2.0.0
CUDA 10.1
CuDNN 7.6.2.24

Error:
tensorflow.python.framework.errors_impl.InternalError: Constraining by assigned device should not cause an error. Original root's assigned device name: /job:localhost/replica:0/task:0/device:GPU:0 node's assigned device name "/job:localhost/replica:0/task:0/device:GPU:1. Error: Cannot merge devices with incompatible ids: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:GPU:1'

I had assumed that this was due to the TF-gpu 2.0.0 compatibility issue with the CuDNN/CUDA driver versions. But I just updated to TF-gpu 2.1.0 and I'm getting the same error.

Seems like this is a RLlib issue.
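Until the underlying placement bug is fixed, one possible workaround (an assumption, not something suggested in this thread) is to expose only a single GPU to the process so the device-merge conflict cannot arise:

```python
# Possible workaround (an assumption, not from this thread): expose only
# one GPU to the process so the multi-GPU optimizer never tries to place
# ops on GPU:1. This must run before TensorFlow or Ray initialize CUDA.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```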

@ericl I'm having this exact problem too (same TF version, CUDA version, and cuDNN) over multiple machines. Please reopen the issue.

I have the same problem with:
Python: 3.6.6
Ray 0.8.5
Tensorflow-gpu 2.1.0
CUDA 10.0
CuDNN 7.6.5

Interestingly, with the same config I was able to instantiate a Trainer() and call trainer.train(). Using ray.tune threw the above error.
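The two launch paths being contrasted look roughly like this (a sketch; the commenter's actual config is not shown):

```python
# Sketch of the two launch paths contrasted above; the config contents
# are assumed, since the commenter's actual config is not shown.
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
config = {"env": "CartPole-v0", "num_gpus": 1}

# Path 1: instantiate the trainer directly; reported to work.
trainer = PPOTrainer(config=config)
trainer.train()

# Path 2: launch through tune; reported to throw the device-merge error.
tune.run("PPO", stop={"training_iteration": 1}, config=config)
```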

I just tried on a DLAMI machine + TF 2.2 + latest Ray, and custom_keras_model.py with num_gpus: 1 runs for me.

Perhaps the latest TF 2.2 fixes the issue?

Hey @ericl. Thanks for checking in on this issue. I think the problem is with using multiple GPUs. I can run a custom model with one GPU but not with more than one.

That works for me as well (with --run=PPO; DQN does not support multi-GPU).

This sounds like a similar bug that was fixed in the TF 2.2 nightly: https://github.com/tensorflow/tensorflow/issues/31318

> I'm not sure this actually worked. The substitution line added "num_gpus" at line 158, but it should have been added at line 159. If you add it at line 158, GPUs are not used at all.

@felixs8696 In the wheel at that time, line 158 was correct. Unfortunately, this repro script breaks because the content behind the latest-wheel link changes over time. Thanks for pointing this out; I'll try to write less brittle repro scripts next time.

Closing this for now; feel free to reopen if someone can reproduce with TF 2.2.
