Ray: [rllib] Custom model cannot use GPU for driver when running PPO algorithm

Created on 2 Jan 2020 · 16 comments · Source: ray-project/ray

What is the problem?

When using the combination of a custom model, PPO, and a GPU for the driver, the following error appears:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation default_policy/lstm/bias/Initializer/concat: Could not satisfy explicit device specification '' because the node {{colocation_node default_policy/lstm/bias/Initializer/concat}} was colocated with a group of nodes that required incompatible device '/job:localhost/replica:0/task:0/device:GPU:0'. All available devices [/job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0].
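Note the device list at the end of the error: TensorFlow registered only CPU, XLA_CPU, and XLA_GPU devices, with no plain GPU:0. When tracing colocation failures like this, enabling device-placement logging can help (an optional aid, not part of the original report):

```python
# Optional debugging aid (not part of the original report): ask
# TensorFlow to log the device each op is placed on, which makes
# colocation errors like the one above easier to trace.
import tensorflow as tf

tf.debugging.set_log_device_placement(True)
```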

Ray version and other system information (Python version, TensorFlow version, OS):

Ray 0.8.0
Python 3.6.6
tensorflow-gpu 2.0.0
Fedora 28

Does the problem occur on the latest wheels?

Yes, although it gives a different error, and an additional combination of parameters now fails: custom_keras_model.py with num_gpus set to 1, which does not fail on Ray 0.8.0. The error on the latest wheel is the following:

File "project/venv/lib64/python3.6/site-packages/ray/rllib/evaluation/rollout_worker.py", line 356, in __init__ "GPUs were assigned to this worker by Ray, but " RuntimeError: GPUs were assigned to this worker by Ray, but TensorFlow reports GPU acceleration is disabled. This could be due to a bad CUDA or TF installation.

Summary

| Ray version | script | num_gpus | works? |
| ------------- |-------------| -------- | ------ |
| 0.8.0 | custom_keras_model.py | 0 | Yes |
| 0.8.0 | custom_keras_model.py | 1 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 0 | Yes |
| 0.8.0 | custom_keras_rnn_model.py | 1 | No |
| latest wheel | custom_keras_model.py | 0 | Yes |
| latest wheel | custom_keras_model.py | 1 | No |
| latest wheel | custom_keras_rnn_model.py | 0 | Yes |
| latest wheel | custom_keras_rnn_model.py | 1 | No |
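In every row, num_gpus refers to the value injected into the example's trainer config. A minimal, self-contained sketch of the pattern being toggled is below; it uses RLlib's built-in LSTM wrapper rather than the examples' custom Keras models, so it is an approximation of the failing scripts, not the scripts themselves:

```python
# Minimal sketch of the num_gpus toggle exercised in the table above.
# This uses RLlib's built-in LSTM wrapper instead of the examples'
# custom Keras models so the snippet is self-contained; treat it as an
# approximation, not the exact failing script.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 1},
    config={
        "env": "CartPole-v0",
        "num_gpus": 1,                # the column toggled in the table
        "model": {"use_lstm": True},  # recurrent model, as in the RNN example
    },
)
```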

Reproduction

Please note that this only reproduces the last row in the table. In order to test custom_keras_model.py, you also need to modify the algorithm used at the top of the file.

python3 -m venv venv
. venv/bin/activate
pip3 install --upgrade pip setuptools wheel
pip3 install tensorflow-gpu==2.0.0

# Install [rllib] dependencies

pip3 install ray[rllib]==0.8.0
pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.9.0.dev0-cp36-cp36m-manylinux1_x86_64.whl
sed -i '158i "num_gpus": 1,' venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py
python3 venv/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

Labels: P1, bug, rllib

All 16 comments

Did you install the cuDNN driver so that TF-gpu can use the available GPU? It appears that although you've installed TF-gpu, you don't have the drivers needed to enable hardware acceleration. Check out TensorFlow's install guide under Software Requirements.

> Did you install the cuDNN driver so that TF-gpu can use the available GPU? It appears that although you've installed TF-gpu, you don't have the drivers needed to enable hardware acceleration. Check out TensorFlow's install guide under Software Requirements.

Yes, and it works with other configurations, e.g. PPO with a non-custom model on a GPU; I can see the GPU being utilized in nvidia-smi.

Does the reproduction script work correctly for you?

Yes, I can reproduce your results with rllib==0.8.0 running the custom_keras_rnn_model.py script. It looks like the GPU is recognized as an XLA_GPU and not a standard GPU. After looking around the web, it appears to be an incompatibility between TF 2.0 and the underlying cuDNN/CUDA drivers. I have CUDA 10.1 and cuDNN 7.6.2.24, which does not appear to be a supported combination in this list (see the bottom of the page for TF2-gpu). You may have to go down to CUDA 10.0 and cuDNN 7.4.
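One way to confirm that symptom (an aside; the commenter's actual check isn't shown) is to list the local devices and inspect their device_type:

```python
# Lists the devices TensorFlow has registered. With a mismatched
# CUDA/cuDNN pairing, the GPU typically appears only as XLA_GPU with no
# plain GPU entry, which matches the symptom described above.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    print(dev.device_type, dev.name)
```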

Correct, the repro script above works with the combination of CUDA 10.0.130.1 and cuDNN v7.4.2. Thanks for the help!

Glad that worked! I may have to do the same in the near future.

This issue isn't resolved for me with the following two setups:

Setup 1

Ray 0.8.2
Python 3.6.10
tensorflow-gpu 2.1.0
CUDA 10.1
cuDNN 7.6.2

Setup 2

Ray 0.8.2
Python 3.6.10
tensorflow-gpu 2.0.0
CUDA 10.0
cuDNN 7.4.2

I'm getting this error:

  File "python/ray/_raylet.pyx", line 437, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 450, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 452, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 430, in ray._raylet.execute_task.function_executor
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 86, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 447, in __init__
    super().__init__(config, logger_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 172, in __init__
    self._setup(copy.deepcopy(self.config))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer.py", line 591, in _setup
    self._init(self.config, self.env_creator)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/trainer_template.py", line 112, in _init
    self.optimizer = make_policy_optimizer(self.workers, config)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/agents/ppo/ppo.py", line 95, in choose_policy_optimizer
    shuffle_sequences=config["shuffle_sequences"])
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_optimizer.py", line 120, in __init__
    self.per_device_batch_size, policy.copy))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 91, in __init__
    len(input_placeholders)))
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/optimizers/multi_gpu_impl.py", line 291, in _setup_device
    graph_obj = self.build_graph(device_input_slices)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/dynamic_tf_policy.py", line 256, in copy
    TFPolicy._initialize_loss(instance, loss, loss_inputs)
  File "/usr/local/lib/python3.6/dist-packages/ray/rllib/policy/tf_policy.py", line 231, in _initialize_loss
    self._sess.run(tf.global_variables_initializer())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Constraining by assigned device should not cause an error. Original root's assigned device name: /job:localhost/replica:0/task:0/device:GPU:0 node's assigned device name "/job:localhost/replica:0/task:0/device:GPU:1. Error: Cannot merge devices with incompatible ids: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:GPU:1'

Related: https://github.com/ray-project/ray/issues/7747

I'm not sure this actually worked. The substitution line added "num_gpus" at line 158, but it should have been added at line 159. If you add it at line 158, GPUs are not used at all.

If you use this reproduction script, you still get the errors reported in https://github.com/ray-project/ray/issues/7819:

#!/usr/bin/env bash

python3 -m venv .env
source .env/bin/activate
pip install --upgrade pip setuptools wheel
pip install tensorflow-gpu==2.1.0
# Install [rllib] dependencies
pip install ray==0.8.2
pip install ray[rllib]==0.8.2
pip install pandas
sed -i '159i \            "num_gpus": 2,' .env/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

sudo apt-get update && sudo apt-get install -y --no-install-recommends --allow-unauthenticated \
     cuda-command-line-tools-10-1 \
     cuda-cudart-dev-10-1 \
     cuda-cufft-dev-10-1 \
     cuda-curand-dev-10-1 \
     cuda-cusolver-dev-10-1 \
     cuda-cusparse-dev-10-1 \
     libcudnn7=7.6.2.24-1+cuda10.1 \
     libnccl2=2.4.7-1+cuda10.1 \
     libnccl-dev=2.4.7-1+cuda10.1 \
     && apt-get remove -y \
     libcublas10=10.1.0.105-1 \
     libcublas-dev=10.1.0.105-1

python .env/lib/python3.6/site-packages/ray/rllib/examples/custom_keras_rnn_model.py

Hi @felixs8696, sorry for the late reply. I am getting similar errors when running a custom RNN model and specifying more than one GPU in the training script.
Setup:
Python: 3.6.9
Ray 0.8.0.dev6
Tensorflow-gpu 2.0.0
CUDA 10.1
CuDNN 7.6.2.24

Error:
tensorflow.python.framework.errors_impl.InternalError: Constraining by assigned device should not cause an error. Original root's assigned device name: /job:localhost/replica:0/task:0/device:GPU:0 node's assigned device name "/job:localhost/replica:0/task:0/device:GPU:1. Error: Cannot merge devices with incompatible ids: '/job:localhost/replica:0/task:0/device:GPU:0' and '/job:localhost/replica:0/task:0/device:GPU:1'

I had assumed that this was due to the TF-gpu 2.0.0 compatibility issue with the CuDNN/CUDA driver versions. But I just updated to TF-gpu 2.1.0 and I'm getting the same error.

Seems like this is a RLlib issue.
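Until the underlying placement bug is fixed, one possible workaround (an assumption, not something suggested in this thread) is to expose only a single GPU to the process so the device-merge conflict cannot arise:

```python
# Possible workaround (an assumption, not from this thread): expose only
# one GPU to the process so the multi-GPU optimizer never tries to place
# ops on GPU:1. This must run before TensorFlow or Ray initialize CUDA.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```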

@ericl I'm having this exact problem too (same TF version, CUDA version, and cuDNN) over multiple machines. Please reopen the issue.

I have the same problem with:
Python: 3.6.6
Ray 0.8.5
Tensorflow-gpu 2.1.0
CUDA 10.0
CuDNN 7.6.5

Interestingly, with the same config I was able to instantiate a Trainer() and call trainer.train(). Using ray.tune threw the above error.
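The two launch paths being contrasted look roughly like this (a sketch; the commenter's actual config is not shown):

```python
# Sketch of the two launch paths contrasted above; the config contents
# are assumed, since the commenter's actual config is not shown.
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

ray.init()
config = {"env": "CartPole-v0", "num_gpus": 1}

# Path 1: instantiate the trainer directly; reported to work.
trainer = PPOTrainer(config=config)
trainer.train()

# Path 2: launch through tune; reported to throw the device-merge error.
tune.run("PPO", stop={"training_iteration": 1}, config=config)
```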

I just tried on a DLAMI machine + TF 2.2 + latest Ray, and custom_keras_model.py with num_gpus: 1 runs for me.

Perhaps the latest TF 2.2 fixes the issue?

Hey @ericl. Thanks for checking in on this issue. I think the problem is with using multiple GPUs. I can run a custom model with one GPU but not with more than one.

That works for me as well (with --run=PPO; DQN does not support multi-GPU).

This sounds like a similar bug that was fixed in the TF 2.2 nightly: https://github.com/tensorflow/tensorflow/issues/31318

> I'm not sure this actually worked. The substitution line added "num_gpus" at line 158, but it should have been added at line 159. If you add it at line 158, GPUs are not used at all.

@felixs8696 In the wheel at that time, line 158 was correct. Unfortunately, this repro script breaks because the content behind the latest-wheel link changes over time. Thanks for pointing this out; I'll try to write less brittle repro scripts next time.

Closing this for now; feel free to reopen if someone can reproduce with TF 2.2.
