Ray: [rllib] tensorflow 1.14 doesn't work with GPUs any longer

Created on 22 Jul 2020 · 7 comments · Source: ray-project/ray

### What is the problem?

Using a recent nightly build of Ray/RLlib, you can't train using GPUs with TensorFlow 1.14 due to an API mismatch.

rollout_worker.py assumes that tf.config has a list_physical_devices function, but in TF 1.14 only experimental_list_devices exists, so you get:

AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'

Here's the code in question in rollout_worker.py:

```python
if (ray.is_initialized() and
        ray.worker._mode() != ray.worker.LOCAL_MODE):
    # Check available number of GPUs
    if not ray.get_gpu_ids():
        logger.debug(
            "Creating policy evaluation worker {}".format(
                worker_index) +
            " on CPU (please ignore any CUDA init errors)")
    elif (policy_config["framework"] in ["tf2", "tf", "tfe"] and
          not tf.config.list_physical_devices("GPU")) or \
            (policy_config["framework"] == "torch" and
             not torch.cuda.is_available()):
        raise RuntimeError(
            "GPUs were assigned to this worker by Ray, but "
            "your DL framework ({}) reports GPU acceleration is "
            "disabled. This could be due to a bad CUDA- or {} "
            "installation.".format(
                policy_config["framework"],
                policy_config["framework"]))
```

vs the API in tensorflow/_api/v1/config/__init__.py:

```python
from tensorflow.python.eager.context import list_devices as experimental_list_devices
```
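For reference, a version-tolerant check can try the stable name and fall back to the experimental one that TF 1.14 does expose (a minimal sketch, not the RLlib code; see also the fix suggested in the comments below):

```python
import tensorflow as tf

# Sketch of a version-tolerant GPU check: newer TF releases expose the stable
# tf.config.list_physical_devices, while TF 1.14 only has the experimental API.
try:
    gpus = tf.config.list_physical_devices("GPU")
except AttributeError:
    gpus = tf.config.experimental.list_physical_devices("GPU")

print("GPUs visible to TensorFlow:", gpus)
```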

And here's the full stacktrace:

```
Failure # 1 (occurred at 2020-07-22_08-30-52)
Traceback (most recent call last):
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
result = self.trial_executor.fetch_result(trial)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/worker.py", line 1532, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::PPO.train() (pid=7711, ip=10.128.0.4)
File "python/ray/_raylet.pyx", line 433, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 468, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 426, in ray._raylet.execute_task.function_executor
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 88, in __init__
Trainer.__init__(self, config, env, logger_creator)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 475, in __init__
super().__init__(config, logger_creator)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/trainable.py", line 232, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 639, in setup
self._init(self.config, self.env_creator)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 102, in _init
env_creator, self._policy, config, self.config["num_workers"])
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 709, in _make_workers
logdir=self.logdir)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 67, in __init__
RolloutWorker, env_creator, policy, 0, self._local_config)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 296, in _make_worker
extra_python_environs=extra_python_environs)
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 415, in __init__
not tf.config.list_physical_devices("GPU")) or \
File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/tensorflow/python/util/deprecation_wrapper.py", line 106, in __getattr__
attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'
```


### Reproduction (REQUIRED)

- Ray: latest nightly wheel as of 2020-07-22
- TensorFlow: 1.14
- Python: 3.7
- OS: Ubuntu 20.04

```python
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer,
         config={
             "env": "CartPole-v0",
             "num_workers": 4,
             "num_envs_per_worker": 2,
             "num_gpus": 0.5,
             "num_gpus_per_worker": 0.1,
         })
```

Labels: P0, bug, rllib

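The underlying API mismatch can also be confirmed directly in a Python shell, independent of Ray (a minimal sketch, assuming TF 1.14 is the installed version):

```python
import tensorflow as tf

print(tf.__version__)  # expect 1.14.x

# The experimental name works in 1.14 (per the TF GPU guide):
print(tf.config.experimental.list_physical_devices("GPU"))

# The stable name raises the AttributeError reported above:
print(tf.config.list_physical_devices("GPU"))
```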
All 7 comments

Maybe the correct fix is to use tf.config.experimental.list_physical_devices('GPU')?
(as noted on https://www.tensorflow.org/guide/gpu )

That exists in both TF 1.14 and 2.2.

I guess I could submit a patch for this if that's easiest.

Thanks @andrew-rosenfeld-ts for filing this! Taking a look right now ...

Yeah, it's really just that one line. No worries, will PR right now ... (probably merged later today).

This PR fixes the issue:
https://github.com/ray-project/ray/pull/9681
Leaving this open until merged.

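For context, the suggested change amounts to calling the experimental API, which exists in both TF 1.14 and 2.x; here is a minimal sketch of a helper along those lines (an illustration only, not the PR's actual diff):

```python
import tensorflow as tf

# Hypothetical helper (not part of RLlib), shown only to illustrate the idea.
def tf_sees_gpu():
    """Return True if TensorFlow reports at least one physical GPU.

    Uses tf.config.experimental.list_physical_devices, which exists in both
    TF 1.14 and 2.x, avoiding the AttributeError raised by the stable name
    under 1.14.
    """
    return bool(tf.config.experimental.list_physical_devices("GPU"))
```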
@andrew-rosenfeld-ts

Btw, we are very close to setting up daily automatic GPU + "heavy" regression tests (Atari, MuJoCo) to catch these things much earlier than we do right now. Hopefully, this will eliminate issues like the GPU one here altogether.

Closing this now. Feel free to re-open if there are still problems on your end.
