With TensorFlow 1.12 and multi_gpu_model, the number of GPUs needs to be specified explicitly; otherwise one gets an error.
Consider the following minimal example:
from keras import Model, Input
from keras.layers import Dense
from keras.utils import multi_gpu_model
import os
import tensorflow as tf

os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"  # 2 gpus enabled

# dummy model
x = Input(shape=(4,))
layer = Dense(2, activation='relu')(x)
y = Dense(1)(layer)

with tf.device('/cpu:0'):
    model = Model(inputs=x, outputs=y)

parallel_model = multi_gpu_model(model)
one gets the following error:
Traceback (most recent call last):
  File "/home/darte/dereverb/todelete.py", line 16, in <module>
    parallel_model = multi_gpu_model(model)
  File "/home/darte/.local/lib/python3.5/site-packages/keras/utils/multi_gpu_utils.py", line 181, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=3`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2']. However this machine only has: ['/cpu:0', '/xla_gpu:0', '/xla_cpu:0', '/gpu:0', '/gpu:1']. Try reducing `gpus`.
Replacing parallel_model = multi_gpu_model(model) with parallel_model = multi_gpu_model(model, gpus=2) makes the model work fine.
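For completeness, a minimal sketch of the working call, continuing the example above (the dummy training data and the compile settings here are illustrative, not from the original report):
import numpy as np

parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer='adam', loss='mse')
# dummy data matching the Input(shape=(4,)) above
parallel_model.fit(np.random.rand(64, 4), np.random.rand(64, 1), epochs=1, batch_size=16)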
[X] Check that you are up-to-date with the master branch of Keras. You can update with:
pip install git+git://github.com/keras-team/keras.git --upgrade --no-deps
[X] Check that your version of TensorFlow is up-to-date. The installation instructions can be found here.
[X] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).
Thanks, @darteaga -- can you try the same using tf.keras? Does the number of GPUs get set correctly?
@karmel I tried, and with tf.keras the number of GPUs is a required argument, not optional. Namely, if I run the example above with keras replaced by tf.keras, I get:
TypeError: multi_gpu_model() missing 1 required positional argument: 'gpus'
Correct-- the expectation is that you are explicitly requesting GPUs, and the number will get checked against the available set. If you don't know ahead of time how many GPUs you have/want, you can use tf.keras.backend.get_session().list_devices() to check available devices.
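For example, a small sketch (assuming TF 1.x) that counts the plain GPU devices from that list and passes the count explicitly:
import tensorflow as tf

devices = tf.keras.backend.get_session().list_devices()
n_gpus = len([d for d in devices if d.device_type == 'GPU'])  # XLA_GPU entries are not counted
print('visible GPUs:', n_gpus)
# parallel_model = tf.keras.utils.multi_gpu_model(model, gpus=n_gpus)  # requires n_gpus >= 2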
It looks like Keras only sees one of the GPUs.
Make sure that all GPUs are accessible. You can use device_lib with TensorFlow.
You can check the full device list using the following code:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
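To see only the plain GPU devices, the ones multi_gpu_model actually checks against (a small sketch extending the snippet above; XLA entries are filtered out):
from tensorflow.python.client import device_lib

gpu_names = [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU']
print(gpu_names)  # e.g. ['/device:GPU:0', '/device:GPU:1']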
It seems that xla_gpu is ignored https://github.com/keras-team/keras/pull/9226#issuecomment-495415569
@wendingp To correct my previous statement: exactly 1 xla_cpu and 1 xla_gpu are visible to TensorFlow users when the engine is compiled with XLA enabled, and this number is not related to whether multiple physical GPUs are installed or not.
Using multiple physical GPUs as different XLA devices simultaneously is not supported by TensorFlow at the moment, so training with multiple traditional GPU devices is still the only choice.
Solved with the following (my mistake: the old code had 'GPU': 2, from before I installed the third board :-( ):
import tensorflow as tf
import keras

config = tf.ConfigProto(device_count={'GPU': 3, 'CPU': 12})
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
session = tf.InteractiveSession(config=config)
keras.backend.set_session(session)
The error was that (since xla_gpu devices are ignored) Keras saw only 2 GPUs instead of 3. At the line
model_gpu = multi_gpu_model(model, gpus=3)
the error is:
ValueError: To call `multi_gpu_model` with `gpus=3`,
we expect the following devices to be available:
['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2'].
However this machine only has:
['/cpu:0', '/cpu:1', '/cpu:2', '/cpu:3', '/cpu:4', '/cpu:5', '/cpu:6', '/cpu:7', '/cpu:8', '/cpu:9', '/cpu:10', '/cpu:11',
'/xla_gpu:0', '/xla_gpu:1', '/xla_gpu:2',
'/xla_cpu:0',
'/gpu:0', '/gpu:1']
Try reducing `gpus`.
print( "devices are = ", tf.keras.backend.get_session().list_devices())
devices are = [
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 3450774613774870734),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:1, CPU, 268435456, 10924024491853209621),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:2, CPU, 268435456, 11077961182932239483),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:3, CPU, 268435456, 3361844039556185648),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:4, CPU, 268435456, 10891319530738938282),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:5, CPU, 268435456, 9919963760930538434),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:6, CPU, 268435456, 12291411013128890395),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:7, CPU, 268435456, 17874787863665771808),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:8, CPU, 268435456, 8574429556929948786),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:9, CPU, 268435456, 13187484019828111478),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:10, CPU, 268435456, 18268447936190623343),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:11, CPU, 268435456, 14930379752775399022),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 3048284492670690888),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:1, XLA_GPU, 17179869184, 6488159888065492359),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:2, XLA_GPU, 17179869184, 15466543791014750049),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 12709901572195720999),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 7826230477, 18318922565637584313),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:1, GPU, 7842168832, 10861403049335112838)]
nvidia-smi reports 3 GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:01:00.0 Off | N/A |
| 41% 36C P2 25W / 225W | 129MiB / 7982MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2060 Off | 00000000:04:00.0 Off | N/A |
| 4% 46C P2 30W / 170W | 101MiB / 5934MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 2070 Off | 00000000:08:00.0 Off | N/A |
| 0% 39C P2 44W / 175W | 113MiB / 7982MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 11283 C python3 113MiB |
| 1 11283 C python3 85MiB |
| 2 11283 C python3 97MiB |
+-----------------------------------------------------------------------------+
Read the solution in my previous message. Now I am going to give you a second step.
Keras multi-GPU may fail to allocate the memory of the GPU boards:
multi_gpu_model(model, gpus=3)
The allocation requests may come in an order that spreads allocations across all GPUs, and in the end a big request may come and hit a CUDA out-of-memory error.
In this case I deactivated any call to multi_gpu_model, measured the memory allocation for each call, and manually re-placed all the Keras calls with
with tf.device("/gpu:1"):  # or 0 or 2 ...
until I found the right allocation scheme by hand (which Keras call goes on which GPU); a sketch of this manual placement is shown after this step.
And finally I am happy -- see below the final manual memory allocation of a super complex Keras GAN.
It is improbable that a system would automatically avoid CUDA out-of-memory errors without another AI that tries options the way I did manually (send a lot of small calls to the 6 GB GPU and keep my 8 GB board free for the bigger sharks).
Until Keras implements NVIDIA's unified memory scheme, keep reading this post (there is no chunking of a memory allocation across 2 GPUs -- you need to manually fit the contiguous space, inside one GPU's memory, needed by your tensors, so start measuring the memory allocation of each of your Keras calls).
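A minimal sketch of this manual placement pattern (the layer sizes and the GPU-to-block mapping are hypothetical, just to show how multi_gpu_model gets replaced by explicit tf.device blocks):
import tensorflow as tf
from keras import Model, Input
from keras.layers import Dense

inp = Input(shape=(128,))

with tf.device('/gpu:0'):  # biggest board gets the memory-heavy block
    h = Dense(4096, activation='relu')(inp)

with tf.device('/gpu:1'):  # smaller board gets a lighter block
    h = Dense(256, activation='relu')(h)

with tf.device('/gpu:2'):  # output head
    out = Dense(1)(h)

model = Model(inputs=inp, outputs=out)
model.compile(optimizer='adam', loss='mse')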
Step 3 - I tried to improve my manual solution, but one complex Keras command does not respond to
with tf.device("/gpu:2"):
and it remains on gpu:0. For this case I will physically swap the GPU boards at my next upgrade (GPU:0 will not be the best board :-( ) in order to work around Keras.
(I did not test software swap options - if they exist ...)
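One untested software-level possibility is to reorder the visible devices so that a different physical board appears as /gpu:0; this has to be set before TensorFlow initializes CUDA, and the ordering shown is just an example:
import os
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'  # make device indices follow the PCI bus order
os.environ['CUDA_VISIBLE_DEVICES'] = '2,0,1'    # physical GPU 2 then appears as /gpu:0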
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34 Driver Version: 430.34 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:01:00.0 Off | N/A |
| 41% 40C P2 25W / 225W | 4711MiB / 7982MiB | 19% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 2060 Off | 00000000:04:00.0 Off | N/A |
| 19% 54C P2 50W / 170W | 4303MiB / 5934MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 2070 Off | 00000000:08:00.0 Off | N/A |
| 0% 45C P2 59W / 175W | 7697MiB / 7982MiB | 79% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18367 C python3 4695MiB |
| 1 18367 C python3 4287MiB |
| 2 18367 C python3 7681MiB |
+-----------------------------------------------------------------------------+
@ghostplant so you mean I need to install tensorflow-gpu without XLA to use multiple physical GPUs? How to do it then?
@wendingp I think this is not related to inferring the number of GPUs, because it is still wrong even if you explicitly set the GPU count in the traditional way. The problem is mainly caused by inconsistent GPU device naming between xla_gpu and gpu, and this should not happen if TensorFlow is built without XLA enabled.
If you use the latest Keras, the auto-inferred GPU number should be 2 and it should work correctly, but it will not use the 3rd GPU because no /gpu:2 is found.
The device naming of TensorFlow changes between versions, so it is also possible to have the following combination on a 3-GPU host:
['/xla_gpu:0', '/xla_cpu:0', '/gpu:0', '/gpu:1', '/gpu:2']
where /xla_gpu:1 and /xla_gpu:2 don't exist.
So,
the simplest option is to try a TensorFlow build without XLA support,
or we need to wait for TensorFlow's GPU device naming to become standard and fixed in all cases,
or a much more complex detection logic for TensorFlow device naming in all cases needs to be added to the implementation of keras multi_gpu_model (see the sketch below).
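A rough sketch of what such detection could look like (illustrative only, not the actual Keras code): normalize the device names and keep just the plain GPU entries, whether or not XLA devices are also listed.
import tensorflow as tf

def available_gpu_names():
    names = [d.name for d in tf.keras.backend.get_session().list_devices()]
    # '/job:localhost/replica:0/task:0/device:GPU:0' -> '/gpu:0'
    return ['/' + n.split('device:')[-1].lower() for n in names if 'device:GPU:' in n]

print(available_gpu_names())  # e.g. ['/gpu:0', '/gpu:1']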
@wendingp So how did you install the tensorflow package -- via pip install tensorflow-gpu==1.12.0 ?
@ghostplant previously I just used pip install tensorflow-gpu
@wendingp Yeah, the tensorflow prebuilt package seems to enable the XLA option by default since 1.13.x. If you fall back to 1.12.0, which is based on cuda-9.0, XLA is not enabled, but the CUDA driver might not match, so it will be a little annoying to change the driver environment.
It is weird that you actually have 3 physical GPUs, but can only see /device:XLA_GPU:[0, 1, 2] and /device:GPU:[0, 1] devices (no /device:GPU:2 found)?
This is my output of tensorflow devices on multiple GPU hosts:
['/device:CPU:0', '/device:XLA_GPU:0', '/device:XLA_GPU:1', '/device:XLA_GPU:2', '/device:XLA_GPU:3', '/device:XLA_GPU:4', '/device:XLA_GPU:5', '/device:XLA_GPU:6', '/device:XLA_GPU:7', '/device:XLA_CPU:0', '/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3', '/device:GPU:4', '/device:GPU:5', '/device:GPU:6', '/device:GPU:7']
So are you really NOT able to find /device:GPU:2 in your environment?