System information
Describe the current behavior
I am using the CIFAR-10 ResNet example from the Keras examples directory, with the following line added at line 360 (just before compilation) to train on multiple GPUs. However, this doesn't work.
Line Added:
model = keras.utils.multi_gpu_model(model, gpus=2)
Traceback Error log:
Traceback (most recent call last):
File "cifar10_resnet_multigpu.py", line 360, in <module>
model = keras.utils.multi_gpu_model(model, gpus=2)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
outputs = model(inputs)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
output = self.call(inputs, **kwargs)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
device = _get_current_tf_device()
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
g._apply_device_functions(op)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
Describe the expected behavior
Previously this worked fine and resulted in faster training thanks to parallelization across GPUs.
Note: this works fine when the backend is TensorFlow 1.13, so this is a regression.
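A plain-Python illustration of why the traceback above ends in an AttributeError (the class and function names here are hypothetical stand-ins, not TensorFlow code): Keras's `_TfDeviceCaptureOp` is a small stub that mimics a TF `Operation` only well enough for older TF versions, and TF 1.14's `_apply_device_functions` started calling a new private method, `_set_device_from_string`, which the stub does not define.

```python
class OldStyleOp:
    """Stub implementing only the pre-1.14 device interface,
    analogous to Keras's _TfDeviceCaptureOp."""
    def _set_device(self, device):
        self.device = device

def apply_device_functions_tf113(op):
    # Older TF called the method the stub defines.
    op._set_device("/device:GPU:0")

def apply_device_functions_tf114(op):
    # TF 1.14 switched to a new private method the stub lacks.
    op._set_device_from_string("/device:GPU:0")

op = OldStyleOp()
apply_device_functions_tf113(op)  # works
try:
    apply_device_functions_tf114(op)
except AttributeError as err:
    print(err)  # 'OldStyleOp' object has no attribute '_set_device_from_string'
```

This is why the failure depends on the TF version but not on CUDA: the calling convention inside `ops.py` changed, and the Keras-side stub was never updated to match.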
Same here. The code runs with tf==1.12.0 but fails with tf==1.14.0, and I don't know why. The largest change on my side is that I moved from CUDA 9.0 to 10.0; nothing else changed.
@QtacierP It works with TF 1.13 and CUDA 10.0 for me; it's just TF 1.14 that's a problem.
I have the same problem now. Is downgrading currently the only way to solve this?
@derekhsu I can't really speak for the Keras maintainers, but I don't know of any other solution. Bugs like this in critical features such as multi-GPU training are a big problem for Keras.
Same here! Very disappointing. Solutions, please...
I received the same error message as above when using tf 1.14, but after downgrading to 1.12 as well as to 1.13 I am confronted with:
Traceback (most recent call last):
File "trainer_temp.py", line 226, in <module>
main()
File "trainer_temp.py", line 137, in main
model = multi_gpu_model(build.models['vae'], gpus=gpus)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 227, in multi_gpu_model
outputs = model(inputs)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1858, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
device = _get_current_tf_device()
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
g._apply_device_functions(op)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4261, in _apply_device_functions
op._set_device(device_spec.function(op))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 314, in _device_function
current_device = DeviceSpec.from_string(node_def.device or "")
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 232, in from_string
return DeviceSpec().parse_from_string(spec)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 150, in parse_from_string
splits = [x.split(":") for x in spec.split("/")]
AttributeError: 'DeviceSpec' object has no attribute 'split'
Any suggestions whether this is caused by the same issue or if I might have another problem?
Same problem here. Log:
Traceback (most recent call last):
File "phase_retrieval_-108_gan.py", line 39, in <module>
discriminator = multi_gpu_model( discriminator, gpus=2 )
File "/usr/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
outputs = model(inputs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
output = self.call(inputs, **kwargs)
File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/usr/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
device = _get_current_tf_device()
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
g._apply_device_functions(op)
File "/usr/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
I am trying to learn PyTorch, little by little... I don't know when they will fix this; it has been three months.
Same problem here too...
AttributeError Traceback (most recent call last)
10 opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
11 #opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
---> 12 parallel_model = multi_gpu_model(model, gpus=2)
13 #parallel_model.compile(loss='categorical_crossentropy',optimizer='rmsprop')
14 parallel_model.compile(optimizer=opt, loss=bce_dice_loss, metrics=[dice_coef])
/opt/conda/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py in multi_gpu_model(model, gpus, cpu_merge, cpu_relocation)
225 # Apply model on slice
226 # (creating a model replica on the target device).
--> 227 outputs = model(inputs)
228 outputs = to_list(outputs)
229
/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, *kwargs)
455 # Actually call the layer,
456 # collecting output(s), mask(s), and shape(s).
--> 457 output = self.call(inputs, *kwargs)
458 output_mask = self.compute_mask(inputs, previous_mask)
459
/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in call(self, inputs, mask)
562 return self._output_tensor_cache[cache_key]
563 else:
--> 564 output_tensors, _, _ = self.run_internal_graph(inputs, masks)
565 return output_tensors
566
/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in run_internal_graph(self, inputs, masks)
719 kwargs['mask'] = computed_mask
720 output_tensors = to_list(
--> 721 layer.call(computed_tensor, **kwargs))
722 output_masks = layer.compute_mask(computed_tensor,
723 computed_mask)
/opt/conda/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
183 normed_training, mean, variance = K.normalize_batch_in_training(
184 inputs, self.gamma, self.beta, reduction_axes,
--> 185 epsilon=self.epsilon)
186
187 if K.backend() != 'cntk':
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in normalize_batch_in_training(x, gamma, beta, reduction_axes, epsilon)
1856 """
1857 if ndim(x) == 4 and list(reduction_axes) in [[0, 1, 2], [0, 2, 3]]:
-> 1858 if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
1859 return _broadcast_normalize_batch_in_training(x, gamma, beta,
1860 reduction_axes,
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _has_nchw_support()
289 bool: if the current scope device placement would support nchw
290 """
--> 291 explicitly_on_cpu = _is_current_explicit_device('CPU')
292 gpus_available = len(_get_available_gpus()) > 0
293 return (not explicitly_on_cpu and gpus_available)
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _is_current_explicit_device(device_type)
264 if device_type not in ['CPU', 'GPU']:
265 raise ValueError('device_type should be either "CPU" or "GPU".')
--> 266 device = _get_current_tf_device()
267 return (device is not None and device.device_type == device_type.upper())
268
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _get_current_tf_device()
245 g = tf.get_default_graph()
246 op = _TfDeviceCaptureOp()
--> 247 g._apply_device_functions(op)
248 return op.device
249
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _apply_device_functions(self, op)
4579 # strings, since identity checks are faster than equality checks.
4580 if device_string is not prior_device_string:
-> 4581 op._set_device_from_string(device_string)
4582 prior_device_string = device_string
4583 op._device_code_locations = self._snapshot_device_function_stack_metadata()
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
Yes, this is a TF 1.14 issue; see https://github.com/tensorflow/tensorflow/issues/30728
I found a workaround (which worked at least for my configuration): since parse_from_string treats the DeviceSpec object it receives as if it were a string, a simple fix is to add the following check in site-packages/tensorflow/python/framework/device.py, at the top of parse_from_string (around line 150):

if isinstance(spec, DeviceSpec):
    return spec

It might not be the optimal solution, but it finally allowed me to use all of my GPUs...
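A minimal, self-contained sketch of the guard pattern used in the workaround above, applied to a simplified stand-in class (not TensorFlow's real `DeviceSpec`, whose parser handles more fields): if the parser is handed an already-parsed object instead of a string, return it unchanged rather than calling string methods on it, which is exactly what triggers `'DeviceSpec' object has no attribute 'split'`.

```python
class DeviceSpec:
    """Simplified stand-in for TF's DeviceSpec, for illustration only."""
    def __init__(self):
        self.device_type = None
        self.device_index = None

    @classmethod
    def from_string(cls, spec):
        # Guard: a DeviceSpec passed in by mistake is returned as-is,
        # instead of falling through to spec.split(...) and raising
        # AttributeError.
        if isinstance(spec, cls):
            return spec
        return cls().parse_from_string(spec)

    def parse_from_string(self, spec):
        # e.g. "/device:GPU:0" -> device_type "GPU", device_index 0
        for part in spec.split("/"):
            fields = part.split(":")
            if len(fields) == 3 and fields[0] == "device":
                self.device_type = fields[1]
                self.device_index = int(fields[2])
        return self

parsed = DeviceSpec.from_string("/device:GPU:0")
print(parsed.device_type, parsed.device_index)   # GPU 0
print(DeviceSpec.from_string(parsed) is parsed)  # True
```

Note that patching site-packages by hand is fragile (it is lost on reinstall and masks the type confusion rather than fixing it), which is why the version pin below was the more common route.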
@ju-he Thanks for the post, but I can't find that line of code. I don't have anything on line 150, just some comments...
I have this:

if isinstance(spec, MergeDevice):
    return spec

but I never found parse_from_string in the file. I don't understand what you changed...
Thanks.
@TheStoneMX Which tf version are you using? I downgraded from 1.14 to 1.12 and then to 1.10, since some suggested that, but I still had the issues described above. Maybe it's due to my specific configuration. But do I understand you correctly that the isinstance check returning spec (that's the only thing I added) is already there in your version? Then apparently this bug has already been fixed, just not in the version I was using.
@ju-he I am using 1.14 and still can't use multiple GPUs. I am thinking of learning PyTorch... it has been too long and they are not fixing this bug.
@TheStoneMX Have you tried switching to tf 1.12 or even 1.10? That, together with the bugfix I posted above, should work fine.
Hi @ju-he
Thanks for the reply, but I haven't been able to: I am using conda, and every time I install Keras it reinstalls 1.14...
Do you know how I can do it?
I thought 1.13 didn't have this problem, because everything was working before.
Hi @TheStoneMX
I stopped using conda a while ago, but I think you can pin the version of a specific package via conda install package=version
I hadn't used conda for a while, then started using it again. I am switching to see if I can make it work. Thanks, bro.
Hi, same issue here when trying to use multi-GPU training with Keras.
I had to fall back to tensorflow-gpu 1.13.2 to make it work :(
Any news on this? Hope things get fixed soon :+1:
Hi @ju-he, I got it working by removing Anaconda, using pip3 instead, and installing tensorflow-gpu 1.13.2.
I have the same problem when calling:

with device('/gpu:0' if use_GPU else '/cpu:0'):
    ...  # portion of code
tensorflow-gpu 1.14 has disappointed me as well. I consider 1.13.2 the last reliable version.
Just importing it causes incompatibilities:
- for example with numpy,
- with management of GPUs / CPUs.
Many things have changed package paths with no backwards compatibility, for example:
- the package path to TocoConverter/TFLiteConverter,
- the package path to set_image_dim_ordering,
- many other places get warnings such as "tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead".
I believe 1.14 is currently more similar to TensorFlow 2 than to TensorFlow 1; why else would there be an explicit need to move package paths under "v1"?
Please consider backwards compatibility for 1.x.x versions as long as the version still starts with 1.
I, too, got it working by installing a pip3 environment separate from my Anaconda environment.
- pip3 for python 3.7.5
- tensorflow-gpu v. 1.14.0
- keras 2.3.1
- cuda 10.0 libraries only into /usr/local/cuda-10.0 in addition to cuda 10.1 + drivers that were previously installed
Just upgrade TF to 1.15; it works for me.
Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4
Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4
keras 2.3.1 pypi_0 pypi
tensorboard 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
tensorflow-gpu 1.15.0 pypi_0 pypi
Tested on my computer:
Ubuntu 19.10, GTX 1080 Ti SLI, Python 3.7, CUDA 10.1
keras 2.2.4
tensorflow 1.13.1
It works for me.