System information
Describe the current behavior
I am using the CIFAR-10 ResNet example from the Keras examples directory, with the following line added at line 360 (just before compilation) to train on multiple GPUs. However, this doesn't work.
Line Added:
model = keras.utils.multi_gpu_model(model, gpus=2)
Traceback Error log:
Traceback (most recent call last):
File "cifar10_resnet_multigpu.py", line 360, in <module>
model = keras.utils.multi_gpu_model(model, gpus=2)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
outputs = model(inputs)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
output = self.call(inputs, **kwargs)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
device = _get_current_tf_device()
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
g._apply_device_functions(op)
File "/local/home/manasa/vpds2/conda/anaconda3/envs/tensorflow114/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
Describe the expected behavior
Previously this worked fine and resulted in faster training thanks to parallelization across GPUs.
Note: this works fine when the backend is TensorFlow 1.13, so this is a regression.
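A plain-Python illustration of why the traceback above ends in an AttributeError (the class and function names here are hypothetical stand-ins, not TensorFlow code): Keras's `_TfDeviceCaptureOp` is a small stub that mimics a TF `Operation` only well enough for older TF versions, and TF 1.14's `_apply_device_functions` started calling a new private method, `_set_device_from_string`, which the stub does not define.

```python
class OldStyleOp:
    """Stub implementing only the pre-1.14 device interface,
    analogous to Keras's _TfDeviceCaptureOp."""
    def _set_device(self, device):
        self.device = device

def apply_device_functions_tf113(op):
    # Older TF called the method the stub defines.
    op._set_device("/device:GPU:0")

def apply_device_functions_tf114(op):
    # TF 1.14 switched to a new private method the stub lacks.
    op._set_device_from_string("/device:GPU:0")

op = OldStyleOp()
apply_device_functions_tf113(op)  # works
try:
    apply_device_functions_tf114(op)
except AttributeError as err:
    print(err)  # 'OldStyleOp' object has no attribute '_set_device_from_string'
```

This is why the failure depends on the TF version but not on CUDA: the calling convention inside `ops.py` changed, and the Keras-side stub was never updated to match.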
Same here. The code runs with tf==1.12.0 but fails with tf==1.14.0, and I don't know why. The largest change on my side is that I moved from CUDA 9.0 to 10.0; nothing else changed.
@QtacierP It works with TF 1.13 and CUDA 10.0 for me; it's just TF 1.14 that's a problem.
I have the same problem now. Is downgrading currently the only way to solve this?
@derekhsu I can't really speak for the Keras maintainers, but I don't know of any other solution. Bugs like this in critical features such as multi-GPU training are a big problem for Keras.
Same here! Very disappointing. Solutions, please...
I received the same error message as above when using tf 1.14, but after downgrading to 1.12 as well as to 1.13 I am confronted with:
Traceback (most recent call last):
File "trainer_temp.py", line 226, in <module>
main()
File "trainer_temp.py", line 137, in main
model = multi_gpu_model(build.models['vae'], gpus=gpus)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/utils/multi_gpu_utils.py", line 227, in multi_gpu_model
outputs = model(inputs)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/base_layer.py", line 457, in __call__
output = self.call(inputs, **kwargs)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 564, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/engine/network.py", line 721, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 1858, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 291, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 266, in _is_current_explicit_device
device = _get_current_tf_device()
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 247, in _get_current_tf_device
g._apply_device_functions(op)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 4261, in _apply_device_functions
op._set_device(device_spec.function(op))
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 314, in _device_function
current_device = DeviceSpec.from_string(node_def.device or "")
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 232, in from_string
return DeviceSpec().parse_from_string(spec)
File "/home/fh2-project-devel/pm7014/.virtualenvs/lala/lib/python3.6/site-packages/tensorflow/python/framework/device.py", line 150, in parse_from_string
splits = [x.split(":") for x in spec.split("/")]
AttributeError: 'DeviceSpec' object has no attribute 'split'
Any suggestions whether this is caused by the same issue or if I might have another problem?
Same problem here. Log:
Traceback (most recent call last):
File "phase_retrieval_-108_gan.py", line 39, in <module>
discriminator = multi_gpu_model( discriminator, gpus=2 )
File "/usr/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py", line 230, in multi_gpu_model
outputs = model(inputs)
File "/usr/lib/python3.7/site-packages/keras/engine/base_layer.py", line 451, in __call__
output = self.call(inputs, **kwargs)
File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 570, in call
output_tensors, _, _ = self.run_internal_graph(inputs, masks)
File "/usr/lib/python3.7/site-packages/keras/engine/network.py", line 727, in run_internal_graph
layer.call(computed_tensor, **kwargs))
File "/usr/lib/python3.7/site-packages/keras/layers/normalization.py", line 185, in call
epsilon=self.epsilon)
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 2053, in normalize_batch_in_training
if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 299, in _has_nchw_support
explicitly_on_cpu = _is_current_explicit_device('CPU')
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 272, in _is_current_explicit_device
device = _get_current_tf_device()
File "/usr/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py", line 252, in _get_current_tf_device
g._apply_device_functions(op)
File "/usr/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 4581, in _apply_device_functions
op._set_device_from_string(device_string)
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
I am trying to learn PyTorch, little by little... I don't know when they will fix this; it has been three months.
Same problem here too...
AttributeError Traceback (most recent call last)
10 opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
11 #opt = tf.compat.v1.train.experimental.enable_mixed_precision_graph_rewrite(opt,loss_scale='dynamic')
---> 12 parallel_model = multi_gpu_model(model, gpus=2)
13 #parallel_model.compile(loss='categorical_crossentropy',optimizer='rmsprop')
14 parallel_model.compile(optimizer=opt, loss=bce_dice_loss, metrics=[dice_coef])
/opt/conda/lib/python3.7/site-packages/keras/utils/multi_gpu_utils.py in multi_gpu_model(model, gpus, cpu_merge, cpu_relocation)
225 # Apply model on slice
226 # (creating a model replica on the target device).
--> 227 outputs = model(inputs)
228 outputs = to_list(outputs)
229
/opt/conda/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, *kwargs)
455 # Actually call the layer,
456 # collecting output(s), mask(s), and shape(s).
--> 457 output = self.call(inputs, *kwargs)
458 output_mask = self.compute_mask(inputs, previous_mask)
459
/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in call(self, inputs, mask)
562 return self._output_tensor_cache[cache_key]
563 else:
--> 564 output_tensors, _, _ = self.run_internal_graph(inputs, masks)
565 return output_tensors
566
/opt/conda/lib/python3.7/site-packages/keras/engine/network.py in run_internal_graph(self, inputs, masks)
719 kwargs['mask'] = computed_mask
720 output_tensors = to_list(
--> 721 layer.call(computed_tensor, **kwargs))
722 output_masks = layer.compute_mask(computed_tensor,
723 computed_mask)
/opt/conda/lib/python3.7/site-packages/keras/layers/normalization.py in call(self, inputs, training)
183 normed_training, mean, variance = K.normalize_batch_in_training(
184 inputs, self.gamma, self.beta, reduction_axes,
--> 185 epsilon=self.epsilon)
186
187 if K.backend() != 'cntk':
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in normalize_batch_in_training(x, gamma, beta, reduction_axes, epsilon)
1856 """
1857 if ndim(x) == 4 and list(reduction_axes) in [[0, 1, 2], [0, 2, 3]]:
-> 1858 if not _has_nchw_support() and list(reduction_axes) == [0, 2, 3]:
1859 return _broadcast_normalize_batch_in_training(x, gamma, beta,
1860 reduction_axes,
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _has_nchw_support()
289 bool: if the current scope device placement would support nchw
290 """
--> 291 explicitly_on_cpu = _is_current_explicit_device('CPU')
292 gpus_available = len(_get_available_gpus()) > 0
293 return (not explicitly_on_cpu and gpus_available)
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _is_current_explicit_device(device_type)
264 if device_type not in ['CPU', 'GPU']:
265 raise ValueError('device_type should be either "CPU" or "GPU".')
--> 266 device = _get_current_tf_device()
267 return (device is not None and device.device_type == device_type.upper())
268
/opt/conda/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py in _get_current_tf_device()
245 g = tf.get_default_graph()
246 op = _TfDeviceCaptureOp()
--> 247 g._apply_device_functions(op)
248 return op.device
249
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in _apply_device_functions(self, op)
4579 # strings, since identity checks are faster than equality checks.
4580 if device_string is not prior_device_string:
-> 4581 op._set_device_from_string(device_string)
4582 prior_device_string = device_string
4583 op._device_code_locations = self._snapshot_device_function_stack_metadata()
AttributeError: '_TfDeviceCaptureOp' object has no attribute '_set_device_from_string'
Yes, this is a TF 1.14 issue; see https://github.com/tensorflow/tensorflow/issues/30728
I found a workaround (which worked at least for my configuration): since parse_from_string treats the DeviceSpec object it receives as if it were a string, a simple fix is to add the following check in site-packages/tensorflow/python/framework/device.py, at the top of parse_from_string (around line 150):

if isinstance(spec, DeviceSpec):
    return spec

It might not be the optimal solution, but it finally allowed me to use all of my GPUs...
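A minimal, self-contained sketch of the guard pattern used in the workaround above, applied to a simplified stand-in class (not TensorFlow's real `DeviceSpec`, whose parser handles more fields): if the parser is handed an already-parsed object instead of a string, return it unchanged rather than calling string methods on it, which is exactly what triggers `'DeviceSpec' object has no attribute 'split'`.

```python
class DeviceSpec:
    """Simplified stand-in for TF's DeviceSpec, for illustration only."""
    def __init__(self):
        self.device_type = None
        self.device_index = None

    @classmethod
    def from_string(cls, spec):
        # Guard: a DeviceSpec passed in by mistake is returned as-is,
        # instead of falling through to spec.split(...) and raising
        # AttributeError.
        if isinstance(spec, cls):
            return spec
        return cls().parse_from_string(spec)

    def parse_from_string(self, spec):
        # e.g. "/device:GPU:0" -> device_type "GPU", device_index 0
        for part in spec.split("/"):
            fields = part.split(":")
            if len(fields) == 3 and fields[0] == "device":
                self.device_type = fields[1]
                self.device_index = int(fields[2])
        return self

parsed = DeviceSpec.from_string("/device:GPU:0")
print(parsed.device_type, parsed.device_index)   # GPU 0
print(DeviceSpec.from_string(parsed) is parsed)  # True
```

Note that patching site-packages by hand is fragile (it is lost on reinstall and masks the type confusion rather than fixing it), which is why the version pin below was the more common route.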
@ju-he Thanks for the post, but I can't find that line of code. I don't have anything on line 150, just some comments...
I have this:

if isinstance(spec, MergeDevice):
    return spec

but I never found parse_from_string in the file. I don't understand what you changed...
Thanks.
@TheStoneMX Which tf version are you using? I downgraded from 1.14 to 1.12 and then to 1.10, since some suggested that, but I still had the issues described above. Maybe it's due to my specific configuration. But do I understand you correctly that the isinstance check returning spec (that's the only thing I added) is already there in your version? Then apparently this bug has already been fixed, just not in the version I was using.
@ju-he I am using 1.14 and still can't use multiple GPUs. I am thinking of learning PyTorch... it has been too long and they are not fixing this bug.
@TheStoneMX Have you tried switching to tf 1.12 or even 1.10? That, together with the bugfix I posted above, should work fine.
Hi @ju-he
Thanks for the reply, but I haven't been able to: I am using conda, and every time I install Keras it reinstalls 1.14...
Do you know how I can do it?
I thought 1.13 didn't have this problem, because everything was working before.
Hi @TheStoneMX
I stopped using conda a while ago, but I think you can pin the version of a specific package via conda install package=version
I hadn't used conda for a while, then started using it again. I am switching to see if I can make it work. Thanks, bro.
Hi, same issue here when trying to use multi-GPU training with Keras.
I had to fall back to tensorflow-gpu 1.13.2 to make it work :(
Any news on this? Hope things get fixed soon :+1:
Hi @ju-he, I got it working by removing Anaconda, using pip3 instead, and installing tensorflow-gpu 1.13.2.
I have the same problem when calling:

with device('/gpu:0' if use_GPU else '/cpu:0'):
    ...  # portion of code
tensorflow-gpu 1.14 has disappointed me as well. I consider 1.13.2 the last reliable version.
Just importing it causes incompatibilities:
- for example with numpy,
- with management of GPUs / CPUs.
Many things have changed package paths with no backwards compatibility, for example:
- the package path to TocoConverter/TFLiteConverter,
- the package path to set_image_dim_ordering,
- many other places get warnings such as "tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead".
I believe 1.14 is currently more similar to TensorFlow 2 than to TensorFlow 1; why else would there be an explicit need to move package paths under "v1"?
Please consider backwards compatibility for 1.x.x versions as long as the version still starts with 1.
I, too, got it working by installing a pip3 environment separate from my Anaconda environment.
- pip3 for python 3.7.5
- tensorflow-gpu v. 1.14.0
- keras 2.3.1
- cuda 10.0 libraries only into /usr/local/cuda-10.0 in addition to cuda 10.1 + drivers that were previously installed
Just upgrade TF to 1.15; it works for me.
Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4
Error is triggered for me on Tensorflow-GPU 1.15 with Keras 2.2.4
keras 2.3.1 pypi_0 pypi
tensorboard 1.15.0 pypi_0 pypi
tensorflow-estimator 1.15.1 pypi_0 pypi
tensorflow-gpu 1.15.0 pypi_0 pypi
Tested on my computer:
Ubuntu 19.10, GTX 1080 Ti SLI, Python 3.7, CUDA 10.1
keras 2.2.4
tensorflow 1.13.1
It works for me.