Keras: When will Keras support multi-GPU?

Created on 15 Sep 2017 · 14 comments · Source: keras-team/keras

With datasets becoming bigger and models becoming larger, using multiple GPUs is a solution to reduce training time. So, when will Keras support multi-GPU? Thanks!


All 14 comments

@freshmanfresh I have the same problem. Have you found a solution?

Hey guys!
Keras does support multi-GPU training in both data-parallel and model-parallel modes, but only with the TensorFlow backend.

  1. Data Parallelism
    For a larger dataset, look at data parallelism.
    This takes your minibatch and splits it into n chunks (n being the number of GPUs you have), runs the forward pass for each chunk on its own GPU, and then concatenates the prediction vectors on your CPU to compute the loss, etc. The speed-up is NOT linear, but there is a notable difference.
    Link: https://keras.io/utils/#multi_gpu_model

  2. Model Parallelism
    If you want to fit a bigger model across multiple GPUs, look at model parallelism. The idea is to place different parts of your model on each GPU, which you can do using TensorFlow's support for device scoping.
    Link: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
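To make the data-parallel idea in point 1 concrete, here is a backend-free Python sketch of the split/concatenate pattern (the names `split_batch`, `data_parallel_predict`, and the toy `forward` function are illustrative stand-ins, not part of the Keras API; in practice `multi_gpu_model` does this for you, with each chunk actually running on its own GPU):

```python
# Illustrative sketch of the data-parallel pattern: split a minibatch into
# n chunks, run a (stand-in) forward pass per "device", then concatenate
# the per-chunk predictions back on the CPU.

def split_batch(batch, n):
    """Split `batch` into n near-equal chunks (last chunk takes the remainder)."""
    size = (len(batch) + n - 1) // n  # ceiling division
    return [batch[i * size:(i + 1) * size] for i in range(n)]

def data_parallel_predict(batch, n_gpus, forward):
    chunks = split_batch(batch, n_gpus)
    # In the real utility each chunk runs on a separate GPU; here we run serially.
    per_gpu_preds = [forward(chunk) for chunk in chunks]
    # Concatenate the prediction "vectors" on the CPU.
    return [p for preds in per_gpu_preds for p in preds]

# Toy forward pass: double each input.
batch = [1, 2, 3, 4, 5, 6, 7, 8]
preds = data_parallel_predict(batch, n_gpus=4, forward=lambda xs: [2 * x for x in xs])
```

This is only the shape of the computation; the speed-up in the real setting comes from the chunks running concurrently, which is also why it is sub-linear (the concatenation and weight synchronization happen on one device).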

@akshaychawla Thanks a lot! I will try it later

@akshaychawla Hi, I ran into a problem when using `multi_gpu_model`. The error is `'module' object has no attribute 'multi_gpu_model'` when I run `from keras.utils import multi_gpu_model`. The Keras version I installed is 1.2.2. Any ideas about this issue?

@mingliking you should use the development version of Keras (from Github) to get access to this feature. It will be part of the next PyPI release (2.1.0).

@mingliking hey, you should also probably upgrade to Keras 2.0.8. Just run `pip install --upgrade keras`. The API is different, but there are a ton of new features. Note that this will not give you the development version.
If you're unwilling to install the dev version from GitHub, you can copy and paste the training utils file into your project directory and change the first four lines of imports. Then the following should work:
`from training_utils import multi_gpu_model`
I had to do something similar, as the project I'm working on uses an older version of keras. You might also want to look at this link, which has a similar implementation of data parallelism, and the related article which explains it.

@fchollet
Not sure if I should open a new issue. Sorry if it doesn't belong here.

I installed keras from source and tried the multi_gpu_model on my Azure VM with 4 gpus. Here is the error I get:

To call `multi_gpu_model` with `gpus=4`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/device:CPU:0', '/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']. Try reducing `gpus`.

@imsedim please open a new issue to track it. Include the full configuration of the VM you are running on. For now, the utility has only been tested on Unix systems, but it seems like extending that would be an easy fix.

The above issue is not about the platform. In TensorFlow you cannot compare devices by string comparison because, for example, '/device:GPU:0' and '/gpu:0' refer to the same device.

See: https://www.tensorflow.org/tutorials/using_gpu

Since the same code works on one system and doesn't on a different system (which I assume features a different OS), it is a platform issue. The fix will lie in normalizing device names.

It works by good luck: certain versions of TF use one name, while others use the other. Sure, you can call it a platform issue.
The name can be normalized with one line: `tf.DeviceSpec.from_string(name).to_string()`.

@ppwwyyxx that gives it in the format `/device:CPU:0`. Wouldn't it be better to retain the `/cpu:0` format, since we already use that for variable scopes?
It could be normalized to `/cpu:0` by running `device_name.replace("device:", "").lower()` for each device name.
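As a concrete illustration of the replace-based normalization suggested above (a plain-Python sketch; `normalize_device_name` is an illustrative helper, not a TensorFlow or Keras function — TensorFlow's own `tf.DeviceSpec.from_string(name).to_string()` normalizes to the long `/device:CPU:0` form instead):

```python
# Map TF's long device names like '/device:GPU:0' to the short '/gpu:0'
# form used in variable scopes, so string comparison of device names works.

def normalize_device_name(name):
    return name.replace('device:', '').lower()

names = ['/device:CPU:0', '/device:GPU:0', '/gpu:1']
short = [normalize_device_name(n) for n in names]
```

Already-short names like `/gpu:1` pass through unchanged, so the two naming conventions can be compared safely after normalization.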

Sure, you can use one way or the other. I just want to mention that `tf.DeviceSpec.from_string(name).to_string()` is what TensorFlow officially uses to normalize names.

I can see that somebody has already created an issue for this: #8213
