Keras: When will Keras support multi-GPU?

Created on 15 Sep 2017 · 14 comments · Source: keras-team/keras

With datasets becoming bigger and models becoming larger, using multiple GPUs is a solution to reduce training time. So, when will Keras support multi-GPU? Thanks!


All 14 comments

@freshmanfresh I have the same problem. Have you found a solution?

Hey guys!
Keras does support multi-GPU training in both data-parallel and model-parallel modes, but only with the TensorFlow backend.

  1. Data Parallelism
    For a larger dataset, look at data parallelism.
    This takes your minibatch and splits it into n chunks (n being the number of GPUs you have), runs the forward pass for each chunk on its own GPU, and then concatenates the prediction vectors on your CPU to compute the loss, etc. The speed-up is NOT linear, but there is a notable difference.
    Link: https://keras.io/utils/#multi_gpu_model

  2. Model Parallelism
    If you want to fit a bigger model across multiple GPUs, look at model parallelism. The idea is to place different parts of your model on each GPU, which you can do using TensorFlow's support for device scoping.
    Link: https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus
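To make the data-parallel idea in point 1 concrete, here is a backend-free Python sketch of the split/concatenate pattern (the names `split_batch`, `data_parallel_predict`, and the toy `forward` function are illustrative stand-ins, not part of the Keras API; in practice `multi_gpu_model` does this for you, with each chunk actually running on its own GPU):

```python
# Illustrative sketch of the data-parallel pattern: split a minibatch into
# n chunks, run a (stand-in) forward pass per "device", then concatenate
# the per-chunk predictions back on the CPU.

def split_batch(batch, n):
    """Split `batch` into n near-equal chunks (last chunk takes the remainder)."""
    size = (len(batch) + n - 1) // n  # ceiling division
    return [batch[i * size:(i + 1) * size] for i in range(n)]

def data_parallel_predict(batch, n_gpus, forward):
    chunks = split_batch(batch, n_gpus)
    # In the real utility each chunk runs on a separate GPU; here we run serially.
    per_gpu_preds = [forward(chunk) for chunk in chunks]
    # Concatenate the prediction "vectors" on the CPU.
    return [p for preds in per_gpu_preds for p in preds]

# Toy forward pass: double each input.
batch = [1, 2, 3, 4, 5, 6, 7, 8]
preds = data_parallel_predict(batch, n_gpus=4, forward=lambda xs: [2 * x for x in xs])
```

This is only the shape of the computation; the speed-up in the real setting comes from the chunks running concurrently, which is also why it is sub-linear (the concatenation and weight synchronization happen on one device).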

@akshaychawla Thanks a lot! I will try it later

@akshaychawla Hi, I ran into a problem when using `multi_gpu_model`. The error is `'module' object has no attribute 'multi_gpu_model'` when I run `from keras.utils import multi_gpu_model`. The Keras version I installed is 1.2.2. Any ideas about this issue?

@mingliking you should use the development version of Keras (from Github) to get access to this feature. It will be part of the next PyPI release (2.1.0).

@mingliking hey, you should also probably upgrade to Keras 2.0.8. Just run `pip install --upgrade keras`. The API is different, but there are a ton of new features. Note that this will not give you the development version.
If you're unwilling to install the dev version from GitHub, you can copy and paste the training utils file into your project directory and change the first four lines of imports. Then the following should work:
`from training_utils import multi_gpu_model`
I had to do something similar, as the project I'm working on uses an older version of keras. You might also want to look at this link, which has a similar implementation of data parallelism, and the related article which explains it.

@fchollet
Not sure if I should open a new issue. Sorry if it doesn't belong here.

I installed keras from source and tried the multi_gpu_model on my Azure VM with 4 gpus. Here is the error I get:

To call `multi_gpu_model` with `gpus=4`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']. However this machine only has: ['/device:CPU:0', '/device:GPU:0', '/device:GPU:1', '/device:GPU:2', '/device:GPU:3']. Try reducing `gpus`.

@imsedim please open a new issue to track it. Include the full configuration of the VM you are running on. For now, the utility has only been tested on Unix systems, but it seems like extending that would be an easy fix.

The above issue is not about the platform. In TensorFlow you cannot compare devices by string comparison because, for example, '/device:GPU:0' and '/gpu:0' refer to the same device.

See: https://www.tensorflow.org/tutorials/using_gpu

Since the same code works on one system and doesn't on a different system (which I assume features a different OS), it is a platform issue. The fix will lie in normalizing device names.

It works by good luck: certain versions of TF use one name, while others use the other. Sure, you can call it a platform issue.
The name can be normalized with one line: `tf.DeviceSpec.from_string(name).to_string()`.

@ppwwyyxx that gives it in the format `/device:CPU:0`. Wouldn't it be better to retain the `/cpu:0` format, since we already use that for variable scopes?
It could be normalized to `/cpu:0` by running `device_name.replace("device:", "").lower()` for each device name.
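As a concrete illustration of the replace-based normalization suggested above (a plain-Python sketch; `normalize_device_name` is an illustrative helper, not a TensorFlow or Keras function — TensorFlow's own `tf.DeviceSpec.from_string(name).to_string()` normalizes to the long `/device:CPU:0` form instead):

```python
# Map TF's long device names like '/device:GPU:0' to the short '/gpu:0'
# form used in variable scopes, so string comparison of device names works.

def normalize_device_name(name):
    return name.replace('device:', '').lower()

names = ['/device:CPU:0', '/device:GPU:0', '/gpu:1']
short = [normalize_device_name(n) for n in names]
```

Already-short names like `/gpu:1` pass through unchanged, so the two naming conventions can be compared safely after normalization.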

Sure, you can use one way or the other. I just want to mention that `tf.DeviceSpec.from_string(name).to_string()` is what TensorFlow officially uses to normalize names.

I can see that somebody has already created an issue for this: #8213
