Keras: Does Keras support using multiple GPUs?

Created on 21 Apr 2016 · 162 Comments · Source: keras-team/keras

Theano has supported multiple GPUs since v0.8.0.
(cf. Using multiple GPUs — Theano 0.8.0 documentation)
Does Keras also support using multiple GPUs?

For example, can I run the below task?

  1. Learn a sequential model A on gpu0
  2. Learn a sequential model B on gpu1
  3. Merge A and B on gpu0

Most helpful comment

Yes, you can run Keras models on multiple GPUs. This is only possible with the TensorFlow backend for the time being, because the Theano feature is still rather new. We are looking at adding support for multi-GPU in Theano in the near future (it should be fairly straightforward).

With the TensorFlow backend, you can achieve this the same way as you would in pure TensorFlow: by using the with tf.device(d) scope when defining Keras layers.

All 162 comments

Yes, you can run Keras models on multiple GPUs. This is only possible with the TensorFlow backend for the time being, because the Theano feature is still rather new. We are looking at adding support for multi-GPU in Theano in the near future (it should be fairly straightforward).

With the TensorFlow backend, you can achieve this the same way as you would in pure TensorFlow: by using the with tf.device(d) scope when defining Keras layers.
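For illustration, a minimal sketch of that kind of device placement (the layer sizes and shapes below are made up, and this uses the Keras 1 functional API from the era of this thread):

import tensorflow as tf
from keras.layers import Input, Dense, merge
from keras.models import Model

inp = Input(shape=(784,))

# Define one branch of the graph on each GPU; TensorFlow places those ops there.
with tf.device('/gpu:0'):
    branch_a = Dense(256, activation='relu')(inp)

with tf.device('/gpu:1'):
    branch_b = Dense(256, activation='relu')(inp)

# Bring the two branches back together on one device.
with tf.device('/cpu:0'):
    joined = merge([branch_a, branch_b], mode='concat')
    out = Dense(10, activation='softmax')(joined)

model = Model(input=inp, output=out)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')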

We are looking at adding support for multi-gpu in Theano in the near future (it should be fairly straightforward).

I'm looking forward to it 😃
Thank you.

tf.device() scope?
Can you expand on this?
I haven't seen it in the API.

Any example to use multiple gpus with TF?

Hm. Theano has libgpuarray, which allows one to push shared variables to different devices. This will not do all the work for you of recombining weight matrices but with a little effort you could use multiple GPUs.

There is Platoon, a project on top of Theano for data parallelism. It should be easy to use. We currently focus more on data parallelism than model parallelism in Theano. But both are possible.

Fred

I have looked into Platoon and it seemed like it was pretty much compatible
with Keras out of the box except for a couple lines of code. Easy to adapt,
in any case...


The way libgpuarray works is by mapping variables to different GPUs; the function then automatically generates code to transfer data between GPUs as needed.


What's the priority of adding multi GPU support for the theano backend?

I think it would expand the user base for Keras. I have several Titan X cards in the same box. Please take a look at libgpuarray as well.

How does this actually work in TensorFlow? There is a brief tutorial here: http://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html. I understand the concept of running the model replicas on separate GPU devices and then merging the weights, but how do we actually run this? Instead of model.fit, do we call merged.fit on the result of the merged models?

@tetmin I have the same confusion as you. Although the blog shows how to run prediction on different GPUs, it is still unclear how to train the same model across different GPUs on a single machine, i.e. I need data parallelism and don't know how to implement it in Keras with TensorFlow as the backend.

Agreed with @pengpaiSH and @tetmin. Hope there will be more details.

@rudaoshi Well, I know this may not be proper to suggest since we are in the Keras community, and personally I am a Big Big Big fan of Keras! We know TensorFlow can utilize multiple GPUs by averaging gradients across different devices; however, I am hoping Keras could provide a simple and unified API (in Keras's style) that lets me focus on the big picture and hides those I/O and parallel-computing details. For the time being, in order to make good use of multiple GPUs, I am doing my deep learning programs with MXNet, where I only specify the GPU IDs and the lib does everything it needs under the hood.

@fchollet I saw your blog post on multi-GPU training, thanks for pointing out the way to do it, but I would really appreciate it if, say, model.fit() had a gpus=n option. I'm willing to implement my own version of that; may I ask for suggestions? Or I'm willing to contribute to multi-GPU training within Keras, with more abstraction from end users. Thanks in advance!

@WenchenLi +1, gpus=0,1,2... is exactly what I need!

@WenchenLi did you create a PR for multigpu abstraction?

Hope someone can contribute multi-GPU training within Keras. Thanks in advance.

I have two GPUs. I did not do anything to set which GPU would be used for training, but when I used nvidia-smi to check memory, I found almost all of the memory on both GPUs was in use. I thought only one GPU would be used.

@anewlearner apparently this is the intended functionality of TF.
Use export CUDA_VISIBLE_DEVICES="0".

See https://github.com/tensorflow/tensorflow/issues/5066 for details

Looking forward to a simplified version of multi-GPU :)
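For reference, a minimal sketch of doing the same thing from inside a Python script rather than the shell (the device id "0" is just an example; the variable must be set before TensorFlow is imported):

import os

# Make only GPU 0 visible to TensorFlow; must happen before importing tensorflow/keras.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from keras import backend as K  # imported after setting the environment variable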

For data parallelization in keras, you can use this approach:

import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Input, merge
from keras.layers.core import Lambda

def slice_batch(x, n_gpus, part):
    sh = K.shape(x)
    L = sh[0] / n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]

def to_multi_gpu(model, n_gpus=2):
    with tf.device('/cpu:0'):
        x = Input(model.input_shape[1:], name=model.input_names[0])

    towers = []
    for g in range(n_gpus):
        with tf.device('/gpu:' + str(g)):
            slice_g = Lambda(slice_batch, lambda shape: shape,
                             arguments={'n_gpus': n_gpus, 'part': g})(x)
            towers.append(model(slice_g))

        with tf.device('/cpu:0'):
            merged = merge(towers, mode='concat', concat_axis=0)

    return Model(input=[x], output=merged)

To use just take any model and set model = to_multi_gpu(model).

model.fit() and model.predict() should work without any change.


@jonilaserson , looks great! Does this work with the Theano backend or only TF?

@jonilaserson Could you please provide more detailed comments for the code?
For example, what is the purpose of slice_g? And what do the towers actually do? Thank you!

I tested the code provided by @jonilaserson and got an error.
merged = merge(towers, mode='concat', concat_axis=0)
Exception: A Merge should only be applied to a list of layers with at least 2 elements. Found: [<keras.engine.training.Model object at 0x7f9c1c3123d0>]

@anewlearner Have you solved the problem that you met with before?

@carol

There was an indentation error in the code I posted.
The [with tf.device('/cpu:0')] paragraph should be outside the loop.

Here is a piece of code that should work:

import tensorflow as tf
from keras import backend as K
from keras.models import Model
from keras.layers import Input, merge
from keras.layers.core import Lambda

def slice_batch(x, n_gpus, part):
    """
    Divide the input batch into [n_gpus] slices, and obtain slice no. [part].
    i.e. if len(x)=10, then slice_batch(x, 2, 1) will return x[5:].
    """
    sh = K.shape(x)
    L = sh[0] / n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]

def to_multi_gpu(model, n_gpus=2):
    """Given a keras [model], return an equivalent model which parallelizes
    the computation over [n_gpus] GPUs.

    Each GPU gets a slice of the input batch, applies the model on that slice
    and later the outputs of the models are concatenated to a single tensor,
    hence the user sees a model that behaves the same as the original.
    """
    with tf.device('/cpu:0'):
        x = Input(model.input_shape[1:], name=model.input_names[0])

    towers = []
    for g in range(n_gpus):
        with tf.device('/gpu:' + str(g)):
            slice_g = Lambda(slice_batch, lambda shape: shape,
                             arguments={'n_gpus': n_gpus, 'part': g})(x)
            towers.append(model(slice_g))

    with tf.device('/cpu:0'):
        merged = merge(towers, mode='concat', concat_axis=0)

    return Model(input=[x], output=merged)

To use just take any model and set model = to_multi_gpu(model).

model.fit() and model.predict() should work without any change.

Example:

from keras.layers.convolutional import Convolution2D
from keras.layers.core import Activation
import numpy as np

def get_model():
    x = Input((96, 96, 1), name="input1")
    output = Convolution2D(64, 5, 5, border_mode='same', name="conv1")(x)
    output = Activation('relu', name="relu1")(output)
    # [More layers...]
    model = Model(input=x, output=output)
    model.compile(optimizer='rmsprop', loss='mse')
    return model

model = get_model()
model = to_multi_gpu(model)

x = np.random.rand(1000, 96, 96, 1)
y = model.predict(x, verbose=True)


@jonilaserson Thank you for the update! Would you please comment on the code snippet below?

 for g in range(n_gpus):
        with tf.device('/gpu:' + str(g)):
            slice_g = Lambda(slice_batch, lambda shape: shape, 
                            arguments={'n_gpus':n_gpus, 'part':g})(x)
            towers.append(model(slice_g))

@jonilaserson
Thanks for sharing your code. It works. :)
I tested the code to compare the time cost between one GPU and two GPUs.
When I used two GPUs (of the same type), the speedup was smaller than expected. Does the switching between CPU and GPU affect the speed?
My test result is as follows.

Two gpus

97650/682307 [===>..........................] - ETA: 1933s - loss: 0.3320 - acc: 0.8320
188593/682307 [=======>......................] - ETA: 1654s - loss: 0.2354 - acc: 0.8904
279093/682307 [===========>..................] - ETA: 1348s - loss: 0.1936 - acc: 0.9140

One gpu

97650/682307 [===>..........................] - ETA: 2669s - loss: 0.3488 - acc: 0.8266
188593/682307 [=======>......................] - ETA: 2239s - loss: 0.2431 - acc: 0.8880
279093/682307 [===========>..................] - ETA: 1844s - loss: 0.2004 - acc: 0.9116

I think you should compile the model, otherwise you get an error: you must compile the model before training/testing.

@jonilaserson
Thanks for your code. But I got the error.

Traceback (most recent call last):
  File "mgpumain.py", line 329, in <module>
    main()
  File "mgpumain.py", line 165, in main
    train_val_test()
  File "mgpumain.py", line 315, in train_val_test
    train_model_runner(n)
  File "mgpumain.py", line 232, in train_model_runner
    model = to_multi_gpu(model,4)
  File "mgpumain.py", line 200, in to_multi_gpu
    slice_g = Lambda(slice_batch, lambda shape: shape, arguments={'n_gpus':n_gpus, 'part':g})(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 514, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 149, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/core.py", line 556, in call
    return self.function(x, **arguments)
  File "mgpumain.py", line 186, in slice_batch
    return x[part*L:(part+1)*L]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 383, in _SliceHelper
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 537, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2750, in strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 573, in apply_op
    _Attr(op_def, input_arg.type_attr))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
    ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: DataType float64 for attr 'Index' not in list of allowed values: int32, int64

@anewlearner
I received better speedups the larger the batches were, but yes, there must be some overhead concatenating the outputs on the CPU. I assume we can get additional speedup if the computation would be kept on the GPU as far as the loss function (using the CPU just to add the losses), but this would mean you will need to use the native tensorflow optimizer instead of the keras one, so more work and less backward compatibility.

@pengpaiSH

 for g in range(n_gpus):
        # Work on GPU number g.
        with tf.device('/gpu:' + str(g)):
            # Obtain the g-th slice of the batch.
            slice_g = Lambda(slice_batch, lambda shape: shape, 
                            arguments={'n_gpus':n_gpus, 'part':g})(x)
            # Apply model on the batch slice.
            towers.append(model(slice_g))

@barrykui
It seems L is not an integer; perhaps the division in the line
L = sh[0] / n_gpus
resulted in a fraction. That is not supposed to happen if both n_gpus and sh[0] are int32; check if that's the case.

@themummy
Sorry, I'm using TF specific idioms (with tf.device()), no Theano.

@jonilaserson - My models train 4X slower with TF backend than Theano :\

@jonilaserson I get the same not an int error. When I look at that var sh it is actually None, as in the None that goes to indicate variable batch size.
For instance, I get that sh is (None, 40, 120).

What TF version are you using?


It seems to be 0.11.0rc0.
It's a shared machine, so upgrading might take a couple of days.

@jonilaserson Thanks for your reply. Could you please give more details on what slice_g does, and why it should be a Lambda layer if it only attempts to get a piece of the batch data? Besides, why should we concatenate the results computed on different GPUs instead of averaging them?

@jonilaserson you are right. The information is shown below. I am really confused and don't know how to solve this problem.

>>print(K.shape(x))
Tensor("Shape_3:0", shape=(3,), dtype=int32, device=/device:GPU:0)
>>print(sh[0])
Tensor("strided_slice_2:0", shape=(), dtype=int32, device=/device:GPU:0)   
>>print(type(sh))
<class 'tensorflow.python.framework.ops.Tensor'>   
>>print(L)
Tensor("truediv:0", shape=(), dtype=float64, device=/device:GPU:0)  

@barrykui
Can you try doing floor division?
L = sh[0] // n_gpus

@jonilaserson Yeah! but still float64.

a = sh[0] // n_gpus 
x = Tensor("input1_1:0", shape=(?, 100, 4), dtype=float64, device=/device:CPU:0)
n_gpus = 2
part = 0

@jonilaserson I confirm that Tensorflow 0.11.0rc2 (the latest version) still triggers the error.

Floor division won't work either because it starts as None.

>>print (K.int_shape(x))
(None, 40, 120)
>>print (K.shape(x))
Tensor("Shape_2:0", shape=(3,), dtype=int32, device=/device:GPU:0)
>>sh=K.shape(x)
>> print(sh)
Tensor("Shape_3:0", shape=(3,), dtype=int32, device=/device:GPU:0)
>>print(sh[0])
Tensor("strided_slice_11:0", shape=(), dtype=int32, device=/device:GPU:0)

edit: Why is this code called during model creation instead of training or prediction? At compile time you have no idea what batch size you are going to get, right?

@fercook Not the batch size; it's the channel size (or slice size), or maybe, more precisely, what we would call the width.

x = Input( (96,96,1), name="input1")
...
x = Input(model.input_shape[1:], name=model.input_names[0])

So it is run on multiple slices (channels), not the samples. Right?

@themummy Try using the data format (batch_size, height, width, channels) for TensorFlow; it improves speed.

@anewlearner My experimental results are similar to yours. When I use a single GPU, mnist_cnn.py takes 104 seconds to run 12 epochs, versus 77 seconds with two GPUs, batch_size=128.

@pengpaiSH

The speed improvement is not as large as desired. Maybe we can try TensorFlow directly. More work, faster speed.

Besides, I have some doubts about jonilaserson's code.

def slice_batch(x, n_gpus, part):
    """
    Divide the input batch into [n_gpus] slices, and obtain slice no.[part].
    i.e. if len(x)=10, then slice_batch(x, 2, 1) will return x[5:].
    """
    sh = K.shape(x)
    L = sh[0] / n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]

Assuming this is a 2D case, the shape of x is (dim1/rows, dim2/cols, channels/num_feature_maps). In slice_batch, dim1/rows is divided instead of channels/num_feature_maps. That seems strange.
I want to test the accuracy of the code, but all of my GPUs are in use and I don't have time now. Can you compare the results of using multiple GPUs (two variants: jonilaserson's code, and jonilaserson's code with my change below) against using only one GPU? If you have time to do this, try more epochs, because of the random initialization.

Thanks.

To divide the samples according to channels, try

# 2d case
def slice_batch(x, n_gpus, part):
    sh = K.shape(x)
    L = sh[-1] / n_gpus
    if part == n_gpus - 1:
        return x[:, :, part*L:]
    return x[:, :, part*L:(part+1)*L]

@anewlearner I think you are mixing up the data shape of x, which is (number of samples, rows, cols, channels), assuming you are using the TensorFlow backend. Thus, it has nothing to do with a 2D case or 3D case.

@anewlearner Er... why should we set x by skipping input_shape[0]?

@pengpaiSH
You are right :)

@anewlearner Totally confused now...

@anewlearner What if x = Input(model.input_shape, name=model.input_names[0])?

@pengpaiSH
It should be x = Input(model.input_shape[1:], name=model.input_names[0]).
model.input_shape = (None, rows, cols, channels).
In the definition of Input,

def Input(shape=None, batch_shape=None,
          name=None, dtype=K.floatx(),
          tensor=None):
    # Arguments
        shape: a shape tuple (integer), not including the batch size.
            For instance, `shape=(32,)` indicates that the expected input
            will be batches of 32-dimensional vectors.
        batch_shape: a shape tuple (integer), including the batch size.
            For instance, `batch_shape=(10, 32)` indicates that
            the expected input will be batches of 10 32-dimensional vectors.
            `batch_shape=(None, 32)` indicates batches of an arbitrary number
            of 32-dimensional vectors.

@anewlearner Thanks for your explanation. Besides, would you please explain why a lambda layer could distribute the computation over different GPUs by splitting the batch data? slice_g = Lambda(slice_batch, lambda shape: shape, arguments={'n_gpus':n_gpus, 'part':g})(x)

@pengpaiSH Dear peng, my wechat id is xukui1347. Could we make friends on wechat? I would benefit a lot from you. Thank you.

@barrykui Sure, done!

@pengpaiSH
It might be used to divide the input. Lambda is a layer in Keras.
From the definition of Lambda:

def __init__(self, function, output_shape=None, arguments={}, **kwargs):
'''
# Arguments
function: The function to be evaluated.
    Takes input tensor as first argument.
output_shape: Expected output shape from function.
    Can be a tuple or function.
    If a tuple, it only specifies the first dimension onward;
         sample dimension is assumed either the same as the input:
         `output_shape = (input_shape[0], ) + output_shape`
         or, the input is `None` and the sample dimension is also `None`:
         `output_shape = (None, ) + output_shape`
    If a function, it specifies the entire shape as a function of the
    input shape: `output_shape = f(input_shape)`
arguments: optional dictionary of keyword arguments to be passed
    to the function.
'''
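A toy sketch of the mechanism being discussed (shapes are made up): the Lambda layer just wraps an arbitrary tensor function, and because each Lambda in to_multi_gpu is built inside a different tf.device scope, its slicing op runs on that device.

from keras import backend as K
from keras.layers import Input, Lambda

def take_first_half(x):
    # K.shape(x)[0] is the dynamic batch size; keep the first half of the batch.
    n = K.shape(x)[0] // 2
    return x[:n]

inp = Input(shape=(32,))
half_batch = Lambda(take_first_half, lambda shape: shape)(inp)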

@jonilaserson Thanks for posting the above code. Although I second the question of @pengpaiSH : why do you simply concatenate the results returned from the GPUs? Doesn't it make sense to average these results, which is the typical protocol in batching? See for example the following two multi-gpu tensorflow examples, where the returned gradients are indeed explicitly averaged:

https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py#L137

https://github.com/tensorflow/models/blob/master/inception/inception/inception_train.py#L169

The results from the GPUs are not the loss values but the actual outputs
from the networks, so you can't average them. I don't know how to directly
average the losses and still work (and train) within the keras framework.


For the time being, in order to make good use of multiple GPUs, I am doing my deep learning programs with MXNET, which I only specify the GPU IDs and the lib will do everything it needs under the hood.

A simple example:

keras.backend.gpu_setup = ["gpu0", "gpu1", ...]

# Proceed to code in Keras as usual. All of the GPU complexity is handled in the background.

Would hate to have to go to MXNET for this. Keras is truly an amazing library. This would make Keras the best DL library by far in my opinion.

For the time being, I found https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py to be a nice stand-alone solution to achieve data parallelism. Explained by the author here: https://medium.com/@kuza55/transparent-multi-gpu-training-on-tensorflow-with-keras-8b0016fd9012#.1x6yd12n3

But yeah, I too am greatly looking forward to bonafide multi-GPU integration!

@pGit1 This is exactly what I think and I need!!! +1

@samuelBB I could not open the link URL that you provided: https://medium.com/@kuza55/transparent-multi-gpu-training-on-tensorflow-with-keras-8b0016fd9012#.1x6yd12n3

@pengpaiSH Maybe try the link without "#.1x6yd12n3" at the end. Or just try searching "Transparent Multi-GPU Training on TensorFlow with Keras" on Google, and it should be one of the top hits (from medium.com). Let me know if you get it; I can post the example code if you need...

@samuelBB This function is great, but I get errors for any model that contains batch normalization.

ValueError: Variable batchnormalization_2_running_mean/biased already exists, disallowed. Did you mean to set reuse=True in VarScope? Originally defined at:

I can't figure out a workaround so that I can use batch normalization. But if I remove the batch normalization layers, I do get a 2x speedup using 2 gpus (bn removed for both trials)!

@tstandley Try setting mode=2 for the batch normalization: BatchNormalization(mode=2). This will be slightly different behavior than the default, as explained in the Keras docs. If that fixes your issue, you can actually get the default mode=0 to work by updating your Keras to the latest commit from github (specifically, sha 771010f), which will probably be available in the next pypi release.
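For reference, a minimal sketch of the suggested change (Keras 1.x syntax; in mode=2 the layer normalizes each batch with its own statistics at both training and test time, rather than maintaining running averages the way the default mode=0 does):

from keras.layers.normalization import BatchNormalization

# Per-batch statistics at both train and test time; this avoided the variable-reuse error reported above.
bn_layer = BatchNormalization(mode=2)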

@samuelBB Dude, you're awesome! I've been spending so much time trying in vain to fix that issue. Setting the mode to 2 works.

I don't really understand the documentation on the modes. What's the difference? What is mode 1?

@samuelBB, the link you provided (from kuza55) is doing exactly what my
code is doing above.


@jonilaserson Indeed, I noticed that it's the same method. However, when I tried to integrate your code from above into my pipeline, it gave some unclear errors; kuza55's code worked smoothly for me without modification (hence I didn't bother to figure out the issue). Your example worked fine on my machine, but I am using more complex models (e.g. resnets).

@jonilaserson I used to train my model on a single GPU with a batch size of 20 without a problem. But now when I try training with 4 GPUs and 4x the batch size, I get an Out of Memory error. If I understood it correctly, for data parallelism the batch will be split into 4 parts and loaded onto the 4 GPUs respectively. Could you shed some light on this?

@samuelBB @KlaymenGC I experienced a similar thing. In the post I linked from kuza55 above, it's recommended to multiply the batch size by the number of GPUs you're using. This seems to make sense, and yet the program OOMs unless I reduce the batch size a lot, which is undesirable. I think I'm going to stick to one GPU for now, until the long-awaited built-in Keras support for multiple GPUs.

@samuelBB @KlaymenGC These methods implement data parallelism. With data parallelism, only neural activations can be spread across GPUs. Parameters must be copied to each GPU. Both parameters and activations take memory. For convolutional neural networks, the parameter cost is typically insignificant compared to the activations, so data parallelism works well for memory distribution and you can increase the batch size almost linearly. On the other hand, parameters are the bulk of the memory usage for RNNs and densely connected layers. For stacked RNNs, model parallelism can sometimes be made to work well and can be used to store parameters on separate GPUs. Model parallelism is typically more difficult to deal with, though, and you have to determine how to split your model manually. On the plus side, I think Keras already supports model parallelism: just use with tf.device to determine which layers should run on which GPUs, as in the sketch after this comment.

That said, it doesn't really make sense to wait for first-party Keras support, because I don't think that will fix your OOM issue. When first-party Keras support does land, it'll be in the form of data parallelism and/or model parallelism and have the same problems you have now.
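For illustration, a rough sketch of the kind of manual split described above for a stacked RNN (the sizes are made up; each layer's weights live on the GPU where the layer is defined):

import tensorflow as tf
from keras.layers import Input, LSTM, Dense
from keras.models import Model

inp = Input(shape=(100, 256))

with tf.device('/gpu:0'):
    # First recurrent layer (and its weight matrices) on GPU 0.
    h = LSTM(1024, return_sequences=True)(inp)

with tf.device('/gpu:1'):
    # Second recurrent layer and the classifier on GPU 1.
    h = LSTM(1024)(h)
    out = Dense(10, activation='softmax')(h)

model = Model(input=inp, output=out)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')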

Well, if that is true, that is just sad... and it makes an MXNet backend that much more appealing.


@pGit1 I think you misunderstand me. It is not a limitation of Keras. This is how deep learning works. Unless you want to get VERY researchy, you have to choose data parallelism or model parallelism (or a combination). No backend change could fix that. These two options have their various drawbacks, and those drawbacks will exist no matter what engine you are using. Like I said, as of right now, you can use both parallelization techniques with Keras (with the TF backend).

On a more practical note, you should simply use the multi_gpu.py code (which implements data parallelism) and change the batch size until it runs. The batch size will be no smaller than it would for a single GPU, but you will get an almost linear speedup with more GPU's.

EDIT: I previously stated erroneously that multi_gpu.py used model parallelism. It does not. It very clearly uses data parallelism.

Thank you for the clarification. This makes sense!


@tstandley isn't multi_gpu.py using data parallelism? As well as this: https://medium.com/@kuza55/transparent-multi-gpu-training-on-tensorflow-with-keras-8b0016fd9012#.hjr9mjdcn
The model is replicated on several GPUs, and the input batch is sliced.

@KlaymenGC Thanks for spotting my typo. I've fixed it. For the record, multi_gpu.py does indeed use data parallelism!

+1 to everything tstandley said about the batch size and data parallelization.

One comment, though: it might be better to join the different GPU towers at the loss-function level (instead of the model-output level), since as it is the loss-function computation happens on the CPU rather than the GPU, which might slow things down. @tstandley, what do you think?

I wonder why multi_gpu.py tells the CPU to do the concatenation. Not sure that affects where the loss calculation happens, but the bigger issue is where the gradient update happens. I'm not sure about any of these. I think a built-in Keras implementation could avoid the concatenation step entirely by computing the loss for each portion of the batch on its GPU and just averaging them, possibly on GPU0.

I also don't understand what happens to the weights in normal training/parallel training. If the update happens in the GPU, then for multi-gpu training, the result would have to be re-propagated to the rest of the GPU's, which is a special step. It doesn't seem efficient to transfer possibly hundreds of megabytes of weights for every mini-batch. Maybe someone who knows Keras better could comment.

@jonilaserson @tstandley I don't know if you have encountered this strange problem, but for me the weights file that I save from multi-GPU training is always about 50KB smaller than the weights file from 1-GPU training. The accuracy and loss looked normal during both training processes, but the prediction result of the multi-GPU-trained model is a total mess... I tried to find the cause but so far no success; could you share some ideas on this one?

I updated to the most recent version of keras and multi_gpu.py is no longer working for me. I get an error:

File "/home/tstand/Dropbox/image2weight/volume_model/multi_gpu.py", line 31, in make_parallel
outputs = model(inputs)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 569, in __call__
self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 632, in add_inbound_node
Node.create_node(self, inbound_layers, node_indices, tensor_indices)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 164, in create_node
output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 2235, in call
output_tensors, output_masks, output_shapes = self.run_internal_graph(inputs, masks)
File "/usr/local/lib/python3.5/dist-packages/keras/engine/topology.py", line 2378, in run_internal_graph
computed_mask))
File "/usr/local/lib/python3.5/dist-packages/keras/layers/normalization.py", line 130, in call
broadcast_running_mean = K.reshape(self.running_mean, broadcast_shape)
File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 1228, in reshape
return tf.reshape(x, shape)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 2448, in reshape
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 493, in apply_op
raise err
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 490, in apply_op
preferred_dtype=default_dtype)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 669, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 176, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 165, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 441, in make_tensor_proto
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/tensor_util.py", line 441, in
tensor_proto.string_val.extend([compat.as_bytes(x) for x in proto_values])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 65, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got 1

Any idea what's going on?

The line in question is:
outputs = model(inputs)
So maybe models can no longer be called like this?

Hi,

I recently updated to Keras 2.0 and the code in multi_gpu.py broke. Below is the error trace. Nothing changed in the way I was using this code. Any thoughts?


ValueError Traceback (most recent call last)
in ()
2
3 # this uses the TensorFlow backend to spread computation on multiple GPUs
----> 4 model_gpu = make_parallel(model, GPUS)

/home/adalbert/nbserver/urban-environments/keras-utils/multi_gpu.pyc in make_parallel(model, gpu_list)
36 for x in model.inputs:
37 input_shape = tuple(x.get_shape().as_list())[1:]
---> 38 slice_n = Lambda(get_slice, output_shape=input_shape, arguments={'idx':i,'parts':gpu_count})(x)
39 inputs.append(slice_n)
40

/usr/local/lib/python2.7/dist-packages/keras/engine/topology.pyc in __call__(self, inputs, **kwargs)
552
553 # Actually call the layer, collecting output(s), mask(s), and shape(s).
--> 554 output = self.call(inputs, **kwargs)
555 output_mask = self.compute_mask(inputs, previous_mask)
556

/usr/local/lib/python2.7/dist-packages/keras/layers/core.pyc in call(self, inputs, mask)
657 if 'mask' in arg_spec.args:
658 arguments['mask'] = mask
--> 659 return self.function(inputs, **arguments)
660
661 def compute_mask(self, inputs, mask=None):

/home/adalbert/nbserver/urban-environments/keras-utils/multi_gpu.pyc in get_slice(data, idx, parts)
13 print shape
14 print shape[:1] // parts, shape[:1]
---> 15 size = tf.concat(0, [ shape[:1] // parts, shape[1:] ])
16 stride = tf.concat(0, [ shape[:1] // parts, shape[1:]*0 ])
17 start = stride * idx

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.pyc in concat(values, axis, name)
1027 ops.convert_to_tensor(axis,
1028 name="concat_dim",
-> 1029 dtype=dtypes.int32).get_shape(
1030 ).assert_is_compatible_with(tensor_shape.scalar())
1031 return identity(values[0], name=scope)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in convert_to_tensor(value, dtype, name, preferred_dtype)
635 name=name,
636 preferred_dtype=preferred_dtype,
--> 637 as_ref=False)
638
639

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in internal_convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
700
701 if ret is None:
--> 702 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
703
704 if ret is NotImplemented:

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.pyc in _autopacking_conversion_function(v, dtype, name, as_ref)
903 if dtype is not None and dtype != inferred_dtype:
904 return NotImplemented
--> 905 return _autopacking_helper(v, inferred_dtype, name or "packed")
906 # pylint: enable=invalid-name
907

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.pyc in _autopacking_helper(list_or_tuple, dtype, name)
866 elems_as_tensors.append(
867 constant_op.constant(elem, dtype=dtype, name=str(i)))
--> 868 return gen_array_ops._pack(elems_as_tensors, name=scope)
869 else:
870 return converted_elems

/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.pyc in _pack(values, axis, name)
2039 A Tensor. Has the same type as values. The packed tensor.
2040 """
-> 2041 result = _op_def_lib.apply_op("Pack", values=values, axis=axis, name=name)
2042 return result
2043

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.pyc in apply_op(self, op_type_name, name, **keywords)
761 op = g.create_op(op_type_name, inputs, output_types, name=scope,
762 input_types=input_types, attrs=attr_protos,
--> 763 op_def=op_def)
764 if output_structure:
765 outputs = op.outputs

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in create_op(self, op_type, inputs, dtypes, input_types, name, attrs, op_def, compute_shapes, compute_device)
2327 original_op=self._default_original_op, op_def=op_def)
2328 if compute_shapes:
-> 2329 set_shapes_for_outputs(ret)
2330 self._add_op(ret)
2331 self._record_op_seen_by_control_dependencies(ret)

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in set_shapes_for_outputs(op)
1715 shape_func = _call_cpp_shape_fn_and_require_op
1716
-> 1717 shapes = shape_func(op)
1718 if shapes is None:
1719 raise RuntimeError(

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.pyc in call_with_requiring(op)
1665
1666 def call_with_requiring(op):
-> 1667 return call_cpp_shape_fn(op, require_shape_fn=True)
1668
1669 _call_cpp_shape_fn_and_require_op = call_with_requiring

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.pyc in call_cpp_shape_fn(op, input_tensors_needed, input_tensors_as_shapes_needed, debug_python_shape_fn, require_shape_fn)
608 res = _call_cpp_shape_fn_impl(op, input_tensors_needed,
609 input_tensors_as_shapes_needed,
--> 610 debug_python_shape_fn, require_shape_fn)
611 if not isinstance(res, dict):
612 # Handles the case where _call_cpp_shape_fn_impl calls unknown_shape(op).

/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/common_shapes.pyc in _call_cpp_shape_fn_impl(op, input_tensors_needed, input_tensors_as_shapes_needed, debug_python_shape_fn, require_shape_fn)
674 missing_shape_fn = True
675 else:
--> 676 raise ValueError(err.message)
677
678 if missing_shape_fn:

ValueError: Dimension 0 in both shapes must be equal, but are 1 and 3
From merging shape 0 with other shapes. for 'tower_0/lambda_1/concat/concat_dim' (op: 'Pack') with input shapes: [1], [3].

@adrianalbert
Change the code in multi_gpu from:
size = tf.concat(0, [shape[:1] // parts, shape[1:]])
stride = tf.concat(0, [ shape[:1] // parts, shape[1:]*0 ])

to:
size = tf.concat([shape[:1] // parts, shape[1:]], 0)
stride = tf.concat([shape[:1] // parts, shape[1:] * 0], 0)

@avolkov1 Are you able to run multi_gpu.py now? Some people like me are stuck with an "Incompatible shapes" error described here: https://github.com/kuza55/keras-extras/issues/7

@Eric2333 Yes, I'm able to run it. I modified it slightly. I don't know if this multi-gpu parallelism is correct to begin with, but it seems to run on multiple GPUs with the fixes I mentioned above. Here's the whole file code with modifications I made to get rid of warnings for Keras 2.0 and a few mods for my own purposes:

# ref: https://raw.githubusercontent.com/kuza55/keras-extras/master/utils/multi_gpu.py @IgnorePep8
from keras import backend as K
from keras.layers.core import Lambda
from keras.models import Model
# from keras.layers import merge
from keras.layers.merge import concatenate

if K.backend() == 'tensorflow':
    import tensorflow as tf  # @UnresolvedImport
    from tensorflow.python.client import device_lib

__all__ = ('make_parallel', 'get_available_gpus',)

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

def make_parallel(model, gdev_list):
    '''
    :param gdev_list: List of gpu devices i.e. ['/gpu:0', '/gpu:1', ...]
    '''
    gpu_count = len(gdev_list)

    def get_slice(data, idx=0, parts=0):
        shape = tf.shape(data)
        size = tf.concat([shape[:1] // parts, shape[1:]], 0)
        stride = tf.concat([shape[:1] // parts, shape[1:] * 0], 0)
        start = stride * idx
        return tf.slice(data, start, size)

    outputs_all = [[] for _ in range(len(model.outputs))]

    # Place a copy of the model on each GPU, each getting a slice of the batch
    for idev, gdev in enumerate(gdev_list):  # range(gpu_count):
        with tf.device(gdev):  # tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % idev) as _:  # as scope
                inputs = []
                # Slice each input into a piece for processing on this GPU
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_n = Lambda(
                        get_slice, output_shape=input_shape,
                        arguments={'idx': idev, 'parts': gpu_count})(x)
                    inputs.append(slice_n)

                outputs = model(inputs)

                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save all the outputs for merging back together later
                for l in range(len(outputs)):
                    outputs_all[l].append(outputs[l])

    # merge outputs on CPU
    with tf.device('/cpu:0'):
        merged = []
        for outputs in outputs_all:
            # merged.append(merge(outputs, mode='concat', concat_axis=0))
            merged.append(concatenate(outputs, 0))

        return Model(inputs=model.inputs, outputs=merged)
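A hypothetical usage sketch, assuming the code above is saved as multi_gpu.py (the toy model and data are made up; get_available_gpus() is expected to return something like ['/gpu:0', '/gpu:1']):

import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

from multi_gpu import make_parallel, get_available_gpus

# A tiny single-GPU model built with the functional API.
inp = Input(shape=(100,))
out = Dense(1)(Dense(64, activation='relu')(inp))
model = Model(inputs=inp, outputs=out)

# Replicate it across all visible GPUs, then compile the parallel model.
model = make_parallel(model, get_available_gpus())
model.compile(optimizer='rmsprop', loss='mse')

x = np.random.rand(1024, 100)
y = np.random.rand(1024, 1)
model.fit(x, y, batch_size=256, epochs=2)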

@avolkov1 What's your tensorflow version? It seems that my error comes from the callbacks function.

@Eric2333
Version 1.0.1. I am using a docker container for tensorflow:
tensorflow/tensorflow:1.0.1-devel-gpu

Taken from docker hub: https://hub.docker.com/r/tensorflow/tensorflow/tags/

I use nvidia-docker to run the container.

Here is an updated version of to_multi_gpu() that is compatible with Keras 2.0:

import tensorflow as tf  # needed for the tf.device scopes below
from keras import backend as K
from keras.models import Model
from keras.layers import Input
from keras.layers.core import Lambda
from keras.layers.merge import Concatenate

def slice_batch(x, n_gpus, part):
    """
    Divide the input batch into [n_gpus] slices, and obtain slice no. [part].
    i.e. if len(x)=10, then slice_batch(x, 2, 1) will return x[5:].
    """
    sh = K.shape(x)
    L = sh[0] / n_gpus
    if part == n_gpus - 1:
        return x[part*L:]
    return x[part*L:(part+1)*L]


def to_multi_gpu(model, n_gpus=2):
    """Given a keras [model], return an equivalent model which parallelizes
    the computation over [n_gpus] GPUs.

    Each GPU gets a slice of the input batch, applies the model on that slice
    and later the outputs of the models are concatenated to a single tensor, 
    hence the user sees a model that behaves the same as the original.
    """
    with tf.device('/cpu:0'):
        x = Input(model.input_shape[1:], name=model.input_names[0])

    towers = []
    for g in range(n_gpus):
        with tf.device('/gpu:' + str(g)):
            slice_g = Lambda(slice_batch, lambda shape: shape, arguments={'n_gpus':n_gpus, 'part':g})(x)
            towers.append(model(slice_g))

    with tf.device('/cpu:0'):
        merged = Concatenate(axis=0)(towers)

    return Model(inputs=[x], outputs=[merged])


Thanks @jonilaserson ! I can run three GPUs now :) However, I have to compile my model before and after I call "to_multi_gpu". Any ideas?

model=Sequential()
model.add(Dense(32,activation='relu',input_dim=(28 * 28)))
model.add(Dense(16,activation='relu'))
model.add(Dense(10,activation='softmax'))

model.compile(optimizer=RMSprop(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model = multigpu.to_multi_gpu(model, n_gpus=3)
model.compile(optimizer=RMSprop(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, validation_split = 0.05, epochs=25, batch_size=64)
...
works!
model=Sequential()
model.add(Dense(32,activation='relu',input_dim=(28 * 28)))
model.add(Dense(16,activation='relu'))
model.add(Dense(10,activation='softmax'))

model.compile(optimizer=RMSprop(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
model = multigpu.to_multi_gpu(model, n_gpus=3)
...
RuntimeError: You must compile a model before training/testing. Use `model.compile(optimizer, loss)`.
model=Sequential()
model.add(Dense(32,activation='relu',input_dim=(28 * 28)))
model.add(Dense(16,activation='relu'))
model.add(Dense(10,activation='softmax'))

model = multigpu.to_multi_gpu(model, n_gpus=3)
model.compile(optimizer=RMSprop(lr=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
...
AttributeError: 'Sequential' object has no attribute 'input_names'

@abrahamrhoffman
If you want to run anyway, change this line

x = Input(model.input_shape[1:], name=model.input_names[0])

to

x = Input(model.input_shape[1:], name="test_name")

and call multigpu.to_multi_gpu(model, n_gpus=3) before model.compile.

@palloc @avolkov1 @jonilaserson I can load the model no problem using model-to-JSON and from-JSON. My problem is loading the weights of the multi-GPU model. I saved them with ModelCheckpoint as a .h5 file. Any help or suggestions appreciated.

@abnera ModelCheckpoint calls your model's save function. Because you have a multi-gpu model, you would like it to call the underlying model's save function instead. I wrote this code to monkeypatch the underlying model's save method onto the multi-gpu model:

new_model = Model(input=model.inputs, output=merged)
funcType = type(model.save)
# monkeypatch the save to save just the underlying model
def new_save(self_,filepath, overwrite=True):
    model.save(filepath, overwrite)
new_model.save=funcType(new_save, new_model)
return new_model

@abnera For me when I load the saved weights from the multi-GPU model, the model
won't give me the correct prediction. I have to save the model before data parallelism to make it work.

@tstandley Thanks for the quick reply. Are you saying that by using type(model.save), ModelCheckpoint knows about this function and saves the original (single-GPU) model? So all I need to do is create this function and ModelCheckpoint knows about it?

@abnera You need to replace the save function on the multi-gpu model with the save function of the single gpu model. The code I posted does that (or did several months ago).

@tstandley Yeah. dummy me. I simply created a local ModelCheckpoint Callback where I always save the original model instead of the gpu version.

Thanks!
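For reference, a minimal sketch of that kind of callback (the class name and arguments here are made up; it just keeps a reference to the original single-GPU model and saves it, since the multi-GPU wrapper shares its weights):

from keras.callbacks import Callback

class SaveOriginalModel(Callback):
    """Checkpoint the underlying single-GPU model instead of the multi-GPU wrapper."""

    def __init__(self, base_model, filepath):
        super(SaveOriginalModel, self).__init__()
        self.base_model = base_model
        self.filepath = filepath

    def on_epoch_end(self, epoch, logs=None):
        # The wrapper and the base model share weights, so saving the base model is enough.
        self.base_model.save(self.filepath, overwrite=True)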

For those of you who are curious I posted my experimentation with different data-parallelism strategies.
https://github.com/avolkov1/keras_experiments
python package/library: https://github.com/avolkov1/keras_experiments/tree/master/keras_exp

More specifically take a look at:
https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/_multigpu.py
https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/optimizers.py

Example cifar10:
https://github.com/avolkov1/keras_experiments/blob/master/examples/cifar10_cnn_mgpu.py

In the optimizers.py I implemented an experimental approach OptimizerMultiGPUMixin trying to use NCCL to average the gradients. This approach did not seem to yield any speedup. Perhaps someone smarter than me can tinker with this code and get something working.

@jonilaserson thanks for the update for Keras 2.0. I wonder whether this multi-GPU code can be used with the fit_generator function? If so, how can I use it? Thanks.

If you use a BatchNormalization layer with multiple GPUs, I would suggest building TensorFlow from the latest master (the fix is not included in 1.1.0 yet) due to this: https://github.com/tensorflow/tensorflow/pull/8906

Does anyone use this to_multi_gpu on Keras 2 successfully with more than 2 GPUs? My environment is a remote CentOS server which has 4 K80 cards (8 GPUs); when I try to call it on more than 2 GPUs, the server crashes and I lose the connection...

to_multi_gpu() is not working in my environment (Python 3.6, TF 1.1 compiled three days ago).

However, make_parallel() from avolkov1 works fine. Thanks!
BTW: make sure that for the training and validation sets, len(train_set) % number_of_gpus == 0.

This is the error message I get from to_multi_gpu():

Traceback (most recent call last):
File "main.py", line 273, in
sys.exit(main())
File "main.py", line 99, in main
testmodel.doc2vec_train()
File "/home/../NLP/tweetmodel.py", line 261, in doc2vec_train
self.tweetmodel.doc2VecModel_train_keras(self.parameter)
File "/home/.../NLP/tweetcluster.py", line 1490, in doc2VecModel_train_keras
model = to_multi_gpu(model, 2)
File "/home/.../NLP/tweetmodel.py", line 921, in to_multi_gpu
slice_g = Lambda(slice_batch, lambda shape: shape, arguments={'n_gpus': n_gpus, 'part': g})(x)
File "/home/.../develop/anaconda3/lib/python3.6/site-packages/keras/engine/topology.py", line 585, in __call__
output = self.call(inputs, **kwargs)
File "/home/../develop/anaconda3/lib/python3.6/site-packages/keras/layers/core.py", line 659, in call
return self.function(inputs, **arguments)
File "/home/.../NLP/tweetmodel.py", line 905, in slice_batch
return x[part * L:(part + 1) * L]
File "/home/../develop/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 497, in _SliceHelper
name=name)
File "/home/.../develop/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 655, in strided_slice
shrink_axis_mask=shrink_axis_mask)
File "/home/.../develop/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3568, in strided_slice
shrink_axis_mask=shrink_axis_mask, name=name)
File "/home/.../develop/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 590, in apply_op
param_name=input_name)
File "/home/.../develop/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 61, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'begin' has DataType float32 not in list of allowed values: int32, int64

which version of keras do you use?

I use this version of Keras:

pytest$ python -c "import keras; print(keras.__version__)"
Using TensorFlow backend.
2.0.4
        def slice_batch(x, n_gpus, part):
            sh = K.shape(x)         # x is dtype float32, sh is dtype int32
            L = sh[0] / n_gpus      # L is dtype float64
            if part == n_gpus - 1:
                return x[part * L:]
            return x[part * L:(part + 1) * L]      #  TypeError: Value passed to parameter 'begin' has 
                                #  DataType float64 not in list of allowed values: int32, int64

The cast L = tf.to_int32(L) might have solved the TypeError, BUT the model then runs on one GPU only.

@JoeSchnorcher @avolkov1 The error you mention is related to the batch_size not being a multiple of the number of GPUs. When you divide an input batch by the number of GPUs you get a floating-point value instead of an integer, and that breaks the code.

Error: TypeError: Value passed to parameter 'begin' has DataType float32 not in list of allowed values: int32, int64

Solution:
In the method def slice_batch(x, n_gpus, part)
Replace: L = sh[0] / n_gpus
With: L = sh[0] // n_gpus

The double slash ensures the value is truncated to an integer, for example 11 // 2 = 5 instead of 5.5.
That should fix your code (a corrected sketch of slice_batch follows below).
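
Putting the fix together, a minimal sketch of the corrected slice_batch, assuming the usual `from keras import backend as K` import and the variable names from the snippet above:

```python
from keras import backend as K


def slice_batch(x, n_gpus, part):
    """Return the `part`-th slice of the batch dimension of `x`."""
    sh = K.shape(x)
    L = sh[0] // n_gpus          # integer division keeps the slice indices int32
    if part == n_gpus - 1:
        return x[part * L:]      # last GPU takes any remainder
    return x[part * L:(part + 1) * L]
```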

I also cced @avolkov1 so that he could fix it in his repository.

Thanks @abnera, this solved the error. Interesting, the batch_size was 128 and the number of GPUs is 2......

@JoeSchnorcher it could also be that your very last batch is smaller than 128 and it crashes at the last batch of an epoch. For example, with 131 samples the first batch is no problem, but the second (last) batch of 3 samples split over 2 GPUs gives you an error, assuming you're exhausting all the samples in each epoch (see the sketch below). Anyway, glad to help.
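
As a simple illustration of that failure mode and one way to sidestep it, you could drop the leftover samples when computing steps_per_epoch (purely illustrative numbers):

```python
# With 131 samples and a batch size of 128, the last batch has only 3 samples,
# which does not split evenly across 2 GPUs. Dropping the remainder avoids it.
n_samples = 131
batch_size = 128
n_gpus = 2

steps_per_epoch = n_samples // batch_size    # = 1, the 3 leftover samples are skipped
assert batch_size % n_gpus == 0              # full batches still split evenly
```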

@abnera Thanks. I'll post my latest code soon. I added various features I was experimenting with, such as contrib.nccl and data_flow_ops.StagingArea. None of them made the code any faster, though, and sometimes they slowed it down. But perhaps someone else will have better luck.

I want to experiment with the Dataflow implementation from Tensorpack. I am not able to saturate GPU utilization with this multi-GPU approach, so I'm experimenting with different methods of staging data. My goal is a simple and performant mini-framework/library: use Keras for the model layer and something like Dataflow for the data layer. Tensorpack mostly does what I'm trying to achieve, but it's very complicated, with too many classes. I'm hoping to just re-use the Dataflow portion.

At this point, I really think multi_gpu.py should be integrated into the keras repo. This way people can do pull requests instead of trying to keep track of tweaks posted in an issue.

I think it should be integrated into Keras as well; having to copy all that data (I'm doing dense predictions) onto the CPU, instead of being able to calculate the loss on each GPU and transfer only that, kills performance.

I think Keras should simply have an interface that lets you specify GPU ids to be used at training and testing time. This multi-GPU thing with all its tweaks and nuances is a huge annoyance.

The contributions are extremely valuable and should be standard in the Keras library for sure.

@avolkov1 does doing

 with tf.device(gdev):  # tf.device('/gpu:%d' % i):
    with tf.name_scope('tower_%d' % idev) as _:  # as scope
        inputs = []
        # Slice each input into a piece for processing on this GPU
        for x in model.inputs:
            input_shape = tuple(x.get_shape().as_list())[1:]
            slice_n = Lambda(
                get_slice, output_shape=input_shape,
                arguments={'idx': idev, 'parts': gpu_count})(x)
            inputs.append(slice_n)

require all the data to be passed to each GPU even though each one will only use a slice of it? Or are things working just fine?

@gsabran Yes, I think it might. So that's wasteful of GPU memory. I just pushed my latest tweaks for this multi-gpu stuff. Refer to my implementation here:
https://github.com/avolkov1/keras_experiments

And here for the actual implementation details:
https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/_multigpu.py

I create the slice under cpu context and apply the model under the dev context thus avoiding this copy I think.

# snippet
        for idev, dev in enumerate(gdev_list):
            # TODO: The last slice could cause a gradient calculation outlier
            # when averaging gradients. Maybe insure ahead of time that the
            # batch_size is evenly divisible by number of GPUs, or maybe don't
            # use the last slice.
            with tf.device('/cpu:0'):
                slices = []  # multi-input case
                for ix, x in enumerate(model.inputs):
                    slice_g = Lambda(
                        slice_batch,  # lambda shape: shape,
                        lambda shape: x.shape.as_list(),
                        name='stage_cpuSliceIn{}_Dev{}'.format(ix, idev),
                        arguments={'ngpus': ngpus, 'part': idev,
                                   'dev': dev})(x)
                    slices.append(slice_g)
                    # print('SLICE_G: {}'.format(slice_g))  # DEBUG
                # print('SLICES: {}'.format(slices))  # DEBUG

            with tf.device(dev), \
                    tf.variable_scope(global_scope, reuse=idev > 0), \
                    tf.name_scope('tower_%i' % idev):
                modeltower = model_(slices)
                towers.append(modeltower)
# ...

Test it and let me know. I was able to use huge batch sizes in the cifar example that wouldn't fit on a single GPU, but they seemed to run fine with multiple GPUs. The huge batch size messes up learning convergence, but for testing I just wanted to verify that I could overcome the memory limitation of a single GPU.

With gradient averaging I ran out of memory, but I was still able to max out slightly bigger batch sizes than a single GPU. Don't use gradient averaging if you are memory limited. It's not necessary anyway, because TF implicitly adds the gradients (some kind of magic via the TF parameter-server setup and Keras shared layers, I think). When averaging gradients, the multi-GPU performance is not any faster compared to single GPU, nor is convergence better compared to not averaging gradients, at least in my testing.

@avolkov1 Glad you're doing this! Keep up the good work.

A few points:

  1. This was written for Python 2, right? It almost works with Python 3. I had to change
     from cStringIO import StringIO
     to
     from io import StringIO
  2. The part that overrides the compile method assumes your optimizer is an explicit object. I was using the string 'Adadelta' in my call to compile, which doesn't have the usenccl property (because it's a string). I wound up explicitly creating the optimizer instead, and that worked, but it would be nice if it worked with strings.

Finally:
Sadly, I couldn't get this to work for my model. It just kept printing warnings:
WARNING:tensorflow:Tried to colocate gradients/tower_0/model_2/model_1/block12_sepconv3_bn/moments/sufficient_statistics/count_grad/

@tstandley Yea, I primarily work with Python 2.7, but I'll add the fix to make it compatible with Python 3. These features are experimental. Maybe I'll add support for passing a string to be compatible with the Model interface, but besides the slight convenience it's actually bad style to have completely different expected types for a parameter (a string is not related to an Optimizer-derived class object).

Are you getting the warning only when specifying the NCCL option, or do you always get it when parallelizing your model for multi-GPU?

@avolkov1 How does your method compare to make_parallel() in terms of speed? Have you compared the two?

@avolkov1
How does your model save the weights after each epoch? I tried the way provided by @jonilaserson but I could not save the model, because Keras seemed to try saving all the weights from all the duplicate GPUs (it should not).

From your snippet the approach is nearly the same, so does yours also behave like that?

@vqdang
It works for me. The model weights and biases are shared between GPUs via the CPU (internals of TF via the parameter server). The saving needs to happen using the non-mgpu model instance, i.e. the model passed to the make_parallel function. The mgpu model augmentation injects data-slicing layers at the input and a concat layer at the output. It's not a typical Keras model; that's why I wrote a new class, ModelMGPU, to handle details like saving/loading the model via a reference to the non-mgpu model.

Just use PyTorch. This is such an annoying issue, with so many nuances that make it almost impossible to keep up with. This issue is one of Keras's biggest downfalls. The ideal state is to feed gpu_ids as a list through some API that contains a parameter "Optimize_for" that expects one of the following: ['Speed', 'Memory', 'Balanced'].

On Mon, Jun 12, 2017 at 11:22 AM, dylanrandle notifications@github.com
wrote:

@jonilaserson https://github.com/jonilaserson I'm getting the error:
ValueError: Concatenate layer should be called on a list of inputs from
to_multi_gpu.

I'm using the tensorflow.contrib version of Keras.

The call is: merged = Concatenate(axis=0)(towers) where 'towers' is a list
of models.

Is this correct?

Any help greatly appreciated.

Thank you,
Dylan

@avolkov1
@jonilaserson
I am experiencing a bizarre problem with both of your implementations: if I expand from 1 to 2 GPUs I can pretty much double the batch size without getting OOM, but if I go to 3 or 4 GPUs I get OOM at a batch size that is only slightly bigger than the 2-GPU one.
Something like this: going from 1 GPU (batch size 1x) to 2 GPUs (batch 1.6x) is okay, but going from 1 GPU (batch 1x) to 4 GPUs (batch 2.2x) gives me OOM.
I need to use larger batches to get good GPU saturation (P100s are monsters), so having this limitation doesn't allow me to fully use the resources I have.
Environment: Keras 2.05, TF 1.01

@Slanothy
Don't use the multi gpu optimizers that I defined here:
https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/optimizers.py

With those specialized optimizers, batch size doesn't scale with respect to memory. Use regular Keras optimizers (instead of the ones from keras_exp.multigpu.optimizers) with an instance of the ModelMGPU class (either by calling make_parallel or instantiating ModelMGPU directly).

If you're getting OOM with regular optimizers let me know. I would need a reproducible example.
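
For reference, a rough sketch of what that looks like. The import path and exact signature are from keras_exp and may differ between versions, and serial_model is assumed to be an already-built Keras Model:

```python
# Parallelize the serial model, then compile with a plain Keras optimizer
# instead of the experimental multi-GPU optimizers.
from keras_exp.multigpu import make_parallel  # path may differ between versions

parallel_model = make_parallel(serial_model, gdev_list=['/gpu:0', '/gpu:1'])
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')
parallel_model.fit(x_train, y_train, batch_size=512, epochs=5)
```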

@avolkov1
I am using the native optimizers as per your suggestion here https://github.com/avolkov1/keras_experiments/issues/1 and still getting this error...

All of the flags on make_parallel are kept at their defaults. I will try to come up with a good reproducible example, but it might be challenging due to differences in the GPUs we use (unless you have a box with 4 16GB P100s). I can't share the data I use, unfortunately, but I use relatively long sequences as inputs (750 timesteps, 4 numbers each), embed them, and concatenate into an LSTM layer of size 300...

Let me know if you would like a skeleton of the code or if I should play around with flags...

Testing specific numbers: a batch of 1000 fits on 1 GPU, but only 1600 fits on 2 and 2000 to 2200 on 4 (with training OOMing after a few batches once in a while). I will think about the best approach to reproduce this issue without going into the specifics of my data...

@Slanothy
Yea, I have access to the hardware. Share whatever is shareable and I'll take a look. Send it to me at "[email protected]" if you'd like. If you are using a container maybe try these flags: --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 when starting up the container.

@avolkov1 I am getting Invalid argument "name" passed to K.function with the TensorFlow backend (in make_train_function) when using the latest version of your code. The previous version with no monkey patches worked...

The model prints accordingly and it works with 1 GPU.

@abnera
See this issue https://github.com/avolkov1/keras_experiments/issues/1 . You can add the name argument yourself for now or just comment out the monkey patch. It works for me

@abnera
@tRosenflanz
Sorry about that. I pushed the latest update for the monkey patch. Pull the latest changes. The monkey patched Function now has the following signature:

class Function(object):
    """Runs a computation graph.

    # Arguments
        inputs: Feed placeholders to the computation graph.
        outputs: Output tensors to fetch.
        updates: Additional update ops to be run at function call.
        name: a name to help users identify what this function does.
        enqueue_ops: List of ops to be run at function call for enqueue'ing.
    """

    def __init__(self, inputs, outputs, updates=None, name=None,
                 enqueue_ops=None, **session_kwargs):

I did this for the enqueue option in make_parallel function. If you're not using that enqueue option in the make_parallel function, then comment out the monkey patching in keras_exp/multigpu/_multigpu.py lines 36 and 37. Eventually I'll probably remove that monkey patching code because it's fragile. In this one case I'm modifying things under the hood instead of adhering to the public Keras API.

@avolkov1 et al.
I'm able to run keras_exp/multigpu out of the box without problems on 4 Titan Xp with the latest Keras/TF. Similar to what some describe above, occasionally I'll get a CUDA error that crashes training immediately after the first epoch, but it seems to resolve with tweaks to the batch size (it happens even with batch sizes that are divisible by 4).

tensorflow/stream_executor/cuda/cuda_dnn.cc:427] could not set cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
tensorflow/stream_executor/cuda/cuda_dnn.cc:427] could not set cudnn tensor descriptor: CUDNN_STATUS_BAD_PARAM
Aborted (core dumped)

In other situations I get this warning after the first epoch, but training goes on without problems: can anyone clarify what it means?

I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] 
PoolAllocator: After 3788 get requests, put_count=3769 evicted_count=1000 
eviction_rate=0.265322 and unsatisfied allocation rate=0.295407
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] 
Raising pool_size_limit_ from 100 to 110

@mptorr Hmm, based on some googling it seems like the CUDNN_STATUS_BAD_PARAM might be due to an empty array being processed by the TensorFlow session. If the bug is not due to the logic of how you are reading/feeding the data into TF/Keras, it might be that the multi-GPU code is not slicing the data properly (in which case I would like to identify this corner case and fix it). I found a few references to that bug via an internet search: google the bug

The other warning with pool_allocator I think is benign, although maybe it's a symptom of some issue. I've seen a use case where an out-of-memory (OOM) error was occurring. The fix was to instantiate the initial serial model under the tf.device('/cpu:0') context, i.e. pin it to the CPU prior to parallelizing for multi-GPU. I suggest that in the docstring of the make_parallel function in case an OOM error occurs (see the sketch after this comment).

If you can share a code snippet to reproduce the issue I can dig into it some more. Please file an issue https://github.com/avolkov1/keras_experiments/issues with a reproducible code snippet.
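
For reference, pinning the serial model to the CPU before parallelizing looks roughly like this. It's only a sketch: build_model is a placeholder for your own model-construction function, and the make_parallel import path and signature may differ between versions:

```python
# Sketch of the OOM workaround: instantiate the serial (template) model under
# a CPU device scope before handing it to the multi-GPU wrapper.
import tensorflow as tf
from keras_exp.multigpu import make_parallel  # path may differ between versions

with tf.device('/cpu:0'):
    serial_model = build_model()   # weights are created/pinned on the CPU

parallel_model = make_parallel(
    serial_model, gdev_list=['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3'])
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')
```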

Thanks so much @avolkov1. Yes, I had seen the thread on TF and had thoroughly checked my workflow for issues, which is why I was wondering whether keras_exp could be the source of CUDNN_STATUS_BAD_PARAM under specific circumstances (i.e., certain batch sizes).

I'll work on posting a reproducible code snippet for error #2: interestingly, the pool_allocator warning also seems to depend on batch size.

I have tried the to_multi_gpu function, but I got a 2x slower result than when I use a single GPU. My environment has 10 Nvidia Titan Xp GPUs. Does anybody know what the problem is?

@dasolyn , Does your server enable full connectivity between the 10 GPUs?

@barrykui Yes, I think so. Here is the output of TensorFlow when I train it: [screenshot of the TensorFlow device-placement log omitted]

@dasolyn I got similar results (multiple GPUs slower than 1). So far the problem seems to be that the gradients are not computed in parallel, but on the parameter-server device. See issue #7515 for more details and a work-in-progress implementation of gradient averaging.

@dasolyn @bzamecnik, maybe you could test barrykui/keras_multi_gpu, which I have tested before. Hope that code helps you.

@barrykui Thanks. Just from looking at the code, it seems it's just another version of the split/concat code floating around. There's no parallel gradient computation or gradient averaging.

According to my measurements of this code on 1, 2, or 4 GTX 1070 GPUs, the multi-GPU variant is slower and does not scale well. Time per epoch is taken around epoch 2-3, when the times stabilize.

1x GTX 1070
$ CUDA_VISIBLE_DEVICES=2 time python mnist_cnn.py
1:34.58 min
7s/epoch

2x GTX 1070
$ CUDA_VISIBLE_DEVICES=2,3 time python mnist_cnn.py
10s/epoch

4x GTX 1070
$ CUDA_VISIBLE_DEVICES=0,1,2,3 time python mnist_cnn.py
21s/epoch

Yes. Keras doesn't support multi-gpu. If you want to utilize your
environment properly (with near linear scaling) use MXNET or pyTorch.

@pGit1 This is simply not true. Keras does support multiple GPUs with tensorflow as this issue discussion shows. multi_gpu.py implements data parallelism. Data parallelism isn't useful for all architectures, so that might be the problem for @bzamecnik. That said, data parallelism is probably what most people want for most things, so it's also likely that @bzamecnik is actually encountering some kind of error. We'd need more details to diagnose.

For many architectures, you can use model parallelism with Keras and tensorflow as well. Pointers are in the code above.
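
For example, a rough sketch of model parallelism via device scopes (illustrative layer sizes; assumes the TensorFlow backend and at least two visible GPUs):

```python
# Place different layers of one model on different GPUs.
import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model

inp = Input(shape=(100,))
with tf.device('/gpu:0'):
    hidden = Dense(256, activation='relu')(inp)     # first part on GPU 0
with tf.device('/gpu:1'):
    out = Dense(10, activation='softmax')(hidden)   # second part on GPU 1

model = Model(inputs=inp, outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```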

When using this code https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py it seems that there is an error in the regularizers, as described in this issue https://github.com/kuza55/keras-extras/issues/22. I can't tell if this is caused by the way the model is copied and merged or if there is a flaw in the regularizers of Keras. This is with tensorflow 1.12.

```python
from keras import backend as K
from keras import initializers, regularizers
from keras.layers import Activation, BatchNormalization, Convolution2D


def conv2d_bn(x, nb_filter, nb_row, nb_col, padding='same', strides=(1, 1), bias=False):
    """
    Utility function to apply conv + BN.
    (Slightly modified from https://github.com/fchollet/keras/blob/master/keras/applications/inception_v3.py)
    """
    if K.image_data_format() == "channels_first":
        channel_axis = 1
    else:
        channel_axis = -1
    x = Convolution2D(nb_filter, (nb_row, nb_col),
                      strides=strides,
                      padding=padding,
                      use_bias=bias,
                      kernel_regularizer=regularizers.l2(0.00004),  # <---- causes error because no _loss
                      kernel_initializer=initializers.VarianceScaling(scale=2.0, mode='fan_in',
                                                                      distribution='normal',
                                                                      seed=None))(x)
    x = BatchNormalization(axis=channel_axis, momentum=0.9997, scale=False)(x)
    x = Activation('relu')(x)
    return x
```

I get the error:
"AttributeError: 'Model' object has no attribute '_losses'"
caused by outputs = model(inputs), which merges the outputs of the different splits into one model.

Don't use Keras for multi-GPU training. That is the solution. Use native TensorFlow, MXNet, or some other reasonably multi-GPU-friendly library.

@pGit1 it has already been pointed out that Keras can support multi-GPU training, so this kind of defeatist attitude is not helpful and is quite misleading for a lot of people trying to find a solution... I personally prefer https://github.com/avolkov1/keras_experiments/ for now for its ease of use.

@tRosenflanz Yes, I also came across https://github.com/avolkov1/keras_experiments/ recently and have been testing it. So far it seems to be working well indeed. Also, a problem with my previous measurements was a wrongly set-up machine. On a standard cloud instance with multiple GPUs I can observe a speed-up, especially with the avolkov1 code. Anyway, I'm writing a summary article with a lot of measurements, so stay tuned.

FYI - we just added an example of data-parallel distributed training with Keras using Horovod - https://github.com/uber/horovod/blob/master/examples/keras_mnist.py. It works both for multiple GPUs within the server, and across servers. Hope it helps.

I used the code from @jonilaserson, and it works. However, it seems that multi-GPU converged slower compared to a single GPU. Has anyone else observed the same?

@michelleowen you typically want to adjust the learning rate to the total # of GPUs across all the servers - here's an example of very simple scaling. Facebook published a paper with a more sophisticated strategy that works for a large number of GPUs.
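
Roughly, the simple linear scaling looks like this (illustrative numbers, plain Keras SGD):

```python
# Scale the base learning rate by the total number of GPUs/workers.
import keras

n_gpus = 4
base_lr = 0.001
optimizer = keras.optimizers.SGD(lr=base_lr * n_gpus, momentum=0.9)
```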

@alsrgv, thank you. This is very helpful. I will do some experiments to see how it works in my case.

I guess the function previously mentioned by @avolkov1 is finally coming into Keras:
https://github.com/fchollet/keras/blob/master/keras/utils/training_utils.py

@fernandoandreotti Yes and no. It's a cleaned-up variant of the function from kuza55. It has nice documentation and grabs the list of devices via device_lib instead of CUDA_VISIBLE_DEVICES. On the other hand, it's missing some things from avolkov1: slicing on the CPU, and save/load of the parameters of the original serial model. Since there's no wrapper class, the latter is not strictly necessary, but it might at least be documented.

Keras v2.0.9 now includes it (release notes). Despite the improvements that could still be made, I guess this issue should be closed.

Any example of how to use this in the docs?

Yes: https://keras.io/utils/#multi_gpu_model
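
A minimal usage sketch based on that docs page (variable names are illustrative; requires Keras >= 2.0.9):

```python
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=2)        # `model` is your template model
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')
parallel_model.fit(x_train, y_train, epochs=5, batch_size=256)
```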

You can also check out Horovod, which seems nice.

Is there any intention of making it work with CNTK too?

@avolkov1 @jonilaserson Is there an issue with saving models using ModelCheckpoint with a multi_gpu model? I used a few other callbacks and they worked fine, but ModelCheckpoint is the one that fails to save the model and throws an error after an epoch.

Code:

```python
import os

import numpy as np
import keras
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import multi_gpu_model


class MyCallBack(keras.callbacks.Callback):
    def __init__(self, callbacks, model):
        super().__init__()
        self.callback = callbacks
        self.model = model

    def on_epoch_begin(self, epoch, logs=None):
        self.callback.on_epoch_begin(epoch, logs=logs)

    def on_epoch_end(self, epoch, logs=None):
        self.callback.on_epoch_end(epoch, logs=logs)

    def on_batch_end(self, batch, logs=None):
        self.callback.on_batch_end(batch, logs=logs)

    def on_batch_begin(self, batch, logs=None):
        self.callback.on_batch_begin(batch, logs=logs)

    def on_train_begin(self, logs=None):
        self.callback.set_model(self.model)
        self.callback.on_train_begin(logs=logs)

    def on_train_end(self, logs=None):
        self.callback.on_train_end(logs=logs)


parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer=Adam(lr=lr_schedule(0)),
                       metrics=['accuracy'])
```

Setting up callbacks for fitting the model:

```python
filename = 'model_train_new.csv'
filepath = os.path.join(save_dir, model_name)
checkpoint = ModelCheckpoint(filepath=filepath, monitor='val_acc', verbose=1,
                             save_best_only=True)
cbk3 = MyCallBack(checkpoint, model)
callbacks = [cbk3]
```

Adding data augmentation provided by the Keras module:

```python
datagen = ImageDataGenerator(featurewise_center=False, samplewise_center=False,
                             featurewise_std_normalization=False,
                             samplewise_std_normalization=False,
                             zca_whitening=False, rotation_range=0,
                             width_shift_range=0.1, height_shift_range=0.1,
                             horizontal_flip=True, vertical_flip=False)

datagen.fit(x_train)
steps_per_epoch = int(np.ceil(x_train.shape[0] / float(batch_size)))
model_info = parallel_model.fit_generator(
    datagen.flow(x_train, y_train, batch_size=batch_size),
    steps_per_epoch=steps_per_epoch,
    validation_data=(x_test, y_test),
    epochs=epochs, verbose=1, workers=4,
    callbacks=callbacks)
```

@nbansal90

I had this same problem. ModelCheckpoint will not work with a multi-GPU model. You can change the parameter save_weights_only to True and this will work fine; HOWEVER, if you then want to do inference on a SINGLE GPU, the model will not load the weights properly, even if you load the checkpointed weights by name.

@fchollet

Kind of an urgent question: is there a way to train on multiple GPUs but save the weights in such a way that I can do inference on only a single GPU? I am not sure how to get this to work properly, as model.load_weights('/weights_path', by_name=True) does not work. I have to re-instantiate the network as a multi-gpu-model to load the weights properly. I may be missing something simple, though.

mmmm since it's urgent, maybe a dirty patch will do: couldn't you save the weights as matrices and then load them directly into the weights of the layers of a new (single GPU) model?

edit: saving/loading the weights of the example from the docs doesn't work? https://keras.io/utils/#multi_gpu_model

@fercook

Thanks for the quick response. I believe I have tried that. My weights were saved via the ModelCheckpoint callback for a multi-GPU model.

When I re-instantiate the model I cannot load the weights into my single-GPU model, because I get an error stating that I am trying to load weights into a model with one layer when it expects four layers (4 is the number of GPUs I was using).

edit:

edit: saving/loading the weights of the example from the docs doesn't work? https://keras.io/utils/#multi_gpu_model

That is correct. It does not work. Although I haven't tried the CPU device scope; will try and let you know. I've only used the ModelCheckpoint callback with save_weights_only=True and model.load_weights.

Did you double check that you are saving with the template model, not the multi_gpu one?

From the docs:

On model saving

To save the multi-gpu model, use `.save(fname)` or `.save_weights(fname)`
with the template model (the argument you passed to `multi_gpu_model`),
rather than the model returned by `multi_gpu_model`.

edit: sorry I just re-read that you are saving through the callback...how are you doing that? Is each GPU saving a different file (or overwriting it)?
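
In code form, that advice amounts to roughly this (a sketch; template_model is assumed to be the argument that was passed to multi_gpu_model):

```python
# Train with the parallel model, but save/load through the template model.
parallel_model.fit(x_train, y_train, epochs=5, batch_size=256)

template_model.save_weights('weights.h5')    # save via the template model
template_model.load_weights('weights.h5')    # load the same way for single-GPU inference
```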

@pGit1 Take a look at my example:
https://github.com/avolkov1/keras_experiments/blob/master/examples/cifar/cifar10_cnn_mgpu.py

Run like this to save weights:

python ./examples/cifar/cifar10_cnn_mgpu.py --epochs=3 --mgpu --checkpt --aug

Can then run again and it will load the checkpoint file and continue training. This will work with single GPU also.

CUDA_VISIBLE_DEVICES=0 python ./examples/cifar/cifar10_cnn_mgpu.py --epochs=3 --mgpu --checkpt --aug

I have a slightly different implementation for multigpu, but you can use the mutligpu implementation from Keras. Just wrap it in a class to use the non-multigpu model for saving and loading weights.
https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/_multigpu.py#L129

The essence of the wrapper class for saving/loading weights is:

    def __getattribute__(self, attrname):
        '''Override load and save methods to be used from the serial-model. The
        serial-model holds references to the weights in the multi-gpu model.
        '''
        # return Model.__getattribute__(self, attrname)
        if 'load' in attrname or 'save' in attrname:
            return getattr(self._smodel, attrname)

        return super(ModelMGPU, self).__getattribute__(attrname)

This works with fit_generator.

@fercook

Since the ModelCheckpoint is only saving the weights it may be overwriting it.

@avolkov1

Thank you! I'll take a look!!

@fercook

I've confirmed that the example from the docs will not work with Model Checkpoint call back either.
FYI: my callback code -

best_wts_callback = callbacks.ModelCheckpoint(mod_wt_path, save_weights_only=True, save_best_only=True)

@avolkov1

Your example seems like it may work, but I'm having trouble thinking of a simple example of how to use it. Your guidance would be much appreciated.

Is something like this feasible?

.
.
.
.
.
# model topology instantiation above
ser_model = keras.models.Model(inputs=x, outputs=out)
parallel_model = avolkov1.make_parallel(
    serial_model=ser_model,
    gdev_list=['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3'],
    ps_device='/cpu:0',
    model_class=avolkov1.ModelMGPU)

# callback to save best weights
mod_wt_path = './best_weights.hdf5'
best_wts_callback = callbacks.ModelCheckpoint(mod_wt_path, save_weights_only=True, save_best_only=True)

parallel_model.fit(X, y, callbacks=[best_wts_callback])

# Now I want to infer on a single GPU, so I load the saved weights??
ser_model.load_weights(mod_wt_path)

ser_model.predict(X_holdout)

Would something like this work? Actually I need a more exact version of what would actually work.

THANK YOU!

EDIT:

Looking at your CIFAR-10 example, it looks like something like this would work. I'm in a crunch, so I don't want to embark on the above journey if I am missing something glaring.

@avolkov1

In general I think these lines from the docstring in your code explain it all:

'''Override load and save methods of the multi-gpu model. The load and
save should correspond to the serial model's load and save.

In general one should easily be able to train in parallel on multiple GPUs, use callbacks to save weights during the parallel run, and load those saved weights back into the serial model that was parallelized in the first place (without having to re-instantiate the serial model as a parallel model). I think your code allows one to train on 8 GPUs but then load weights and infer on one. It should perhaps be an option in the >=2.0.9 implementation? Training with keras.utils.multi_gpu_model() works great and definitely provides a speed-up. It just doesn't play nicely with ModelCheckpoint or weight saving/loading.

@pGit1 Yea, what you have there should work. Or you can use keras.utils.multi_gpu_model and create a wrapper class:

from keras import Model
from keras.utils import multi_gpu_model


class ModelMGPU(Model):
    def __init__(self, ser_model, gpus):
        pmodel = multi_gpu_model(ser_model, gpus)
        self.__dict__.update(pmodel.__dict__)
        self._smodel = ser_model

    def __getattribute__(self, attrname):
        '''Override load and save methods to be used from the serial-model. The
        serial-model holds references to the weights in the multi-gpu model.
        '''
        # return Model.__getattribute__(self, attrname)
        if 'load' in attrname or 'save' in attrname:
            return getattr(self._smodel, attrname)

        return super(ModelMGPU, self).__getattribute__(attrname)

Then you can use your example above with this new class.

# model topology instantiation above
ser_model = keras.models.Model(inputs=x, outputs=out)
parallel_model = ModelMGPU(ser_model , 4)

#callback to save best weights
mod_wt_path = './best_weights.hdf5'
best_wts_callback = callbacks.ModelCheckpoint(mod_wt_path, save_weights_only=True, save_best_only=True)

# compile the parallel model prior to fit
parallel_model.fit(X, y, callbacks=[best_wts_callback])

#Now I want to infer on single GPU so I load saved weights ??
ser_model.load_weights(mod_wt_path)

# I think you might have to compile the serial model prior to predict
ser_model.predict(X_holdout)

@avolkov1

THANK YOU!! Your code works. To test I bypassed multi-gpu-model altogether.
I used raw code from https://github.com/avolkov1/keras_experiments/blob/master/keras_exp/multigpu/_multigpu.py#L129.

After training on a simple dummy data set, I call a function that returns two models (serial and parallel) and only choose the serial model. Keep in mind that during training I call the fit function with the parallel model, not the serial model. I also feed my best-weights callback to the parallel model during training.

Once this is done, I load the learned weights into the serial model and get the expected results without any errors. I am not entirely sure why this works, but it does. I confirmed multi-GPU training and single-GPU inference. Now I am going to clean up my code to do something like you outline above.

Thanks again for your help!!

EDIT: The cleaned up version where you wrap the multi-gpu-model class works flawlessly. This is definitely my preferred method. Thanks again for all of your help. Your code is an extremely valuable contribution.

EDIT on Jan 11, 2019
@avolkov1, I found the problem after I reported that I tried your approach and hit an issue with TensorFlow Keras 1.11. The mistake I made was to save the entire model with save_weights_only=False. As a result, the weights in the model are saved in a messed-up order that the Keras code cannot read.

I've tried to customize ModelCheckpoint; however, the optimizer states are not saved correctly and I'm unable to resume training properly. I'd suggest saving the template model every N epochs instead of using the checkpoint, and calling fit() every N epochs to resume training. It's the most mundane way, but I think it's the safest way to preserve the model's/optimizer's weights.

@fchollet @pGit1 @nicolefinnie @avolkov1, I solved the problem in the following way. I changed some lines in the core Keras code (particularly in topology.py or network.py, and callbacks.py). Here, I just modified the following code.
Reminder: you need to replace 'save_weights_to_hdf5_group' with 'saving.save_weights_to_hdf5_group(f, layers)' if you use a recent version of Keras.

Callbacks.py:

```python
class ModelCheckpoint(Callback):
    """Save the model after every epoch.

    `filepath` can contain named formatting options,
    which will be filled the value of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).

    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
    """

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1, multi_gpu_mode=False, name_of_model=None):
        super(ModelCheckpoint, self).__init__()
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        # Usually model_1; you can check the name by calling summary() after running multi_gpu_model
        self.name_of_model = name_of_model
        self.multi_gpu_mode = multi_gpu_mode
        self.period = period
        self.epochs_since_last_save = 0

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % (mode),
                          RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch + 1, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch + 1, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(filepath, overwrite=True,
                                                    multiple_gpu=self.multi_gpu_mode,
                                                    name_of_model=self.name_of_model)
                        else:
                            self.model.save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s did not improve from %0.5f' %
                                  (epoch + 1, self.monitor, self.best))
            else:
                if self.verbose > 0:
                    print('\nEpoch %05d: saving model to %s' % (epoch + 1, filepath))
                if self.save_weights_only:
                    self.model.save_weights(filepath, overwrite=True,
                                            multiple_gpu=self.multi_gpu_mode)
                else:
                    self.model.save(filepath, overwrite=True)
```

Topology.py/network.py:

def save_weights(self, filepath, overwrite=True, multiple_gpu=False, name_of_model=""):
    """Dumps all layer weights to a HDF5 file.

    The weight file has:
        - `layer_names` (attribute), a list of strings
            (ordered names of model layers).
        - For every layer, a `group` named `layer.name`
            - For every such layer group, a group attribute `weight_names`,
                a list of strings
                (ordered names of weights tensor of the layer).
            - For every weight in the layer, a dataset
                storing the weight value, named after the weight tensor.

    # Arguments
        filepath: String, path to the file to save the weights to.
        overwrite: Whether to silently overwrite any existing file at the
            target location, or provide the user with a manual prompt.

    # Raises
        ImportError: If h5py is not available.
    """
    if h5py is None:
        raise ImportError('`save_weights` requires h5py.')
    # If file exists and should not be overwritten:
    if not overwrite and os.path.isfile(filepath):
        proceed = ask_to_proceed_with_overwrite(filepath)
        if not proceed:
            return
    with h5py.File(filepath, 'w') as f:
        if multiple_gpu:
            layers = self.get_layer(name_of_model)
            layers = layers.layers
            save_weights_to_hdf5_group(f, layers)
        else:
            save_weights_to_hdf5_group(f, self.layers)
        f.flush()
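
A hypothetical usage of the patched ModelCheckpoint above might look like this; 'model_1' is the name of the template model inside the multi-GPU wrapper (check what summary() reports for your model):

```python
# Use the patched callback so that only the template model's weights are written.
checkpoint = ModelCheckpoint('weights.{epoch:02d}.hdf5',
                             monitor='val_loss',
                             save_best_only=True,
                             save_weights_only=True,
                             multi_gpu_mode=True,
                             name_of_model='model_1')
parallel_model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])
```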