Keras: does it automatically use multiple GPUs, if available?

Created on 10 May 2015 · 13 comments · Source: keras-team/keras

In Theano, we have to manually fork the process: https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs .

Is it possible to configure Keras so that a Sequential model can be trained (as one model) on multiple GPUs?

I am considering this because I have multiple GPUs on a single machine.

All 13 comments

This has never been tried with Keras.

It's not entirely clear what training a Keras model on multiple GPUs would imply. Would we do data parallelism? Model parallelism?

You are more than welcome to investigate these issues if that's something you're interested in : )
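
To make the distinction concrete, here is a tiny NumPy illustration of the two options (the shapes and weight names are invented, and only the forward pass is shown): data parallelism keeps a full copy of the weights on every device and splits the batch across them, while model parallelism splits the weights themselves across devices:

import numpy as np

# two-layer "model": W1 and W2 are its weights (shapes are made up)
W1 = np.random.randn(100, 50)
W2 = np.random.randn(50, 10)
x = np.random.randn(128, 100)   # one mini-batch of 128 samples

# data parallelism: every "device" holds a full copy of W1 and W2
# and processes its own slice of the batch
shards = np.array_split(x, 4)
outputs = [shard.dot(W1).dot(W2) for shard in shards]   # one forward pass per device
y_data_parallel = np.concatenate(outputs)

# model parallelism: the weights are split across devices instead
h = x.dot(W1)                    # "device 0" owns W1
y_model_parallel = h.dot(W2)     # "device 1" owns W2 and receives h from device 0

assert np.allclose(y_data_parallel, y_model_parallel)

The per-shard results match the full forward pass, which is what makes splitting the batch safe; the harder part, as the following comment shows, is keeping the weight updates in sync.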

@fyears @fchollet I have the following code for data parallelism for Keras. The idea is to manually synchronize the model (by averaging) from multiple GPUs after each training batch:

import warnings
import multiprocessing
import numpy as np
from datetime import datetime

# constants
NUM_GPU=4
NUM_EPOCH=100
MINI_BATCH=128

def train_model(gpu_id, data_queue, model_queue, num_epoch=NUM_EPOCH, num_batch=1):
    # bind this worker process to one GPU before the rest of Theano/Keras is imported
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(gpu_id)
    import theano
    import theano.tensor as T
    from keras.models import Sequential
    from keras.layers.convolutional import Convolution2D
    from demosaic_cnn import Crop2D, mean_absolute_error

    # define the model
    model=Sequential()
    # put your model definition here

    # compile the model
    model.compile(loss=mean_absolute_error, optimizer='RMSprop')

    # train the model
    best_loss=np.inf
    best_save='_'.join((gpu_id, datetime.now().strftime('%Y_%m_%d_%H_%M_%S.h5')))
    for epoch in range(num_epoch):
        print gpu_id, '@epoch', epoch
        for batch in range(num_batch):
            print gpu_id, '@batch', batch
            data=data_queue.get()
            loss=model.train_on_batch(data[0], data[1])
            # after each batch of data, synchronize the model across the GPUs
            model_weight=[layer.get_weights() for layer in model.layers]
            # send NUM_GPU-1 copies of our weights, one for every other worker
            for _ in range(1, NUM_GPU):
                model_queue[gpu_id].put(model_weight)
            for k in model_queue:
                if k==gpu_id:
                    continue
                # obtain the weights computed on the other GPU
                weight=model_queue[k].get()
                # sum them layer by layer
                for l, w in enumerate(weight):
                    model_weight[l]=[w1+w2 for w1, w2 in zip(model_weight[l], w)]
            # average and write the result back into the model
            for l, w in enumerate(model_weight):
                model.layers[l].set_weights([d/NUM_GPU for d in w])
        # after each epoch, save the weights if the loss improved
        if best_loss>loss:
            model.save_weights(best_save, overwrite=True)
            best_loss=loss
    model_queue[gpu_id].close()

if __name__=='__main__':
    data=[]
    label=[]
    num_data=len(data)
    gpu_list=['gpu{}'.format(i) for i in range(NUM_GPU)]
    # queue used to ship the training data to the workers
    data_queue=multiprocessing.Queue(20)
    # one queue per worker for synchronizing the model; each must be able to hold the
    # NUM_GPU-1 copies its worker sends out, otherwise the puts can deadlock
    model_queue={gpu_id: multiprocessing.Queue(NUM_GPU) for gpu_id in gpu_list}
    # pass train_model itself as the target; calling it here would run the training
    # in the main process instead of spawning the workers
    processes=[multiprocessing.Process(target=train_model,
                                       args=(gpu_id, data_queue, model_queue),
                                       kwargs={'num_batch': num_data//MINI_BATCH//NUM_GPU})
               for gpu_id in gpu_list]
    for process in processes:
        process.start()
    for epoch in range(NUM_EPOCH):
        print 'data@epoch', epoch
        for start in range(0, num_data, MINI_BATCH):
            print 'data@batch', start//MINI_BATCH
            data_queue.put((data[start:(start+MINI_BATCH)], label[start:(start+MINI_BATCH)]))
    data_queue.close()
    for process in processes:
        process.join()

@zhangtemplar, thanks for sharing the data parallelism code.

Are you sure that adding and averaging the weights needs to be done after each iteration of the batch loop instead of after the batch loop?

According to Simonyan & Zisserman,

After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.

UPDATE:
Just realised that you're already doing it in your code!

Just in case someone is interested, there is the platoon repo:

https://github.com/mila-udem/platoon

which implements ASGD and EASGD. It has been tested with Theano, but was made not to depend on Theano.
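
For context, EASGD does not average the replicas directly: each worker is pulled toward a central copy of the weights by an elastic force, and the center is pulled toward the workers. A rough NumPy sketch of one synchronization step, with invented names and a fixed elastic coefficient alpha (this is a reading of the paper, not platoon's actual API):

import numpy as np

alpha = 0.05          # elastic coefficient (eta * rho in the EASGD paper)
lr = 0.01

def easgd_sync(worker_weights, center_weights):
    """Pull each worker toward the center and move the center toward the workers."""
    new_workers = []
    total_pull = np.zeros_like(center_weights)
    for w in worker_weights:
        diff = w - center_weights
        new_workers.append(w - alpha * diff)   # worker is attracted to the center
        total_pull += alpha * diff             # center is attracted to the workers
    return new_workers, center_weights + total_pull

# usage: each worker takes its own gradient steps between synchronizations
center = np.zeros(10)
workers = [np.random.randn(10) for _ in range(4)]
workers = [w - lr * np.random.randn(10) for w in workers]   # local SGD steps with fake gradients
workers, center = easgd_sync(workers, center)

Because workers only exchange weights with the center, communication stays cheap and the replicas are allowed to drift apart between synchronizations.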

Thanks @nouiz, I'm checking it out.

How are the parameters synchronised? Are you averaging them?

I think it is better that you ask that question there.

Considering that most optimizers have adaptive step sizes, it seems wrong to assume that model averaging is in general equivalent to gradient averaging (as done in Simonyan & Zisserman). Wouldn't it be a better idea to change Keras' architecture a bit, so that optimizers are given the gradient as an argument instead of calling get_gradients() themselves? That would simplify distributing gradient computation a lot.
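
A small NumPy check of that point, with made-up gradients and a simplified single-step RMSprop update (names and constants are illustrative only): with plain SGD, averaging the updated weights is identical to one step with the averaged gradient, but with per-replica adaptive scaling it is not:

import numpy as np

np.random.seed(0)
w0 = np.random.randn(5)                     # shared starting weights
g = [np.random.randn(5) for _ in range(4)]  # per-GPU gradients for one batch
lr = 0.1

# plain SGD: averaging the updated weights == one step with the averaged gradient
w_avg_models = np.mean([w0 - lr * gi for gi in g], axis=0)
w_avg_grads = w0 - lr * np.mean(g, axis=0)
print(np.allclose(w_avg_models, w_avg_grads))   # True

# RMSprop-style scaling: each replica rescales its own gradient, so the two differ
def rmsprop_step(w, grad, cache, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps)

w_avg_models = np.mean([rmsprop_step(w0, gi, np.zeros(5)) for gi in g], axis=0)
w_avg_grads = rmsprop_step(w0, np.mean(g, axis=0), np.zeros(5))
print(np.allclose(w_avg_models, w_avg_grads))   # False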

@zhangtemplar, thank you for the code. When I run your code, it shows:
gpu0 @epoch 0 gpu0 @batch 0

and remains there forever!
Do you have any idea why it does not assign the other GPUs to other batches?
FYI, I have five GPUs on my machine.

Thanks.

@saliakbarian Sorry, I don't know the reason, as I have switched to TensorFlow recently.

@zhangtemplar Hi, if you are using Keras + TensorFlow, can you share some sample code for training with multiple GPUs? Thank you.

Hi @WeiNiu, no, I use TensorFlow alone, which supports multiple GPUs natively. But I think its multi-GPU efficiency is not very good.

@zhangtemplar Thank you for your code. I really want to do data parallelism in Keras with TensorFlow; do you have any idea how to achieve that with multiple GPUs on a single machine? If you are using TensorFlow alone, could you please give more details? I found the official docs unclear. Thank you.

@pengpaiSH I think it is easier to do multi-GPU training with TensorFlow, where you can pin operations to a specific GPU with a device scope. Please refer to (https://www.tensorflow.org/versions/r0.10/how_tos/using_gpu/index.html) for more details.
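
For anyone landing here, a minimal sketch of data parallelism with the TensorFlow 1.x-era graph API linked above (two GPUs assumed, and build_model is a stand-in for your own tower): each GPU computes gradients on its share of the batch, and the averaged gradients are applied once:

import tensorflow as tf

NUM_GPU = 2

def build_model(x, y):
    # stand-in for a real tower: one dense layer + squared error
    w = tf.get_variable('w', shape=[100, 1])
    pred = tf.matmul(x, w)
    return tf.reduce_mean(tf.square(pred - y))

opt = tf.train.GradientDescentOptimizer(0.01)
x = tf.placeholder(tf.float32, [None, 100])
y = tf.placeholder(tf.float32, [None, 1])
# split the incoming batch across the GPUs (data parallelism)
x_shards = tf.split(x, NUM_GPU)
y_shards = tf.split(y, NUM_GPU)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPU):
        with tf.device('/gpu:%d' % i):
            loss = build_model(x_shards[i], y_shards[i])
            tower_grads.append(opt.compute_gradients(loss))
            # share one set of variables between all towers
            tf.get_variable_scope().reuse_variables()

# average the per-tower gradients and apply a single update
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grad_and_vars]
    avg_grads.append((tf.add_n(grads) / NUM_GPU, grad_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)

# run with: sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
#           sess.run(tf.global_variables_initializer())
#           sess.run(train_op, feed_dict={x: batch_x, y: batch_y})

Because the gradients are averaged before a single apply step, this is the synchronous scheme described in the Simonyan & Zisserman quote earlier in the thread, rather than the weight-averaging approach from the multiprocessing example.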
