Keras: does it automatically use multiple GPUs, if available?

Created on 10 May 2015 · 13 comments · Source: keras-team/keras

In Theano, we have to manually fork the process: https://github.com/Theano/Theano/wiki/Using-Multiple-GPUs .

Is it possible to configure Keras so that a Sequential model can be trained (as one model) on multiple GPUs?

I am considering this because I have multiple GPUs on a single machine.

All 13 comments

This has never been tried with Keras.

It's not entirely clear what training a Keras model on multiple GPUs would imply. Would we do data parallelism? Model parallelism?

You are more than welcome to investigate these issues if that's something you're interested in : )
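
To make the distinction concrete, here is a tiny NumPy illustration of the two options (the shapes and weight names are invented, and only the forward pass is shown): data parallelism keeps a full copy of the weights on every device and splits the batch across them, while model parallelism splits the weights themselves across devices:

import numpy as np

# two-layer "model": W1 and W2 are its weights (shapes are made up)
W1 = np.random.randn(100, 50)
W2 = np.random.randn(50, 10)
x = np.random.randn(128, 100)   # one mini-batch of 128 samples

# data parallelism: every "device" holds a full copy of W1 and W2
# and processes its own slice of the batch
shards = np.array_split(x, 4)
outputs = [shard.dot(W1).dot(W2) for shard in shards]   # one forward pass per device
y_data_parallel = np.concatenate(outputs)

# model parallelism: the weights are split across devices instead
h = x.dot(W1)                    # "device 0" owns W1
y_model_parallel = h.dot(W2)     # "device 1" owns W2 and receives h from device 0

assert np.allclose(y_data_parallel, y_model_parallel)

The per-shard results match the full forward pass, which is what makes splitting the batch safe; the harder part, as the following comment shows, is keeping the weight updates in sync.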

@fyears @fchollet I have the following code for data parallelism for Keras. The idea is to manually synchronize the model (by averaging) from multiple GPUs after each training batch:

import warnings
import multiprocessing
import numpy as np
from datetime import datetime

# constants
NUM_GPU=4
NUM_EPOCH=100
MINI_BATCH=128

def train_model(gpu_id, data_queue, model_queue, num_epoch=NUM_EPOCH, num_batch=1):
    # bind this worker process to one GPU before the rest of Theano/Keras is imported
    import theano.sandbox.cuda
    theano.sandbox.cuda.use(gpu_id)
    import theano
    import theano.tensor as T
    from keras.models import Sequential
    from keras.layers.convolutional import Convolution2D
    from demosaic_cnn import Crop2D, mean_absolute_error

    # define the model
    model=Sequential()
    # put your model definition here

    # compile the model
    model.compile(loss=mean_absolute_error, optimizer='RMSprop')

    # train the model
    best_loss=np.inf
    best_save='_'.join((gpu_id, datetime.now().strftime('%Y_%m_%d_%H_%M_%S.h5')))
    for epoch in range(num_epoch):
        print gpu_id, '@epoch', epoch
        for batch in range(num_batch):
            print gpu_id, '@batch', batch
            data=data_queue.get()
            loss=model.train_on_batch(data[0], data[1])
            # after each batch of data, synchronize the model across the GPUs
            model_weight=[layer.get_weights() for layer in model.layers]
            # send NUM_GPU-1 copies of our weights, one for every other worker
            for _ in range(1, NUM_GPU):
                model_queue[gpu_id].put(model_weight)
            for k in model_queue:
                if k==gpu_id:
                    continue
                # obtain the weights computed on the other GPU
                weight=model_queue[k].get()
                # sum them layer by layer
                for l, w in enumerate(weight):
                    model_weight[l]=[w1+w2 for w1, w2 in zip(model_weight[l], w)]
            # average and write the result back into the model
            for l, w in enumerate(model_weight):
                model.layers[l].set_weights([d/NUM_GPU for d in w])
        # after each epoch, save the weights if the loss improved
        if best_loss>loss:
            model.save_weights(best_save, overwrite=True)
            best_loss=loss
    model_queue[gpu_id].close()

if __name__=='__main__':
    data=[]
    label=[]
    num_data=len(data)
    gpu_list=['gpu{}'.format(i) for i in range(NUM_GPU)]
    # queue used to ship the training data to the workers
    data_queue=multiprocessing.Queue(20)
    # one queue per worker for synchronizing the model; each must be able to hold the
    # NUM_GPU-1 copies its worker sends out, otherwise the puts can deadlock
    model_queue={gpu_id: multiprocessing.Queue(NUM_GPU) for gpu_id in gpu_list}
    # pass train_model itself as the target; calling it here would run the training
    # in the main process instead of spawning the workers
    processes=[multiprocessing.Process(target=train_model,
                                       args=(gpu_id, data_queue, model_queue),
                                       kwargs={'num_batch': num_data//MINI_BATCH//NUM_GPU})
               for gpu_id in gpu_list]
    for process in processes:
        process.start()
    for epoch in range(NUM_EPOCH):
        print 'data@epoch', epoch
        for start in range(0, num_data, MINI_BATCH):
            print 'data@batch', start//MINI_BATCH
            data_queue.put((data[start:(start+MINI_BATCH)], label[start:(start+MINI_BATCH)]))
    data_queue.close()
    for process in processes:
        process.join()

@zhangtemplar, thanks for sharing the data parallelism code.

Are you sure that adding and averaging the weights needs to be done after each iteration of the batch loop instead of after the batch loop?

According to Simonyan & Zisserman,

After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU.

UPDATE:
Just realised that you're already doing it in your code!

Just in case someone is interested, there is the platoon repo:

https://github.com/mila-udem/platoon

which implements ASGD and EASGD. It has been tested with Theano, but was made not to depend on Theano.
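
For context, EASGD does not average the replicas directly: each worker is pulled toward a central copy of the weights by an elastic force, and the center is pulled toward the workers. A rough NumPy sketch of one synchronization step, with invented names and a fixed elastic coefficient alpha (this is a reading of the paper, not platoon's actual API):

import numpy as np

alpha = 0.05          # elastic coefficient (eta * rho in the EASGD paper)
lr = 0.01

def easgd_sync(worker_weights, center_weights):
    """Pull each worker toward the center and move the center toward the workers."""
    new_workers = []
    total_pull = np.zeros_like(center_weights)
    for w in worker_weights:
        diff = w - center_weights
        new_workers.append(w - alpha * diff)   # worker is attracted to the center
        total_pull += alpha * diff             # center is attracted to the workers
    return new_workers, center_weights + total_pull

# usage: each worker takes its own gradient steps between synchronizations
center = np.zeros(10)
workers = [np.random.randn(10) for _ in range(4)]
workers = [w - lr * np.random.randn(10) for w in workers]   # local SGD steps with fake gradients
workers, center = easgd_sync(workers, center)

Because workers only exchange weights with the center, communication stays cheap and the replicas are allowed to drift apart between synchronizations.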

Thanks @nouiz, I'm checking it out.

How are the parameters synchronised? Are you averaging them?

I think it is better that you ask that question there.

Considering that most optimizers have adaptive step sizes, it seems wrong to assume that model averaging is in general equivalent to gradient averaging (as done in Simonyan & Zisserman). Wouldn't it be a better idea to change Keras' architecture a bit, so that optimizers are given the gradient as an argument instead of calling get_gradients() themselves? That would simplify distributing gradient computation a lot.
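
A small NumPy check of that point, with made-up gradients and a simplified single-step RMSprop update (names and constants are illustrative only): with plain SGD, averaging the updated weights is identical to one step with the averaged gradient, but with per-replica adaptive scaling it is not:

import numpy as np

np.random.seed(0)
w0 = np.random.randn(5)                     # shared starting weights
g = [np.random.randn(5) for _ in range(4)]  # per-GPU gradients for one batch
lr = 0.1

# plain SGD: averaging the updated weights == one step with the averaged gradient
w_avg_models = np.mean([w0 - lr * gi for gi in g], axis=0)
w_avg_grads = w0 - lr * np.mean(g, axis=0)
print(np.allclose(w_avg_models, w_avg_grads))   # True

# RMSprop-style scaling: each replica rescales its own gradient, so the two differ
def rmsprop_step(w, grad, cache, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(cache) + eps)

w_avg_models = np.mean([rmsprop_step(w0, gi, np.zeros(5)) for gi in g], axis=0)
w_avg_grads = rmsprop_step(w0, np.mean(g, axis=0), np.zeros(5))
print(np.allclose(w_avg_models, w_avg_grads))   # False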

@zhangtemplar, thank you for the code. When I run your code, it shows:
gpu0 @epoch 0 gpu0 @batch 0

and remains there forever!
Do you have any idea why it does not assign the other GPUs to other batches?
FYI, I have five GPUs on my machine.

Thanks.

@saliakbarian Sorry, I don't know the reason, as I have switched to TensorFlow recently.

@zhangtemplar Hi, if you are using Keras + TensorFlow, can you share some sample code for training with multiple GPUs? Thank you.

Hi @WeiNiu, no, I use TensorFlow alone, which supports multiple GPUs natively. But I think its multi-GPU efficiency is not very good.

@zhangtemplar Thank you for your code. I really want to do data parallelism in Keras with TensorFlow; do you have any idea how to achieve that with multiple GPUs on a single machine? If you are using TensorFlow alone, could you please give more details? I found the official docs unclear. Thank you.

@pengpaiSH I think it is easier to do multi-GPU training with TensorFlow, where you can pin operations to a specific GPU with a device scope. Please refer to (https://www.tensorflow.org/versions/r0.10/how_tos/using_gpu/index.html) for more details.
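
For anyone landing here, a minimal sketch of data parallelism with the TensorFlow 1.x-era graph API linked above (two GPUs assumed, and build_model is a stand-in for your own tower): each GPU computes gradients on its share of the batch, and the averaged gradients are applied once:

import tensorflow as tf

NUM_GPU = 2

def build_model(x, y):
    # stand-in for a real tower: one dense layer + squared error
    w = tf.get_variable('w', shape=[100, 1])
    pred = tf.matmul(x, w)
    return tf.reduce_mean(tf.square(pred - y))

opt = tf.train.GradientDescentOptimizer(0.01)
x = tf.placeholder(tf.float32, [None, 100])
y = tf.placeholder(tf.float32, [None, 1])
# split the incoming batch across the GPUs (data parallelism)
x_shards = tf.split(x, NUM_GPU)
y_shards = tf.split(y, NUM_GPU)

tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
    for i in range(NUM_GPU):
        with tf.device('/gpu:%d' % i):
            loss = build_model(x_shards[i], y_shards[i])
            tower_grads.append(opt.compute_gradients(loss))
            # share one set of variables between all towers
            tf.get_variable_scope().reuse_variables()

# average the per-tower gradients and apply a single update
avg_grads = []
for grad_and_vars in zip(*tower_grads):
    grads = [g for g, _ in grad_and_vars]
    avg_grads.append((tf.add_n(grads) / NUM_GPU, grad_and_vars[0][1]))
train_op = opt.apply_gradients(avg_grads)

# run with: sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
#           sess.run(tf.global_variables_initializer())
#           sess.run(train_op, feed_dict={x: batch_x, y: batch_y})

Because the gradients are averaged before a single apply step, this is the synchronous scheme described in the Simonyan & Zisserman quote earlier in the thread, rather than the weight-averaging approach from the multiprocessing example.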
