Hi. I have 10 different data sets, and I want to train 10 models, one on each data set.
The models have the same layers, and my code is as follows:
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

model_sequence = Sequential()
model_sequence.add(Masking(mask_value=2, input_shape=(input_length, input_dim)))
model_sequence.add(LSTM(150))
model_sequence.add(Dense(20))
My GPU is a GTX TITAN X. I wonder: can I train the 10 models at the same time on the GPU? If I can, how should I write the code?
Thank you very much for your assistance!
@fchollet @tboquet @EderSantana Can you help me?
Multi-GPU support is still in progress. If the models are different (no gradient sharing) and the data is different, what kind of parallelism are you expecting?
@farizrahman4u Thank you for your reply. Actually, I have only one GPU. What I want is to train 10 models simultaneously to reduce the total training time. I have tried running several code files together, but training is much slower than when I run each code file individually.
So I wonder: is there any method to reduce the total training time of the 10 models?
Here are some speed-ups for the GPU:
The Theano flag lib.cnmem sets the fraction of GPU memory a process allocates. Make sure the fractions allocated by your concurrent processes sum to no more than 1 (though I took the habit of keeping the total below 0.75).
Even then, each model will train more slowly than if it were run individually. You then have to see for how many models it is still worth running them at the same time (I would say 2 or 3), or decide to just run them sequentially.
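As a minimal sketch of setting that flag (assuming the Theano backend, where lib.cnmem must be set before theano is imported; the 0.3 fraction is an assumption sized so that three concurrent processes stay under 1):

import os
# must run before keras/theano are imported
os.environ['THEANO_FLAGS'] = 'device=gpu0,lib.cnmem=0.3'
import keras  # imports theano with the flags above

Each concurrent training script would cap its own memory fraction this way.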
In that case I think it's best to train them one after the other.
You might just want to try using the functional API to make 10 models and then compile them into one model:
in_1 = Input(shape=(...))
lstm_1 = LSTM(...)(in_1)
out_1 = Dense(...)(lstm_1)

in_2 = Input(shape=(...))
lstm_2 = LSTM(...)(in_2)
out_2 = Dense(...)(lstm_2)

model_1 = Model(input=in_1, output=out_1)
model_2 = Model(input=in_2, output=out_2)

model = Model(input=[in_1, in_2], output=[out_1, out_2])
model.compile(...)
model.fit(...)

model_1.predict(...)
model_2.predict(...)
I haven't tried it, but in principle it should work, although it might warn about disconnected inputs.
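A runnable sketch of this pattern, generalized to the 10 models from the original question (Keras 1 functional API; the layer sizes are copied from the question, while x_list, y_list, the optimizer, and the loss are assumptions):

from keras.layers import Input, Masking, LSTM, Dense
from keras.models import Model

inputs, outputs, submodels = [], [], []
for i in range(10):
    # input_length and input_dim as in the question above
    x = Input(shape=(input_length, input_dim))
    h = Masking(mask_value=2)(x)
    h = LSTM(150)(h)
    y = Dense(20)(h)
    inputs.append(x)
    outputs.append(y)
    # per-data-set model; shares weights with the combined model
    submodels.append(Model(input=x, output=y))

combined = Model(input=inputs, output=outputs)
combined.compile(optimizer='rmsprop', loss='mse')
# x_list and y_list are assumed to be lists of 10 arrays, one per data set
combined.fit(x_list, y_list, nb_epoch=10)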
@Yingyingzhang15
Man, I'm training the same kind of model right now. Mine has different numbers of parameters (input and output dimensions), but it is the same approach @lemuriandezapada described.
You need a separate Input and a separate output for each model. Finally, you merge them into lists (one list of inputs, one list of outputs) and pass those to fit or fit_generator (you need to write a custom generator). I used this:
Data format
import numpy as np

# one block of inputs/targets per sub-model
# (for the reshape to be valid, PixCount must equal Ysize * Xsize)
X = np.zeros(count * PixCount * RGBCount * num_out_parameters).reshape(
    num_out_parameters, count, RGBCount, Ysize, Xsize)
X = X.astype('float32')
Y = np.zeros(count * nb_classes * num_out_parameters).reshape(
    num_out_parameters, count, nb_classes)
where _num_out_parameters_ is the number of parallel sub-models (and hence of inputs and matching outputs)
Generator call
model.fit_generator(util.flow(X, Y, batch_size=cfg.batch_size),
                    samples_per_epoch=cfg.spe,
                    nb_epoch=cfg.nb_epoch,
                    validation_data=([X_test for i in range(cfg.num_out_parameters)],
                                     [Y_test[cfg.start_parameter + i] for i in range(cfg.num_out_parameters)]),
                    callbacks=[best_model, best_model_ep, TensorBoard(cfg.tmp_file + now)])
Flow function
from keras.preprocessing.image import ImageDataGenerator

def flow(X, Y, batch_size):
    # preprocessing and realtime data augmentation; build the generator
    # once rather than on every pass through the loop
    datagen = ImageDataGenerator(
        featurewise_center=False,             # set input mean to 0 over the dataset
        samplewise_center=False,              # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,   # divide each input by its std
        zca_whitening=False,                  # apply ZCA whitening
        rotation_range=5,                     # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.05,               # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.05,              # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,                 # randomly flip images horizontally
        vertical_flip=False)                  # do not flip images vertically
    while 1:
        Xn = []
        Yn = []
        for i in range(X.shape[0]):
            datagen.fit(X[i])  # a no-op here, since all featurewise options are off
            batches = datagen.flow(X[i], Y[i], batch_size=batch_size)
            for batch in batches:  # take one batch per sub-model
                Xn.append(batch[0])
                Yn.append(batch[1])
                break
        yield Xn, Yn
And finally, when you get multiple GPUs, you can easily parallelize this model like this:
import tensorflow as tf

InL, OutL = [], []
for i in range(cfg.num_out_parameters):
    device = '/gpu:%d' % i  # gpu:0, gpu:1, etc.
    with tf.device(device):
        InList, OutList = rn.residual_model(...)
        InL.append(InList)
        OutL.append(OutList)
# build the combined model on the CPU
with tf.device('/cpu:0'):
    model = Model(input=InL, output=OutL)
Maybe it isn't ideal code (I'm a noob at NNs), but it works!
@vrodionovpro Thank you so much for your explanation! :)
Only somewhat related to this topic (my apologies), but this stackoverflow question could use an answer: http://stackoverflow.com/questions/40850089/is-keras-thread-safe
@lemuriandezapada Creating a parallel model is a smart move, but it removes some randomness, which might matter when training an ensemble (which I guess is why people want this anyway). Training an ensemble can improve performance thanks to some heterogeneity among the models; if all (sub)models come up with exactly the same result, the ensemble is useless.
The batches fed to each network will be randomized, but identically for every model, so the only symmetry-breaking you have comes from the random weight initializations. This might have some nasty consequences for the distribution of the trained models. It might work 'better' if you also create n reshuffled data sets, even though there will still be some correlation between the batches of different models across epochs. Maybe one would need to create an infinite batch generator?
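A minimal sketch of such an infinite generator (assuming all data sets have the same length; the function and variable names are made up for illustration), giving each sub-model an independently reshuffled stream of batches:

import numpy as np

def independent_batches(X_list, Y_list, batch_size):
    # X_list/Y_list: one array per sub-model, all of equal length
    n_models = len(X_list)
    n_samples = len(X_list[0])
    while True:
        # a fresh, independent permutation per sub-model breaks the
        # batch-order symmetry between the ensemble members
        perms = [np.random.permutation(n_samples) for _ in range(n_models)]
        for start in range(0, n_samples, batch_size):
            idx = [p[start:start + batch_size] for p in perms]
            yield ([X_list[i][idx[i]] for i in range(n_models)],
                   [Y_list[i][idx[i]] for i in range(n_models)])

Something like this could be passed to fit_generator in place of the flow function above.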
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
Has there been a solution? @vrodionovpro's response is good, but it is now dated and I'm not sure where the parallelism comes in. There are some loose solutions if you're okay with combining models into one, but I'd be looking for a multi-threading-esque way to train several models concurrently.
This link is also close, but it's more about maximizing one GPU than using several, if it is in fact possible to split one GPU into concurrent processes like that (I'm not sure).
(quoting @lemuriandezapada's earlier suggestion to use the functional API to make 10 models and compile them into one model)
This approach works. However
model_1 = Model(input=in_1, output=out_1)
model_2 = Model(input=in_2, output=out_2)
and
model_1.predict(...)
model_2.predict(...)
are not necessary. Simply use model.predict() or model.evaluate() instead. Keep in mind to pass the data for each model individually (one array per input, in the order of the model's inputs).
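For example, a small sketch of what that looks like (x_1, x_2, y_1, y_2 are hypothetical per-model arrays):

# predictions come back as one array per output, in input order
preds_1, preds_2 = model.predict([x_1, x_2])
# evaluation likewise takes one input array and one target array per sub-model
scores = model.evaluate([x_1, x_2], [y_1, y_2])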