Keras: [BUG] Save Model with multi_gpu

Created on 12 Oct 2017 · 58 comments · Source: keras-team/keras

Hey guys,

The new multi_gpu feature seems to have a bug: if you try to save the model, you get an error like the one below. To reproduce, just run the test multi_gpu_test_simple_model() with parallel_model.save("logs/model.h5") added at the end.

def multi_gpu_test_simple_model():
    print('####### test simple model')
    num_samples = 1000
    input_dim = 10
    output_dim = 1
    hidden_dim = 10
    gpus = 4
    epochs = 2
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(hidden_dim,
                                 input_shape=(input_dim,)))
    model.add(keras.layers.Dense(output_dim))

    x = np.random.random((num_samples, input_dim))
    y = np.random.random((num_samples, output_dim))
    parallel_model = multi_gpu_model(model, gpus=gpus)

    parallel_model.compile(loss='mse', optimizer='rmsprop')
    parallel_model.fit(x, y, epochs=epochs)

    parallel_model.save("logs/model.h5")


multi_gpu_test_simple_model()

1000/1000 [==============================] - ETA: 0s - loss: 0.4537
Epoch 2/2
1000/1000 [==============================] - ETA: 0s - loss: 0.2939
Traceback (most recent call last):
File "steps_kt/test.py", line 43, in
multi_gpu_test_simple_model()
File "steps_kt/test.py", line 40, in multi_gpu_test_simple_model
parallel_model.save("logs/model.h5")
File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2555, in save
File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/models.py", line 107, in save_model
File "/home/y0052080/pyenv/lib/python3.6/site-packages/Keras-2.0.8-py3.6.egg/keras/engine/topology.py", line 2396, in get_config
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 215, in _deepcopy_list
append(deepcopy(a, memo))
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in
y = [deepcopy(a, memo) for a in x]
File "/cluster/tools/python3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/cluster/tools/python3/lib/python3.6/copy.py", line 220, in
y = [deepcopy(a, memo) for a in x]
File "/cluster/tools/python3/lib/python3.6/copy.py", line 169, in deepcopy
rv = reductor(4)
TypeError: can't pickle module objects

Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

Most helpful comment

I just faced the same issue here. In https://keras.io/utils/#multi_gpu_model it is clearly stated that the model can be used like a normal model, but it cannot be saved; very funny. I can't even resume training, just because I cannot save the model previously trained with multiple GPUs. If I train with a single GPU, the rest of my invested GPUs become useless. Please urge the developers to look into this bug ASAP.

All 58 comments

The import tensorflow as tf inside multi_gpu_model causes the error.

I had a similar problem using ModelCheckpoint to save the best models. rmkemker suggested a multi-GPU version of ModelCheckpoint here, which worked for me.

I think this is a different issue. I have a function for multi-GPU training which is pretty similar to the one in Keras, so it was easy to find the cause.
It works if you comment out the tensorflow import inside the function and import it outside the function instead.

Pickle tries to dump a tensorflow object?! Something like that.

I'm not sure how to solve it for general use, though. We would need to import tensorflow outside the function, but that would be a problem for people using a different backend.

I have the same problem when trying to call model.save('..') on a _parallel model_. I also first discovered it while using the ModelCheckpoint callback to save results between epochs. However, if I use the option ModelCheckpoint(..., save_weights_only=True), so that only model.save_weights() is called, it seems to work.
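A minimal sketch of that workaround, reusing the variables from the repro script above (the filepath pattern is a placeholder of my own):

from keras.callbacks import ModelCheckpoint

# Saving only the weights skips get_config()/deepcopy on the parallel
# model, which is the step that raises "can't pickle module objects".
checkpoint = ModelCheckpoint('logs/weights.{epoch:02d}.h5',
                             save_weights_only=True)

parallel_model.fit(x, y, epochs=epochs, callbacks=[checkpoint])

Note that, as later comments point out, loading those weights back into a plain (non-parallel) model can still be tricky, so this only covers the saving half.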

I encountered this problem, too.

me too

I just faced the same issue here. In https://keras.io/utils/#multi_gpu_model it is clearly stated that the model can be used like a normal model, but it cannot be saved; very funny. I can't even resume training, just because I cannot save the model previously trained with multiple GPUs. If I train with a single GPU, the rest of my invested GPUs become useless. Please urge the developers to look into this bug ASAP.

This is related to #8253

Hi. It seems I've found the solution for my case. Just compile the base model, then transfer the trained weights of the GPU model back to the base model; it can then be saved as usual and performs like the GPU model, voilà!

autoencoder.compile()  # since the GPU model is compiled, now only compile the base model
output = autoencoder.predict(img)  # the output will be a mess since only the GPU model is trained, not the base model
output = parallel_autoencoder.predict(img)  # the output is a clear image from the well-trained GPU model
autoencoder.set_weights(parallel_autoencoder.get_weights())  # transfer the trained weights from the GPU model to the base model
output = autoencoder.predict(img)  # perform the prediction again; the result is similar to the GPU model's
autoencoder.save('CAE.h5')  # now the model can be saved with the weights transferred from the GPU model

The saved model can be loaded and modified as usual.
Hope it helps.

Well, none of the answers above helps at all.

Take a look at the answer from fchollet:
https://github.com/fchollet/keras/issues/8446

He said "For now we recommend saving the original (template) model instead of the parallel model. I.e. call save on the model you passed to multi_gpu_model, not the model returned by it. Both models share the same weights."

This is my example code. Please note that:
model -> the template model
gpu_model -> the multi_gpu_model
They are different.

# ------------- model ----------------------------
model = Sequential()
model.add(Conv2D(32, (5, 5), padding='same',
                 input_shape=x_train.shape[1:]))
model.add(Activation('relu'))
model.add(Conv2D(32, (5, 5)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

# ------------------- pass the template model to the GPUs ---------------------
if ngpus > 1:
    gpu_model = multi_gpu_model(model, ngpus)
else:
    gpu_model = model  # fall back to the template model on a single GPU

gpu_model.compile(loss='categorical_crossentropy',
                  optimizer='adadelta',
                  metrics=['accuracy'])

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')

datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=0,  # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,  # randomly flip images
        vertical_flip=False)  # randomly flip images

datagen.fit(x_train)

# Fit the model on the batches generated by datagen.flow().
gpu_model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                        steps_per_epoch=int(np.ceil(x_train.shape[0] / float(batch_size))),
                        epochs=nb_epoch,
                        validation_data=(x_test, y_test),
                        workers=2)

score = gpu_model.evaluate(x_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

# ------------ save the template model rather than the gpu_model ----------------
# serialize model to JSON
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("model.h5")
print("Saved model to disk")

# -------------- load the saved model --------------
from keras.models import model_from_json

# load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded model from disk")

# evaluate loaded model on test data
loaded_model.compile(loss='categorical_crossentropy',
                     optimizer='adadelta',
                     metrics=['accuracy'])
score = loaded_model.evaluate(x_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

@Weixing-Zhang Thanks a lot. This works!

@Weixing-Zhang's answer did not work for me. I was trying to save the weights with a callback function, and while it did save the weights during training, when I loaded them I got an error basically saying that I was trying to load 1 weight into 34.
Did I do something wrong? Probably. But just a warning to everyone: if you load your weights by name, you won't see the error I had (if you have it), and the previous answers will look like they are working (with disastrous predictions).
Anyway, here is the code I use for the callback (a mix of various previous answers); it may be of use to someone:

class CustomModelCheckpoint(Callback):

    def __init__(self, model_parallel, path):

        super().__init__()

        self.model = model_parallel
        self.path = path

        # Put your model here
        self.model_for_saving = SSD(num_classes=(NUM_CLASSES), weights='../data/weights/weights_300x300_old.hdf5')

    def on_epoch_end(self, epoch, logs=None):

        loss = logs['val_loss']
        self.model_for_saving.set_weights(self.model.get_weights())

        print("\nSaving model to : {}".format(self.path.format(epoch=epoch, val_loss=loss)))
        self.model_for_saving.save_weights(self.path.format(epoch=epoch, val_loss=loss), overwrite=True)

# Setting the callback functions
checkpointsString = "path/to/save/" + 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
callbacks = [CustomModelCheckpoint(model_parallel, checkpointsString), keras.callbacks.LearningRateScheduler(schedule)]

history = model.fit_generator(...,
                              callbacks=callbacks,
                              ...)

@D3lt4lph4
Could you try the code I just added for loading the saved model and weights? Sorry for the confusion.

@D3lt4lph4 Hey, I am also getting a similar error while trying to add multiple callbacks to multi_gpu_model. I would suggest you have a look at this issue, which I raised: #8764. I am also approaching the problem the way you have tried. Do you have any progress or suggestions at this point?

Sorry for the delay in the answer, I was a bit busy with other problems upstream in my pipeline.

@Weixing-Zhang I just tested your code without the callback to save the weights and it works perfectly fine. I'll try your code inside a callback function again; maybe I missed something last time.

@nbansal90
The code I posted works for me (sorry if I wasn't clear on that). In your case, does each callback function work correctly on its own?

@D3lt4lph4 Thanks for your prompt reply! Actually I tried the same approach, writing callbacks defined in a separate class. I would request you to go through this issue: https://github.com/keras-team/keras/issues/8764. In it I have multiple callbacks defined. What I observe is that with one callback it works fine, but when I declare multiple callbacks I get an error.
Since you say it is working for you, maybe you could try your code with more than two callbacks and see if it still works. If you think I have made a mistake, please correct me.

Okay, so I think I know why @Weixing-Zhang's solution wasn't working with the checkpoints.

I did a little digging in the Keras GitHub repository, and it seems that when the call to fit_generator is made, the model in the callback is set to the model making the call to fit_generator. So even if the correct model is set beforehand when creating the callback, it will be overwritten by the multi-GPU one.

So here is the modified version of the previous class:

class CustomModelCheckpoint(Callback):

    def __init__(self, model, path):

        super().__init__()

        # This is the attribute that will be modified by fit_generator
        # self.model = model
        self.path = path

        # We set the model (non multi-GPU) under another name
        self.model_for_saving = model

    def on_epoch_end(self, epoch, logs=None):

        loss = logs['val_loss']
        # Here we save the original one
        print("\nSaving model to : {}".format(self.path.format(epoch=epoch, val_loss=loss)))
        self.model_for_saving.save_weights(self.path.format(epoch=epoch, val_loss=loss), overwrite=True)

# Setting the callback functions
checkpointsString = args.checkpoints + 'weights.{epoch:02d}-{val_loss:.2f}.hdf5'
callbacks = [CustomModelCheckpoint(model, checkpointsString), keras.callbacks.LearningRateScheduler(schedule)]

gpu_model.compile(...)

# The call here will use the set_model() function of the callback to set the model, but since we do not use this model for saving, all good
history = gpu_model.fit_generator(...) 

I guess this is a bug in multi_gpu_model (and not a nice one to fix, from what I see of the Keras code).

@D3lt4lph4 That's a great find! I will try this setup. But isn't it kind of weird that all the other callbacks (apart from ModelCheckpoint) work perfectly fine, with or without fit_generator, as I have verified? Btw, thank you for the setup, I will try it out!

@nbansal90 I don't really like to guess, but from what I read it would make sense for it to be a problem with the spreading of the weights across different GPUs.
The loggers access data calculated at each epoch, so it makes sense for them not to throw any errors. The same goes for the update functions: the rates are shared, so it is logical (probably) for them not to throw errors either.
Now, when you save the weights you do not want the spread-out ones, and maybe the multi-GPU model has no idea what the final weights look like, thus causing the error at some point (again, pure guessing).
I'll try to take a look at the Keras implementation when I have some time.

I've found the workaround. See this StackOverflow answer for details. The code for multi_gpu_model:

from keras.layers import Lambda, concatenate
from keras import Model

import tensorflow as tf

def multi_gpu_model(model, gpus):
  if isinstance(gpus, (list, tuple)):
    num_gpus = len(gpus)
    target_gpu_ids = gpus
  else:
    num_gpus = gpus
    target_gpu_ids = range(num_gpus)

  def get_slice(data, i, parts):
    shape = tf.shape(data)
    batch_size = shape[:1]
    input_shape = shape[1:]
    step = batch_size // parts
    if i == num_gpus - 1:
      size = batch_size - step * i
    else:
      size = step
    size = tf.concat([size, input_shape], axis=0)
    stride = tf.concat([step, input_shape * 0], axis=0)
    start = stride * i
    return tf.slice(data, start, size)

  all_outputs = []
  for i in range(len(model.outputs)):
    all_outputs.append([])

  # Place a copy of the model on each GPU,
  # each getting a slice of the inputs.
  for i, gpu_id in enumerate(target_gpu_ids):
    with tf.device('/gpu:%d' % gpu_id):
      with tf.name_scope('replica_%d' % gpu_id):
        inputs = []
        # Retrieve a slice of the input.
        for x in model.inputs:
          input_shape = tuple(x.get_shape().as_list())[1:]
          slice_i = Lambda(get_slice,
                           output_shape=input_shape,
                           arguments={'i': i,
                                      'parts': num_gpus})(x)
          inputs.append(slice_i)

        # Apply model on slice
        # (creating a model replica on the target device).
        outputs = model(inputs)
        if not isinstance(outputs, list):
          outputs = [outputs]

        # Save the outputs for merging back together later.
        for o in range(len(outputs)):
          all_outputs[o].append(outputs[o])

  # Merge outputs on CPU.
  with tf.device('/cpu:0'):
    merged = []
    for name, outputs in zip(model.output_names, all_outputs):
      merged.append(concatenate(outputs,
                                axis=0, name=name))
    return Model(model.inputs, merged)

Plus, while loading the model, pass in the tensorflow object, like this:

model = load_model('multi_gpu_model.h5', {'tf': tf})

When the bug is fixed in keras, you'll only need to import the right multi_gpu_model.

@maxim5 Nice! Any explanation as to why this works?

Edit

Actually, is anyone able to get a working network with this method (more than one layer)? I mean, it saves the weights, but I get shitty results in the end; do you get good results this way?
I'm curious: I used the load_weights method instead of load_model, and I'd like to know whether it is load_weights that is bugged or whether the previous solution just hides the problem (or a mistake on my side, which is also possible).

@D3lt4lph4 the problem is with this line in the keras code, as already discussed above:

def multi_gpu_model(model, gpus):
  ...
  import tensorflow as tf
  ...

This creates a closure for the get_slice lambda function, which includes the number of GPUs (that's ok) and the tensorflow module (not ok). Saving the model tries to serialize all layers, including the ones that call get_slice, and fails exactly because tf is in the closure.

My solution is to move the import out of multi_gpu_model, so that tf becomes a global object, though it is still needed for get_slice to work. This fixes the saving problem, but when loading one has to provide tf explicitly. I'm sure the last part could be handled by Keras itself to make it look seamless.
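For anyone who wants to see the failure mode in isolation, here is a minimal, hypothetical sketch using only the standard library (the dict merely stands in for a Lambda layer config that has captured a module in its closure; the real Keras config is more involved):

import copy
import types  # any module object reproduces the failure; tensorflow is just one example

# deepcopy has no copier registered for module objects, so it falls back
# to __reduce_ex__ (pickle), which refuses to handle modules.
config = {'arguments': {'parts': 2}, 'closure': (types,)}

try:
    copy.deepcopy(config)
except TypeError as err:
    print(err)  # on Python 3.6: "can't pickle module objects"

This is exactly the shape of the traceback in the original report: deepcopy walks dicts and tuples until it hits the module and dies in reductor.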

@maxim5 Thanks for the explanation! I think I understand the problem, but then why is there no problem when using the model for training? The closure problem should still be there, no? (Sorry for all the questions.)

Also, in order to fix this, wouldn't it be possible to keep a reference to the original model in the multi-GPU one and then make the saving functions of the multi-GPU model call the saving functions of the original model? It wouldn't solve the problem itself, but it would at least hide everything from the end user.

The problem is with serialization. The Keras version works until the model is dumped into JSON.

Of course, keeping a reference to the original model solves it. But my intention was to be able to load the model from disk, and you can't keep a reference on disk.

Okay thanks.

I may be missing something, but I never mentioned saving a reference; that would make no sense. But since you can save a multi-GPU model through the non-GPU one, there could be some kind of "translation" (no idea if there is a word for that) to use the saving function.
Actually it was more of a hidden question, like "Do you think this would be accepted as a pull request?" ^^

Sorry, I misunderstood your phrase:

wouldn't it be possible to keep a reference to the original model in the multi_gpu ...

I'd make it a pull request if load_model worked seamlessly. There are different ways to do so, and at this point I'm not sure what the "keras way" is. I'll gladly discuss it with the team.

Why do we need to save the model while training? The model is not changing at all while training. Am I wrong? Just init your ModelCheckpoint with save_weights_only=True.

@stavBodik Well, you may want to stop your training midway, in which case you may want to save your whole model. Also, just setting save_weights_only=True will not save the weights correctly unless you redefine multi_gpu_model or use a trick similar to mine (unless I am being very unlucky and the model I use triggers a bug). In that case you can save either the weights or the whole model, making no real difference in the end.

@maxim5 Do you know people on the Keras team? Or is there any way to ask them? I did not find one.

@D3lt4lph4 Thanks for the answer.

Well, you may want to stop your training midway, in which case you may want to save your whole model.

This is exactly my question: what do you mean by midway? The model is constant.
Do you mean stopping the training when the weights are calculated halfway through the layers of the model?
If so, that's not a problem for me; I can just begin the same epoch again. Am I right? If I have the weights up to epoch x, I can stop and continue from epoch x, because I saved the model once at the beginning of training. (By the way, when I say "model" I mean only the structure of the model: the layers and their types.)

Also, just setting save_weights_only=True will not save the weights correctly unless you redefine multi_gpu_model

Do you mean Keras has a bug? What's the point of releasing multi-GPU functionality if you can't save the weights correctly?

All my questions are because of the need to train using the function:
from keras.utils.training_utils import multi_gpu_model

Thanks!

@stavBodik
What I mean by midway is: let's say you're at epoch 50/100 and you stop there. Yes, you can do that. But say you have an application using a model for some prediction and you want to test different models; it is much easier to just load the whole thing rather than redefining the model in your code and then loading the weights.
But in the end this could be an endless debate, because it strongly depends on each case: if you're not sharing the GPU resources you can make do with only saving the weights; if you have to share, it can be troublesome; etc.

Do you mean Keras has a bug? What's the point of releasing multi-GPU functionality if you can't save the weights correctly?

Yes! I will try to summarize everything from the start.

There is a bug in multi_gpu_model which prevents you from saving the model (that's @maxim5's explanation). The Keras team is aware of it; that is why they recommend saving the non-multi-GPU model (that's @Weixing-Zhang's answer). Now, when you want to set up a callback function to save the weights (because when you train your model for a whole day or more, you don't just want to hope that everything goes smoothly), you logically redefine the ModelCheckpoint callback to save the non-multi-GPU model. But (that's where my answer comes in) if you set self.model in your new callback, it won't work, because Keras resets that variable to the multi-GPU model. Now, with @maxim5's answer you avoid all of that by removing the bug itself.
And I just remembered reading something about this fix not being implemented because it would import tensorflow, whereas Keras was/is meant to be multi-backend (if I remember well).

Now, for information, when you simply use the save_weights function, you get the error described by kuba-lilz in #8253 when you reload afterwards:

ValueError: You are trying to load a weight file containing 1 layers into a model with 19 layers.

I think that's it; if I'm missing something, please correct me.

Hope this helps!

@D3lt4lph4

What I mean by midway is: let's say you're at epoch 50/100 and you stop there. Yes, you can do that. But say you have an application using a model for some prediction and you want to test different models; it is much easier to just load the whole thing rather than redefining the model in your code and then loading the weights.

But I am saving the model once before starting the training (on the non-multi-GPU model) so I can load it at prediction time. Why should I save it again while training?

The parallelism should only affect the forward and backward passes, on the summation part of the gradient calculation in each iteration; the weight matrix structure should stay the same as if we trained with a single GPU. Since the weight saving works correctly, why should I care about saving the model while training?

"- Divide the model's input(s) into multiple sub-batches.
- Apply a model copy on each sub-batch. Every model copy
is executed on a dedicated GPU.
- Concatenate the results (on CPU) into one big batch I understand that this step is bugged ? the
weights are not saved correctly ? they forgot to merge or average the weights between GPU's?

"

The weights are loaded after the multi-GPU step; they have nothing to do with the sum parallelism.
Why should I save the model while training?

if use_multi_gpu:
    model = multi_gpu_model(model, gpus=NUM_OF_GPUS)
if load_weights:
    model.load_weights(weights_to_load_path, by_name=True)

@stavBodik
Oh, sorry, I did not understand it that way. Well then, yes, you have no reason to re-save it. I had actually never thought of doing it this way; thanks for the idea :p

Okay, so I just saw this:

model.load_weights(weights_to_load_path, by_name=True)

You should remove the by_name=True; it can hide the error by only loading one layer into your model. I'm guessing you're not getting good prediction results when reloading your saved weights?

I understand that this last step is bugged? The weights are not saved correctly? Did they forget to merge or average the weights between the GPUs?

Look at @maxim5's answers; he explained it quite well: there is a problem when serializing your model/weights for saving.

Why should I save the model while training?

With your solution, no real reason; it's purely a matter of design/preference/situation, and there is no good or bad answer.

@D3lt4lph4 Thanks again,

You should remove the by_name=True; it can hide the error by only loading one layer.

Thanks for the great advice.

I'm guessing you're not getting good prediction results when reloading your saved weights?

I will check the results tomorrow; meanwhile I can see that the validation loss is decreasing well, but I guess that the weights calculated after the merge between GPUs are wrong
(in the "Concatenate the results (on CPU) into one big batch" step).

I don't believe it is a problem in saving/serializing to/from JSON, because if the merge of the weights is good, the process of saving them to a file (on the CPU) should stay the same.

Look at @maxim5's answers; he explained it quite well: there is a problem when serializing your model/weights for saving.

I did read it already; just saying the words "serializing" and "JSON" doesn't explain the real bug.

With your solution, no real reason; it's purely a matter of design/preference/situation, and there is no good or bad answer.

Exactly. This boolean is not useful, only confusing, and exists for some other strange/unknown/misunderstood reason.

@D3lt4lph4 Do you think that the weights saved by the GPU model after the merge step are different from the CPU model's?
But @fchollet said: "Both models share the same weights."
I have not checked the results yet, but I am afraid I should take care to save the weights using the CPU model, because, as you mentioned before:

Inside the fit function:

        # it's possible to callback a different model than self
        # (used by Sequential models)
        if hasattr(self, 'callback_model') and self.callback_model:
            callback_model = self.callback_model
        else:
            callback_model = self 

        callbacks.set_model(callback_model)

callback_model = self sets the GPU model in the callbacks (and that could be a problem).

So I guess this should work:

modelGPU.__setattr__('callback_model', modelCPU)
# now we can train as normal and the weight saving in our callbacks will be done by the CPU model
modelGPU.fit_generator( . . .

@stavBodik An error arose when setting modelGPU.__setattr__('callback_model', modelCPU):

Traceback (most recent call last):
  File "retrain_full.py", line 724, in <module>
    main()
  File "retrain_full.py", line 595, in main
    initial_epoch=current_epoch)
  File "/data/project/anaconda2/lib/python2.7/site-packages/keras/legacy/interfaces.py", line 87, in wrapper
    return func(*args, **kwargs)
  File "/data/project/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 2111, in fit_generator
    callbacks.on_epoch_begin(epoch)
  File "/data/project/anaconda2/lib/python2.7/site-packages/keras/callbacks.py", line 59, in on_epoch_begin
    callback.on_epoch_begin(epoch, logs)
  File "/data/project/anaconda2/lib/python2.7/site-packages/keras/callbacks.py", line 570, in on_epoch_begin
    if not hasattr(self.model.optimizer, 'lr'):
AttributeError: 'Model' object has no attribute 'optimizer'

If you know the reason, please reply. Thank you!

I also faced a multi_gpu issue when trying to resume the model: Resume training with multi_gpu_model in Keras

@maxim5 Hello, I used the approach you described. It went well for the first epoch, but didn't work at the beginning of the second epoch, and the error is:

InternalError (see above for traceback): CUB reduce error invalid configuration argument
[[Node: metrics/acc/Mean_1 = Mean[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](metrics/acc/Mean, metrics/acc/Const)]]
[[Node: metrics/acc/Mean_1/_467 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3785_metrics/acc/Mean_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I have a simplified callback for doing model checkpointing for multiple/single gpu models.

from keras.models import Model
from keras.callbacks import ModelCheckpoint

class MultiGPUCheckpoint(ModelCheckpoint):

    def set_model(self, model):
        if isinstance(model.layers[-2], Model):
            self.model = model.layers[-2]
        else:
            self.model = model

Because this inherits from ModelCheckpoint, you can use it in place of the ModelCheckpoint callback during fit/fit_generator.

model.fit(X_train, y_train,
    callbacks=[
        MultiGPUCheckpoint('model.h5', save_best_only=True)
    ]
)

Of course if you have a custom model containing another model in the second to last layer, this method is not going to do what you want.

I'm looking forward to trying this @jjangsangy - thank you!

@yunkchen I guess you need to compile first:

import tensorflow as tf
with tf.device("/cpu:0"):
    model = build_model()

gpu_model = multi_gpu_model(model, gpus=2)
gpu_model.compile(optimizer=Adam(lr=0.0001,clipnorm = 0.5),loss=binary_accuracy)
gpu_model.__setattr__('callback_model',model)

Hi,
I have tried some (or all) of the solutions above; chances are that I used them wrong, but these have been unpleasant experiences. So, will there be any OFFICIAL support for easy model checkpoint saving? To me this is simply necessary, and it makes no sense to leave this issue unsolved for such a long time.

Had the same problem. Solved using #11313.

Closing as this is resolved

In my experience the issue is, as clearly explained by @maxim5, this import: https://github.com/keras-team/keras/blob/d059890d0342955e968fdf97b5a90d19c9d68b4e/keras/utils/multi_gpu_utils.py#L173

In my experience, you should use "from keras.utils import multi_gpu_model" to import multi_gpu_model,
not "from keras.utils.training_utils import multi_gpu_model".

my environment is:

>>> import tensorflow as tf 
>>> import keras as K 
Using TensorFlow backend.
>>> tf.__version__
'1.11.0-rc0'
>>> K.__version__
'2.2.2'

I get the following error log when saving the model:

Traceback (most recent call last):
  File "scripts/train.py", line 169, in <module>
    train()
  File "scripts/train.py", line 164, in train
    verbose=2)
  File "/usr/local/lib/python2.7/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1415, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training_generator.py", line 247, in fit_generator
    callbacks.on_epoch_end(epoch, epoch_logs)
  File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 77, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
  File "/usr/local/lib/python2.7/dist-packages/keras/callbacks.py", line 444, in on_epoch_end
    self.model.save(filepath, overwrite=True)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 1085, in save
    save_model(self, filepath, overwrite, include_optimizer)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/saving.py", line 116, in save_model
    'config': model.get_config()
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/network.py", line 926, in get_config
    return copy.deepcopy(config)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/usr/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/usr/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
TypeError: can't pickle NotImplementedType objects

Reference official documentation:
1. https://github.com/keras-team/keras/blob/d059890d0342955e968fdf97b5a90d19c9d68b4e/keras/utils/multi_gpu_utils.py#L69
2. https://keras.io/getting-started/faq/#how-can-i-run-a-keras-model-on-multiple-gpus

from keras.utils import multi_gpu_model

# Replicates `model` on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')

# This `fit` call will be distributed on 8 GPUs.
# Since the batch size is 256, each GPU will process 32 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)

Therefore, the explanation in https://github.com/keras-team/keras/issues/8123#issuecomment-355049313 should match my experience, but the Keras source code

def multi_gpu_model(model, gpus):
  ...
  import tensorflow as tf
  ...

should be OK.

@maxim5 You said that the issue would be solved, but I don't know if it has been. Is it solved?

I am facing a similar issue with saving/loading multi_gpu_model here

@ParikhKadam I'm not on the Keras team and I didn't promise the fix. Based on the latest activity here it has been fixed, but unfortunately I don't have any additional information. Please ask the team.

@maxim5 Thank you.. At least you replied. I have tried contacting the team first, but they never reply here. Though I will solve my problem either way. Thank you..

@ParikhKadam yes, they usually don't. Did you try to contact them on the Slack channel? I don't think it will make a difference, but at least you'll leave no stone unturned.

@AndreaPi Yes.. I tried once some months ago but got no response. I am still using Keras only because I am about to complete my whole project in this framework. I would then switch to Torch/Caffe.

I have a simplified callback for doing model checkpointing for multiple/single gpu models. [...] (quoting the MultiGPUCheckpoint comment above)

I had the same problem.
But with Keras 2.2.4 and TensorFlow 1.12, just checkpoint the weights using the Keras ModelCheckpoint.
Then, when you want to load the weights, load the architecture (from JSON or YAML), convert the loaded architecture into a multi-GPU model (with the same config) using Keras's multi_gpu_model, and then load the weights.
This works fine.
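A minimal sketch of that load sequence (file names and the GPU count are placeholders; it assumes the checkpoint was written by ModelCheckpoint attached to the parallel model):

from keras.models import model_from_json
from keras.utils import multi_gpu_model

# Rebuild the template architecture from its serialized config.
with open('model.json') as f:
    template = model_from_json(f.read())

# Wrap it with the same GPU count used during training, then load the
# checkpointed weights into the parallel model.
parallel = multi_gpu_model(template, gpus=2)
parallel.load_weights('weights.h5')
parallel.compile(loss='categorical_crossentropy', optimizer='rmsprop')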

So I guess this should work:

modelGPU.__setattr__('callback_model', modelCPU)
# now we can train as normal and the weight saving in our callbacks will be done by the CPU model
modelGPU.fit_generator( . . .

@stavBodik I went through almost all the solutions, and from my point of view yours is the best, because I want to use the default Keras callback to save the model weights. I just verified it with a dummy training run and by loading the model in the test script.

However, just one small issue. After loading the model in the test script, I get the following warning:

..\lib\site-packages\keras\engine\saving.py:292: UserWarning: No training configuration found in save file: the model was not compiled. Compile it manually.
warnings.warn('No training configuration found in save file: '

I have no issues compiling it. However, do you know if there's a fix for this without needing to do that manually? It would save the headache of saving additional meta-information for each training run.

It seems that this issue is not solved for Keras 2.1.2.
Referring to ChristofHenkel's reply, the model can be saved during training, but the format of the saved model is not correct when loaded for inference.

I solved this problem with the following approach. We need to use the multi-GPU model in our other callbacks for performance reasons, but we also need the template model for ModelCheckpoint and some other callbacks. For that reason, we made a tiny adapter called AltModelCheckpoint that wraps ModelCheckpoint, with the checkpointed model being explicitly specified.

Installation is easy:

pip install alt-model-checkpoint

from alt_model_checkpoint import AltModelCheckpoint
from keras.models import Model
from keras.utils import multi_gpu_model

base_model = Model(...)
gpu_model = multi_gpu_model(base_model)
gpu_model.compile(...)
gpu_model.fit(..., callbacks=[
    AltModelCheckpoint('save/path/for/model.hdf5', base_model)
])

Enjoy! :)

In case anyone still has the same issue, I've got a hilarious workaround.

Check my code and use the same function for saving the model while tuning it on multiple GPUs.
It'll save the model, but you can't use that model to fine-tune it on multiple GPUs again.
But:

  1. Load the checkpoint on a single GPU, make the model do several steps, and save the new model (basically update it).
  2. Now try to load the newly generated model weights in the multi-GPU script, and it'll continue training.

Keras version: 2.2.4
Tensorflow-gpu: 1.13.1

    model = Model(init, out, name='Inception-v4')

    if check == True:
        weights = checkpoint_path
        model.load_weights(weights, by_name=True)
        print("Model weights loaded.")

    return model

model = create_inception_v4(load_weights=check)

if int(args['gpus']) > 1:
    model = multi_gpu_model(model, gpus=int(args['gpus']))

model.summary()

train_dir = str(args['train_dir'])
val_dir = str(args['val_dir'])

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

datagen=ImageDataGenerator(rescale=1/255,
            rotation_range=40,
            width_shift_range=0.1,
            height_shift_range=0.1,
            shear_range=0.1,
            zoom_range=0.1,
            horizontal_flip=True,
            fill_mode='nearest',
            samplewise_std_normalization=True)

val_datagen = ImageDataGenerator(rescale=1/255)

train_generator = datagen.flow_from_directory(train_dir,target_size=(299,299),class_mode="categorical")
val_gen = val_datagen.flow_from_directory(val_dir,target_size=(299,299),class_mode="categorical")  # use the un-augmented generator for validation

mc = keras.callbacks.ModelCheckpoint("inceptionv4_checkpoints/InceptionV4.h5",save_best_only=True, save_weights_only=True)
tensorboard = TensorBoard(log_dir="{}/{}".format(args["log_dir"], time()))


model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.SGD(lr=float(args['learning_rate']), decay=1e-6, momentum=0.9, nesterov=True), metrics=["accuracy"])
hist = model.fit_generator(train_generator,steps_per_epoch=int(args['steps_per_epoch']),epochs=int(args['epochs']),verbose=True,validation_data=val_gen,validation_steps=10,callbacks=[mc, tensorboard])

If you save the base model, not the multi-GPU model, then you can load it as usual and construct the multi-GPU model from it as usual, and training takes off from where it left off. At least, it works for me.
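A minimal sketch of that resume flow (paths, GPU count, and training data are placeholders; it assumes the base model was saved with model.save, as recommended earlier in the thread):

from keras.models import load_model
from keras.utils import multi_gpu_model

# Load the saved base (template) model: architecture, weights, and optimizer state.
base_model = load_model('base_model.h5')

# Re-wrap it for multi-GPU training; the parallel model shares the base model's weights.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
parallel_model.fit(x_train, y_train, epochs=10)  # resumes from the saved weights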

I have a problem with loading base-model weights into the multi-GPU model. I trained my base model in single-GPU mode (without using multi_gpu_model). But when I load the checkpoints to continue training using multi_gpu_model (by just loading the base-model weights from the checkpoint files), it seems the weights are not applied and the model starts training with its initial weights.
Is there any idea on how to use base-model checkpoints generated in default single-GPU mode to continue with multi_gpu_model?
