Keras: How to save the model/weights trained by every epoch when using multi_gpu_model

Created on 12 Nov 2017 · 14 comments · Source: keras-team/keras

With multi_gpu_model, I used the following code (with the TensorFlow backend) to save the weights after each epoch:

model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)

model.compile(optimizer = Adam(lr = 2e-4), loss = 'categorical_crossentropy', metrics = ['categorical_accuracy'])
model.save("./UNet.h5")

checkpoint = ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5', save_weights_only=True)

logger = CSVLogger(os.path.join(".", "training.log"))
history = LossHistory()

parallel_model.compile(optimizer = Adam(lr = 2e-4), loss = 'categorical_crossentropy', metrics = ['categorical_accuracy'])
parallel_model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid), epochs=n_epoches,     
                            batch_size=batch_size, verbose=2,
                            callbacks=[checkpoint,logger,history]   )

################in testing.py
model=Unet(...)
model.load_weights('./weights.0001-0.16.hdf5')

but it gives an error:

Traceback (most recent call last):
  File "testing.py", line 131, in <module>
    model.load_weights('weights.0001-0.16.hdf5')
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2622, in load_weights
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 3115, in load_weights_from_hdf5_group
ValueError: You are trying to load a weight file containing 1 layers into a model with 71 layers.

But it works if I just use model=load_model("./UNet.h5"), which I think contains the weights from the final epoch. However, I would like to see the weights from each epoch.

I also tried to save the full model after each epoch, using the following code:

model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)

model.compile(optimizer = Adam(lr = 2e-4), loss = 'categorical_crossentropy', metrics = ['categorical_accuracy'])
model.save("./UNet.h5")

checkpoint = ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5')
logger = CSVLogger(os.path.join(".", "training.log"))
history = LossHistory()

parallel_model.compile(optimizer = Adam(lr = 2e-4), loss = 'categorical_crossentropy', metrics = ['categorical_accuracy'])
parallel_model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid), epochs=n_epoches,     
                            batch_size=batch_size, verbose=2,
                            callbacks=[checkpoint,logger,history]   )

################in testing.py
model=load_model('./weights.0001-0.16.hdf5')

This time, it gives an error:

Traceback (most recent call last):
  File "testing.py", line 132, in <module>
    model=load_model("./weights.0001-0.16.hdf5")
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/models.py", line 240, in load_model
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/models.py", line 314, in model_from_cong
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/__init__.py", line 55, in deseriize
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 139, in derialize_keras_object
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2490, in fromonfig
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2476, in procs_layer
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/__init__.py", line 55, in deseriize
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 139, in derialize_keras_object
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/core.py", line 699, in from_conf
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 206, in fc_load
TypeError: arg 5 (closure) must be None or tuple

So, could you please give an example of saving the model/weights after each epoch when using multi_gpu_model?

Thanks.


Please make sure that the boxes below are checked before you submit your issue. If your issue is an implementation question, please ask your question on StackOverflow or join the Keras Slack channel and ask there instead of filing a GitHub issue.

Thank you!

  • [x] Check that you are up-to-date with the master branch of Keras. You can update with:
    pip install git+git://github.com/fchollet/keras.git --upgrade --no-deps

  • [x] If running on TensorFlow, check that you are up-to-date with the latest version. The installation instructions can be found here.

  • [ ] If running on Theano, check that you are up-to-date with the master branch of Theano. You can update with:
    pip install git+git://github.com/Theano/Theano.git --upgrade --no-deps

  • [ ] Provide a link to a GitHub Gist of a Python script that can reproduce your issue (or just copy the script here if it is short).

All 14 comments

Dirty working solution:

def detachmodel(m):
    """ Detach model trained on GPUs from its encapsulation
    # Arguments
        :param m: obj, keras model
    # Returns
        :return: obj, keras model
    """
    for l in m.layers:
        if l.name == 'model_1':
            return l
    return m

Call this function inside the checkpoint callback:

import warnings

import numpy as np
from keras.callbacks import Callback


class ModelCheckpointDetached(Callback):
    """ Save detached from multi-GPU encapsulation model
    (very small) modification from https://github.com/fchollet/keras/blob/master/keras/callbacks.py#L331

    `filepath` can contain named formatting options,
    which will be filled the value of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).

    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
    """

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1):
        super(ModelCheckpointDetached, self).__init__()
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % mode, RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % self.monitor, RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            detachmodel(self.model).save_weights(filepath, overwrite=True)
                        else:
                            detachmodel(self.model).save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch, filepath))
                if self.save_weights_only:
                    detachmodel(self.model).save_weights(filepath, overwrite=True)
                else:
                    detachmodel(self.model).save(filepath, overwrite=True)
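
Putting it together, usage might look roughly like this (a sketch reusing the placeholders from the original post; Unet(...), the data arrays and the hyper-parameters are not defined here):

```
from keras.optimizers import Adam
from keras.utils import multi_gpu_model

model = Unet(...)                               # template model (placeholder)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer=Adam(lr=2e-4),
                       loss='categorical_crossentropy',
                       metrics=['categorical_accuracy'])

# Saves the detached template weights, so the files load into a plain model.
checkpoint = ModelCheckpointDetached('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                                     save_weights_only=True)
parallel_model.fit(x=x_train, y=y_train,
                   validation_data=(x_valid, y_valid),
                   epochs=n_epoches, batch_size=batch_size,
                   callbacks=[checkpoint])

# Later, in testing.py:
model = Unet(...)
model.load_weights('./weights.0000-0.16.hdf5')  # exact name depends on epoch/val_loss
```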

@OliPhilip, Thank you, I will try it tomorrow.

Or you could just pass in the template model and replace all the save calls on self.model with calls on the template model.

```
import warnings

import numpy as np
from keras.callbacks import Callback


class MultiGPUCheckpointCallback(Callback):

    def __init__(self, filepath, base_model, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1):
        super(MultiGPUCheckpointCallback, self).__init__()
        self.base_model = base_model
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % (mode),
                          RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch + 1, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch + 1, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.base_model.save_weights(filepath, overwrite=True)
                        else:
                            self.base_model.save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch + 1, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch + 1, filepath))
                if self.save_weights_only:
                    self.base_model.save_weights(filepath, overwrite=True)
                else:
                    self.base_model.save(filepath, overwrite=True)
```
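
With that in place, usage is just a matter of handing the template model to the callback and the wrapped model to fit (a minimal sketch using the variable names from the original post):

```
checkpoint = MultiGPUCheckpointCallback('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                                        base_model=model,   # the template Unet
                                        save_weights_only=True)

parallel_model.fit(x=x_train, y=y_train,
                   validation_data=(x_valid, y_valid),
                   epochs=n_epoches, batch_size=batch_size,
                   callbacks=[checkpoint, logger, history])
```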

@kevinfaust0308 this callback works perfectly for me! thanks!

@kevinfaust0308 Thank you! works like a charm.

@OliPhilip what does it mean that we only save the weights from the layer named 'model_1'? Are those weights shared across all the parallel models? Thanks!

The correct way is to make the callbacks inside the GPU model use the model that was built on the CPU:

modelGPU = multi_gpu_model(modelCPU, gpus = NUM_OF_GPUS)
modelGPU.__setattr__('callback_model',modelCPU)
modelGPU.fit_generator.....
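
Spelled out a bit more (a sketch; it assumes Keras routes callbacks through the callback_model attribute as described above, and reuses the placeholders from the original post):

```
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras.utils import multi_gpu_model

modelCPU = Unet(...)                                    # template model (placeholder)
modelGPU = multi_gpu_model(modelCPU, gpus=NUM_OF_GPUS)
modelGPU.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy')

# Make the callbacks operate on the CPU/template model instead of the wrapper,
# so a plain ModelCheckpoint saves weights that load into a single-GPU model.
modelGPU.__setattr__('callback_model', modelCPU)

checkpoint = ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                             save_weights_only=True)
modelGPU.fit(x_train, y_train,
             validation_data=(x_valid, y_valid),
             epochs=n_epoches, batch_size=batch_size,
             callbacks=[checkpoint])
```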

@kevinfaust0308 Do you mean use it as checkpoint = MultiGPUCheckpointCallback('./weights.{epoch:04d}-{val_loss:.2f}.hdf5', save_weights_only=True) ?

@OliPhilip I had the same idea to extract the base model. I found the layer names unreliable in my case so I used the class. Of course this isn't going to work if you have multiple embedded models.

from keras.engine.training import Model
...

def extract_base_model(m):
    """ Extract base model if encapsulated.
    # Arguments
        :param m: obj, keras model
    # Returns
        :return: obj, keras model
    """
    for l in m.layers:
        if isinstance(l, Model):
            return l
    return m
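
Usage is the same idea as detachmodel above, for example saving the base weights after training (a small sketch):

```
parallel_model = multi_gpu_model(model, gpus=4)
# ... training ...
extract_base_model(parallel_model).save_weights('./base_weights.h5', overwrite=True)
```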

@stavBodik The only problem with this is that all other callbacks will also use the "template" model, which may not be desirable (unintended side effects). It's too bad Keras doesn't provide an argument on the underlying Callback class for specifying the model on a per-callback basis, but I think I might try to submit a change to facilitate that, if not for the base class then at least for ModelCheckpoint.

@evictor What is the status of this issue these days?

Using the latest Keras version, I'm able to save the multi_gpu_model's weights. I can then successfully load them with load_weights into a similar multi-GPU model. But I'm struggling to transform these "multi-GPU" weights into "single GPU" or "CPU" weights. What I have tried looks like this:

# set up my basic model
model.compile(...)
model_to_write = model

# write it up to reload it easily after the fitting (I like YAML)
open('model.yaml', 'w').write(model.to_yaml())

# setting it up for multiple GPUs
model = keras.utils.multi_gpu_model(model_to_write, gpus=4, cpu_relocation=True)

model.fit(..., callbacks=[keras.callbacks.ModelCheckpoint(
                filepath='multi_weights.h5',
                save_best_only=True,
                save_weights_only=True,
                verbose=1)])

# reload the best weights of the multi_gpu_model
model.load_weights('multi_weights.h5')
# invoke save_weights on the original model
model_to_write.save_weights('weights.h5')
[...]
Epoch 00034: val_loss improved from -0.93578 to -0.93615, saving model to multi_weights.h5
[...]
Epoch 00084: val_loss did not improve from -0.93615
Epoch 00084: early stopping

After doing that, I tried, in another python session, to reload the weights and recompute the loss on my validation set.

model = keras.models.model_from_yaml(open('model.yaml', 'r'))
model.load_weights('weights.h5')
y_pred = model.predict(x_validation)
print(K.eval(loss(y_pred, y_validation)))

This gave me very bad results, so I think calling the save_weights method on the original model did not work. I also tried model.load_weights('multi_weights.h5', by_name=True), which didn't raise errors but resulted in the same (awful) loss values.

However, I could find the loss of my "best" model this way:

multi_model = multi_gpu_model(model, gpus=4, cpu_relocation=True)
multi_model.load_weights('multi_weights.h5')
y_pred = multi_model.predict(x_validation)
print(K.eval(loss(y_pred, y_validation)))

In fact, I even had slightly better values for the loss than in the standard output of the fitting script. I guess this is due to numerical quirks…

To sum up, I trained a model on 4 GPUs and it works great, but I cannot figure out how to use the saved weights when I only have 0 or 1 GPU available…

Why doesn't my way of reloading the best weights into the multi-GPU model and then saving the original model work? Should I use @OliPhilip's dirty solution? Any help is appreciated…
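
A sketch of one possible workaround, reusing the MultiGPUCheckpointCallback posted earlier in this thread: checkpoint the template model directly during training (with save_best_only), so no conversion is needed afterwards.

```
# Sketch only: the "..." compile/fit arguments stand for the same placeholders
# as in the snippet above; MultiGPUCheckpointCallback is the callback from earlier.
model_to_write.compile(...)                       # template model
model = keras.utils.multi_gpu_model(model_to_write, gpus=4, cpu_relocation=True)
model.compile(...)

checkpoint = MultiGPUCheckpointCallback('weights.h5',
                                        base_model=model_to_write,
                                        save_best_only=True,
                                        save_weights_only=True,
                                        verbose=1)
model.fit(..., callbacks=[checkpoint])

# 'weights.h5' now holds the best template weights, so in a new session:
#   model = keras.models.model_from_yaml(open('model.yaml', 'r'))
#   model.load_weights('weights.h5')
```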

@truenicoco I made a little adapter to save an alternative model: https://github.com/TextpertAi/alt-model-checkpoint

I’m not sure I see the problem with your script but you are doing something weird loading weights after training like that. You shouldn’t have to reload weights in the script there; both the multi GPU model and your template model should have the latest weights after that fit call.

I think my alt model checkpoint adapter will suit you though because it will let you save the template model weights directly.

Hope this helps, on a mobile phone right now so can’t do a lot

Edit, blog post for more info: https://www.textpert.ai/aime-blog/saving-multi-gpu-models-with-keras-modelcheckpoint

I should have mentioned it, but the reason I reload the weights is that I use the save_best_only option, which I believe is critical in my case to avoid overfitting. Thank you for the link and for the package; even if I don't end up using it, it will definitely be useful.

Ah, I see. Well yes, alt-model-checkpoint should help you with this scenario—it will let the template model's best weights be saved which should then load in fine. I _think_ it should be a drop-in fix for you.

It's surprising to me that your ad hoc predictions were way off even though weight loading worked; I think usually if weights were messed up you would see errors during loading, leading me to believe there might be something else at play, like incorrect data passed to predict().
