Keras: ModelCheckpoint does not work with multi_gpu_model

Created on 21 Dec 2017 · 5 Comments · Source: keras-team/keras

keras.utils.multi_gpu_model(model, 5) does not work well with the ModelCheckpoint callback. It throws a "cannot serialize IO object" error. I think I understand why this might be happening, since multiple copies of the same model span my GPUs, but I am not sure how to fix it.

Any workarounds? It works awesome otherwise.

EDIT: Closing this issue. Saving weights works just fine.


pGit1

👍1

Most helpful comment

I solved this problem with the following approach.
We need to use the multi-GPU model with our other callbacks for performance reasons, but we also need the template model for ModelCheckpoint and some other callbacks. For that reason, we made a tiny adapter called AltModelCheckpoint that wraps ModelCheckpoint so that the model to be checkpointed is specified explicitly.

Installation is easy

pip install alt-model-checkpoint

from alt_model_checkpoint import AltModelCheckpoint
from keras.models import Model
from keras.utils import multi_gpu_model

# Template model: the one that should be checkpointed.
base_model = Model(...)
# Multi-GPU version of the template, used for training.
gpu_model = multi_gpu_model(base_model)
gpu_model.compile(...)
gpu_model.fit(..., callbacks=[
    AltModelCheckpoint('save/path/for/model.hdf5', base_model)
])
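
The idea behind such an adapter can be sketched as a small custom callback that holds a reference to the template model and saves from it instead of from the multi-GPU model. This is only an illustrative sketch, not the implementation of the alt-model-checkpoint package: the class name `TemplateWeightsCheckpoint` is hypothetical, it saves weights only, and the fallback `Callback` stub exists solely so the snippet runs where Keras is not installed.

```python
try:
    # Real base class when Keras is installed.
    from keras.callbacks import Callback
except ImportError:
    # Stand-in so the sketch runs without Keras (assumption for illustration).
    class Callback(object):
        def __init__(self):
            self.model = None


class TemplateWeightsCheckpoint(Callback):
    """Hypothetical callback: saves the template model's weights each epoch."""

    def __init__(self, filepath, template_model):
        super(TemplateWeightsCheckpoint, self).__init__()
        self.filepath = filepath
        self.template_model = template_model

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Same filename convention as ModelCheckpoint: 1-based epoch plus logs.
        path = self.filepath.format(epoch=epoch + 1, **logs)
        # Save from the template, not from self.model (the multi-GPU wrapper).
        self.template_model.save_weights(path)
```

It would be passed to `gpu_model.fit(..., callbacks=[TemplateWeightsCheckpoint(path, base_model)])` in the same way as the adapter above.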

Enjoy.....! :)

bordeprashant on 25 Apr 2019

👍3

All 5 comments

pengpaiSH on 27 Dec 2017

@pGit1 Hi, can the ModelCheckpoint callback now be used to save checkpoints of the multi-GPU model correctly?

hellojialee on 7 Feb 2018

Yes it does. Refer to this: https://github.com/keras-team/keras/issues/2436#issuecomment-354882296

This should help you.

pGit1 on 10 Feb 2018

I solved the problem in the following way. I changed a few lines in the Keras source code (specifically in topology.py/network.py and callbacks.py). The modified code is shown below.

Reminder: If you use an older version of Keras, you need to replace the 'saving.save_weights_to_hdf5_group(f, layers)' calls with 'save_weights_to_hdf5_group(f, layers)'.

network.py:

def save_weights(self, filepath, overwrite=True, multiple_gpu=False, name_of_model=None):
    """Dumps all layer weights to a HDF5 file.

    `name_of_model` is usually `model_1`; you can check the name by calling
    `summary()` on the model returned by `multi_gpu_model`.

    The weight file has:
        - `layer_names` (attribute), a list of strings
            (ordered names of model layers).
        - For every layer, a `group` named `layer.name`
            - For every such layer group, a group attribute `weight_names`,
                a list of strings
                (ordered names of weights tensor of the layer).
            - For every weight in the layer, a dataset
                storing the weight value, named after the weight tensor.

    # Arguments
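        filepath: String, path to the file to save the weights to.
        overwrite: Whether to silently overwrite any existing file at the
            target location, or provide the user with a manual prompt.
        multiple_gpu: Whether `self` is a model returned by `multi_gpu_model`.
        name_of_model: Name of the nested template model whose layers are
            saved when `multiple_gpu` is True.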

    # Raises
        ImportError: If h5py is not available.
    """
    if h5py is None:
        raise ImportError('`save_weights` requires h5py.')
    # If file exists and should not be overwritten:
    if not overwrite and os.path.isfile(filepath):
        proceed = ask_to_proceed_with_overwrite(filepath)
        if not proceed:
            return
    with h5py.File(filepath, 'w') as f:
        if multiple_gpu and name_of_model is not None:
            layers = self.get_layer(name_of_model)
            layers = layers.layers
            saving.save_weights_to_hdf5_group(f, layers)
        else:
            saving.save_weights_to_hdf5_group(f, self.layers)
        f.flush()
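
The lookup that this patch relies on (the multi-GPU model containing the template as a nested layer, usually named `model_1`) can also be done outside the Keras source. The helper below is a hypothetical sketch, not part of Keras: it assumes the wrapper exposes `.layers` and that the template is the one layer that itself has a `.layers` attribute.

```python
def find_template(parallel_model, name_of_model=None):
    """Return the nested template model inside a multi-GPU wrapper.

    Assumption: the template is the layer that itself has a `.layers`
    attribute. If `name_of_model` is given, the layer name must match.
    """
    for layer in parallel_model.layers:
        is_nested_model = hasattr(layer, "layers")
        if is_nested_model and name_of_model in (None, layer.name):
            return layer
    raise ValueError("no nested model found in %r" % (parallel_model,))
```

With such a helper, `find_template(parallel_model, "model_1").save_weights(filepath)` would save the template's weights without modifying network.py.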

callbacks.py:
class ModelCheckpoint(Callback):
    """Save the model after every epoch.

    `filepath` can contain named formatting options,
    which will be filled with the values of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).

    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
        multiple_gpu: if True, `save_weights` is told that the model was
            returned by `multi_gpu_model` (only used when
            `save_weights_only=True`).
        name_of_model: name of the nested template model (usually `model_1`).
    """

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1,
                 multiple_gpu=False, name_of_model=None):
        super(ModelCheckpoint, self).__init__()
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0
        self.multi_gpu_mode = multiple_gpu
        self.name_of_model = name_of_model

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % (mode),
                          RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch + 1, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch + 1, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            # Pass the multi-GPU options through to the patched save_weights.
                            self.model.save_weights(filepath, overwrite=True,
                                                    multiple_gpu=self.multi_gpu_mode,
                                                    name_of_model=self.name_of_model)
                        else:
                            self.model.save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('\nEpoch %05d: %s did not improve from %0.5f' %
                                  (epoch + 1, self.monitor, self.best))
            else:
                if self.verbose > 0:
                    print('\nEpoch %05d: saving model to %s' % (epoch + 1, filepath))
                if self.save_weights_only:
                    # Same multi-GPU options as in the save_best_only branch.
                    self.model.save_weights(filepath, overwrite=True,
                                            multiple_gpu=self.multi_gpu_mode,
                                            name_of_model=self.name_of_model)
                else:
                    self.model.save(filepath, overwrite=True)
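
Two pieces of this callback's logic can be tried in isolation: how the `filepath` template is filled in, and how the comparison direction is chosen from `mode` and `monitor`. The sketch below is a standalone approximation with hypothetical helper names (`checkpoint_path`, `monitor_op`); it uses `operator` and plain floats in place of the `np.less`/`np.greater` and `np.Inf` used above, and it assumes `mode` is already one of `auto`, `min`, or `max`.

```python
import operator


def checkpoint_path(template, epoch, logs):
    """Fill a ModelCheckpoint-style filepath; `epoch` is zero-based."""
    return template.format(epoch=epoch + 1, **logs)


def monitor_op(monitor, mode="auto"):
    """Return (comparison, initial best) for the monitored quantity."""
    maximize = mode == "max" or (
        mode == "auto"
        and ("acc" in monitor or monitor.startswith("fmeasure")))
    if maximize:
        return operator.gt, float("-inf")
    return operator.lt, float("inf")


print(checkpoint_path("weights.{epoch:02d}-{val_loss:.2f}.hdf5",
                      2, {"val_loss": 0.5}))
# weights.03-0.50.hdf5
```

With the patch applied, the callback itself would presumably be constructed as `ModelCheckpoint(filepath, save_weights_only=True, multiple_gpu=True, name_of_model='model_1')`, since the extra arguments only affect the `save_weights` path.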