With multi_gpu_model, I used the following code (with the TensorFlow backend) to save the weights after each epoch:
```
model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)
model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.save("./UNet.h5")
checkpoint = ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5', save_weights_only=True)
logger = CSVLogger(os.path.join(".", "training.log"))
history = LossHistory()
parallel_model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
parallel_model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid), epochs=n_epoches,
                   batch_size=batch_size, verbose=2,
                   callbacks=[checkpoint, logger, history])

################ in testing.py
model = Unet(...)
model.load_weights('./weights.0001-0.16.hdf5')
```
but it gives an error:
```
Traceback (most recent call last):
  File "testing.py", line 131, in <module>
    model.load_weights('weights.0001-0.16.hdf5')
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2622, in load_weights
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 3115, in load_weights_from_hdf5_group
ValueError: You are trying to load a weight file containing 1 layers into a model with 71 layers.
```
But it works if I just use model = load_model("./UNet.h5"), which I think uses the weights of the final epoch. However, I would like to see the weights of each epoch.
I also tried saving the full model after each epoch, using the following code:
```
model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)
model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.save("./UNet.h5")
checkpoint = ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5')
logger = CSVLogger(os.path.join(".", "training.log"))
history = LossHistory()
parallel_model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy', metrics=['categorical_accuracy'])
parallel_model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid), epochs=n_epoches,
                   batch_size=batch_size, verbose=2,
                   callbacks=[checkpoint, logger, history])

################ in testing.py
model = load_model('./weights.0001-0.16.hdf5')
```
This time, it gives an error:
```
Traceback (most recent call last):
  File "testing.py", line 132, in <module>
    model = load_model("./weights.0001-0.16.hdf5")
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/models.py", line 240, in load_model
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/models.py", line 314, in model_from_config
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/__init__.py", line 55, in deserialize
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 139, in deserialize_keras_object
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2490, in from_config
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/engine/topology.py", line 2476, in process_layer
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/__init__.py", line 55, in deserialize
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 139, in deserialize_keras_object
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/layers/core.py", line 699, in from_config
  File "/workspace/hshu/anaconda3/lib/python3.5/site-packages/Keras-2.0.9-py3.5.egg/keras/utils/generic_utils.py", line 206, in func_load
TypeError: arg 5 (closure) must be None or tuple
```
So, could you please give an example of saving the model/weights after each epoch when using multi_gpu_model?
Thanks.
Dirty working solution:
```
def detachmodel(m):
    """ Detach model trained on GPUs from its encapsulation
    # Arguments
        :param m: obj, keras model
    # Returns
        :return: obj, keras model
    """
    for l in m.layers:
        if l.name == 'model_1':
            return l
    return m
```
Call this function inside the checkpoint callback:
```
import warnings
import numpy as np
from keras.callbacks import Callback


class ModelCheckpointDetached(Callback):
    """ Save the model detached from its multi-GPU encapsulation.

    A (very small) modification of
    https://github.com/fchollet/keras/blob/master/keras/callbacks.py#L331

    `filepath` can contain named formatting options,
    which will be filled with the value of `epoch` and
    keys in `logs` (passed in `on_epoch_end`).
    For example: if `filepath` is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`,
    then the model checkpoints will be saved with the epoch number and
    the validation loss in the filename.

    # Arguments
        filepath: string, path to save the model file.
        monitor: quantity to monitor.
        verbose: verbosity mode, 0 or 1.
        save_best_only: if `save_best_only=True`,
            the latest best model according to
            the quantity monitored will not be overwritten.
        mode: one of {auto, min, max}.
            If `save_best_only=True`, the decision
            to overwrite the current save file is made
            based on either the maximization or the
            minimization of the monitored quantity. For `val_acc`,
            this should be `max`, for `val_loss` this should
            be `min`, etc. In `auto` mode, the direction is
            automatically inferred from the name of the monitored quantity.
        save_weights_only: if True, then only the model's weights will be
            saved (`model.save_weights(filepath)`), else the full model
            is saved (`model.save(filepath)`).
        period: Interval (number of epochs) between checkpoints.
    """

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1):
        super(ModelCheckpointDetached, self).__init__()
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % mode, RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % self.monitor, RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            detachmodel(self.model).save_weights(filepath, overwrite=True)
                        else:
                            detachmodel(self.model).save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch, filepath))
                if self.save_weights_only:
                    detachmodel(self.model).save_weights(filepath, overwrite=True)
                else:
                    detachmodel(self.model).save(filepath, overwrite=True)
```
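For reference, wiring it into the original training script might look roughly like this. This is just a sketch that reuses the Unet(...), data, and hyperparameter placeholders from the question, and it assumes the wrapped model ends up named 'model_1', which is what detachmodel looks for:
```
# Sketch: the detached checkpoint saves the single-GPU model, so the file can
# later be loaded into a plain Unet without the multi-GPU wrapper.
model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy',
                       metrics=['categorical_accuracy'])
checkpoint = ModelCheckpointDetached('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                                     save_weights_only=True)
parallel_model.fit(x=x_train, y=y_train, validation_data=(x_valid, y_valid),
                   epochs=n_epoches, batch_size=batch_size,
                   callbacks=[checkpoint])

# later, in testing.py
model = Unet(...)
model.load_weights('./weights.0001-0.16.hdf5')
```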
@OliPhilip, Thank you, I will try it tomorrow.
Or you could just pass in the template model and redirect all the save calls from self.model to that template model.
```
import warnings
import numpy as np
from keras.callbacks import Callback


class MultiGPUCheckpointCallback(Callback):

    def __init__(self, filepath, base_model, monitor='val_loss', verbose=0,
                 save_best_only=False, save_weights_only=False,
                 mode='auto', period=1):
        super(MultiGPUCheckpointCallback, self).__init__()
        self.base_model = base_model
        self.monitor = monitor
        self.verbose = verbose
        self.filepath = filepath
        self.save_best_only = save_best_only
        self.save_weights_only = save_weights_only
        self.period = period
        self.epochs_since_last_save = 0

        if mode not in ['auto', 'min', 'max']:
            warnings.warn('ModelCheckpoint mode %s is unknown, '
                          'fallback to auto mode.' % (mode),
                          RuntimeWarning)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
            self.best = np.Inf
        elif mode == 'max':
            self.monitor_op = np.greater
            self.best = -np.Inf
        else:
            if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
                self.monitor_op = np.greater
                self.best = -np.Inf
            else:
                self.monitor_op = np.less
                self.best = np.Inf

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch + 1, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch + 1, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.base_model.save_weights(filepath, overwrite=True)
                        else:
                            self.base_model.save(filepath, overwrite=True)
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch + 1, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch + 1, filepath))
                if self.save_weights_only:
                    self.base_model.save_weights(filepath, overwrite=True)
                else:
                    self.base_model.save(filepath, overwrite=True)
```
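A minimal usage sketch, where the key difference from the stock ModelCheckpoint is the extra base_model argument (the Unet and data names below are just the placeholders from the question):
```
# Sketch: the callback saves base_model (the single-GPU template), while
# training still runs on the multi-GPU wrapper.
model = Unet(...)
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy')
checkpoint = MultiGPUCheckpointCallback('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                                        base_model=model, save_weights_only=True)
parallel_model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
                   epochs=n_epoches, batch_size=batch_size, callbacks=[checkpoint])
```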
@kevinfaust0308 this callback works perfectly for me! thanks!
@kevinfaust0308 Thank you! works like a charm.
@OliPhilip What does it mean that we only save the weights from 'model_1'? Are these the weights from all the parallel models? Thanks!
The correct way is to make the callbacks inside the GPU model use the model that was compiled on the CPU:
```
modelGPU = multi_gpu_model(modelCPU, gpus=NUM_OF_GPUS)
modelGPU.__setattr__('callback_model', modelCPU)
modelGPU.fit_generator(...)
```
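For completeness, a slightly fuller sketch of this approach (modelCPU, NUM_OF_GPUS and the generator names are placeholders; it relies on Keras picking up the callback_model attribute so that the stock ModelCheckpoint sees the CPU model):
```
# Sketch: with callback_model pointing at the template, the standard
# ModelCheckpoint saves the single-GPU weights instead of the wrapper's.
modelGPU = multi_gpu_model(modelCPU, gpus=NUM_OF_GPUS)
modelGPU.__setattr__('callback_model', modelCPU)
modelGPU.compile(optimizer=Adam(lr=2e-4), loss='categorical_crossentropy')
modelGPU.fit_generator(train_generator,
                       validation_data=valid_generator,
                       epochs=n_epoches,
                       callbacks=[ModelCheckpoint('./weights.{epoch:04d}-{val_loss:.2f}.hdf5',
                                                  save_weights_only=True)])
```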
@kevinfaust0308 Do you mean to use it as checkpoint = MultiGPUCheckpointCallback('./weights.{epoch:04d}-{val_loss:.2f}.hdf5', save_weights_only=True)?
@OliPhilip I had the same idea of extracting the base model. I found the layer names unreliable in my case, so I checked for the Model class instead. Of course, this isn't going to work if you have multiple nested models.
```
from keras.engine.training import Model
...

def extract_base_model(m):
    """ Extract the base model if it is encapsulated.
    # Arguments
        :param m: obj, keras model
    # Returns
        :return: obj, keras model
    """
    for l in m.layers:
        if isinstance(l, Model):
            return l
    return m
```
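In the detached checkpoint above, the save calls would then become something like this (just a sketch swapping extract_base_model in for detachmodel):
```
# Sketch: isinstance-based extraction instead of matching on the layer name.
if self.save_weights_only:
    extract_base_model(self.model).save_weights(filepath, overwrite=True)
else:
    extract_base_model(self.model).save(filepath, overwrite=True)
```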
@stavBodik The only problem with this is that all other callbacks will also use the "template" model, which may not be desirable (something about unintended side effects). It's too bad Keras didn't provision for an arg to the underlying Callback class that would allow specification of the model on a per-callback basis, but I think I might try to submit a change to facilitate that, if not for the base class then at least for ModelCheckpoint.
@evictor What is the status of this issue these days?
Using the latest Keras version, I'm able to save the multi_gpu_model's weights. I can then successfully load them with load_weights into a similar multi-GPU model. But I'm struggling to transform these "multi-GPU weights" into "single GPU" or "CPU" weights. What I have tried looks like this:
```
# set up my basic model
model.compile(...)
model_to_write = model

# write it out to reload it easily after the fitting (I like YAML)
open('model.yaml', 'w').write(model.to_yaml())

# setting it up for multiple GPUs
model = keras.utils.multi_gpu_model(model_to_write, gpus=4, cpu_relocation=True)
model.fit(..., callbacks=[keras.callbacks.ModelCheckpoint(
    filepath='multi_weights.h5',
    save_best_only=True,
    save_weights_only=True,
    verbose=1)])

# reload the best weights of the multi_gpu_model
model.load_weights('multi_weights.h5')

# invoke save_weights on the original model
model_to_write.save_weights('weights.h5')
```
```
[...]
Epoch 00034: val_loss improved from -0.93578 to -0.93615, saving model to multi_weights.h5
[...]
Epoch 00084: val_loss did not improve from -0.93615
Epoch 00084: early stopping
```
After doing that, I tried, in another python session, to reload the weights and recompute the loss on my validation set.
```
model = keras.models.model_from_yaml(open('model.yaml', 'r'))
model.load_weights('weights.h5')
y_pred = model.predict(x_validation)
print(K.eval(loss(y_pred, y_validation)))
```
This gave me very bad results, so I think calling the save_weights method on the original model did not work. I also tried model.load_weights('multi_weights.h5', by_name=True), which didn't raise errors but resulted in the same (awful) loss values.
However, I could find the loss of my "best" model this way:
```
multi_model = multi_gpu_model(model, gpus=4, cpu_relocation=True)
multi_model.load_weights('multi_weights.h5')
y_pred = multi_model.predict(x_validation)
print(K.eval(loss(y_pred, y_validation)))
```
In fact, I even had slightly better values for the loss than in the standard output of the fitting script. I guess this is due to numerical quirks…
To sum up, I trained a model on 4 GPUs and it works great, but I cannot figure out how to use the weights I saved once I have only 0 or 1 GPU available…
Why doesn't my way of reloading the best weights into the multi-GPU model and then saving the original model work? Should I use @OliPhilip's dirty solution? Any help is appreciated…
@truenicoco I made a little adapter to save an alternative model: https://github.com/TextpertAi/alt-model-checkpoint
I’m not sure I see the problem with your script but you are doing something weird loading weights after training like that. You shouldn’t have to reload weights in the script there; both the multi GPU model and your template model should have the latest weights after that fit call.
I think my alt model checkpoint adapter will suit you though because it will let you save the template model weights directly.
Hope this helps, on a mobile phone right now so can’t do a lot
Edit, blog post for more info: https://www.textpert.ai/aime-blog/saving-multi-gpu-models-with-keras-modelcheckpoint
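Roughly, usage looks like the sketch below; please double-check the import path and constructor signature against the README for the version you install, and note that the Unet/data names are just placeholders from earlier in this thread:
```
# Sketch based on the alt-model-checkpoint README; the base (template) model is
# passed alongside the filepath so its weights are what get checkpointed.
from alt_model_checkpoint import AltModelCheckpoint
from keras.utils import multi_gpu_model

base_model = Unet(...)  # placeholder for the single-GPU template model
gpu_model = multi_gpu_model(base_model, gpus=4)
gpu_model.compile(optimizer='adam', loss='categorical_crossentropy')
gpu_model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
              callbacks=[AltModelCheckpoint('best_base_model.h5', base_model,
                                            save_best_only=True,
                                            save_weights_only=True)])
```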
I should have mentioned it, but the reason I am reloading the weights is that I use the save_best_only option, which I believe is critical in my case to avoid overfitting. Thank you for the link and for the package; even if I don't end up using it, it will definitely be useful.
Ah, I see. Well yes, alt-model-checkpoint should help you with this scenario—it will let the template model's best weights be saved which should then load in fine. I _think_ it should be a drop-in fix for you.
It's surprising to me that your ad hoc predictions were way off even though weight loading worked; I think usually if weights were messed up you would see errors during loading, leading me to believe there might be something else at play, like incorrect data passed to predict().