I saved the model and weights after each epoch using callbacks.ModelCheckpoint. I want to train it again from the last epoch.
How do I set up the model.fit() call so that training starts from the previous epoch?
Do you want to do something special with the history? If not, you can just call .fit one or several times and you will be able to continue training the model. If you want to continue the training in another process, you just have to load the weights and call model.fit().
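For example, a minimal sketch of resuming in a new process (build_model(), weights.h5 and the training data are placeholders for your own setup):

# Sketch: rebuild and compile the model the same way as before, then load
# the saved weights and simply call fit again to keep training.
model = build_model()                       # hypothetical model-building function
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.load_weights('weights.h5')            # checkpoint from the previous run
model.fit(x_train, y_train, epochs=10)      # training continues from the loaded weights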
When I call model.fit() after loading the model and weights, it shows epoch = 1. If I stop the training at epoch 100, I want to resume the training at epoch 101.
I think it doesn't matter whether it SHOWS the training at epoch = 1 or epoch = 101.
As far as I know, the model file itself doesn't store the epoch number.
If you have loaded the correct previous model (the model should have been saved with the epoch number), there should be no problem continuing your training process.
Thank you.
@ymcui is right, the epoch label is only a name for the iterations in the current fit. Sorry, when I said history I meant the history dictionary the fit method returns. I think #1868 is basically the same question. If you think it resolves your problem, please close the issue!
But there is a problem with this approach. What about hyperparameters that change according to the epoch, say a learning rate with decay? Just restarting with the fit method doesn't take that into account.
Yeah, this happens to me when I resume the training process by loading weights.
I was training ResNet-18 on the ImageNet dataset; the model saved the weights at the 1st epoch with lr=0.1 at the beginning. I stopped it, then tried the resume functionality, and it turns out that the model starts with the same lr=0.1, and the loss increases at each iteration. To set the lr to the state of the 1st epoch, I changed the lr according to the SGD lr update function, lr = lr * (1. / (1 + decay * iterations)); however, it didn't work: the loss still increases, but more slowly than with lr=0.1. Probably I should lower the lr further, but I don't understand why the loss still increases even when the lr is set accordingly.
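For reference, roughly what I tried looked like this (only a sketch; lr0, decay and iterations_done stand for the values from the interrupted run):

from keras import backend as K

# Sketch: recompute the decayed lr from SGD's formula lr = lr0 / (1 + decay * iterations)
# and push it into the optimizer before resuming training.
lr_resumed = lr0 * (1. / (1. + decay * iterations_done))
K.set_value(model.optimizer.lr, lr_resumed)
model.fit(x_train, y_train, epochs=remaining_epochs)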
Try the initial_epoch argument of the .fit method.
Using initial_epoch didn't work in this case.
But there is a problem with this approach. What about hyperparameters that change according to the epoch, say a learning rate with decay? Just restarting with the fit method doesn't take that into account.
Setting the initial_epoch in fit_generator is not enough to solve this problem when using the ReduceLROnPlateau callback, because there's no way for the callback to know what the learning rate should be without the history of the previous (i.e. before resuming training) epochs. Perhaps the callback constructor should have an optional history parameter that can be used to correctly initialize the learning rate and the wait variable (see https://github.com/fchollet/keras/blob/ab3b93e8dd103f1d9729305825791a084c7c8493/keras/callbacks.py#L744).
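One possible workaround without changing Keras itself is a small subclass that re-seeds the callback's state after its reset. This is only a sketch and assumes the non-public internal attributes best and wait (true for the current ReduceLROnPlateau implementation, but not a stable API):

from keras.callbacks import ReduceLROnPlateau

class ResumableReduceLROnPlateau(ReduceLROnPlateau):
    """Sketch: restore the monitored metric's best value and the patience
    counter saved from a previous run, so the schedule does not restart
    from scratch when training resumes."""
    def __init__(self, previous_best=None, previous_wait=0, **kwargs):
        super(ResumableReduceLROnPlateau, self).__init__(**kwargs)
        self.previous_best = previous_best
        self.previous_wait = previous_wait

    def on_train_begin(self, logs=None):
        # The parent resets its internal state here; re-apply the saved state.
        super(ResumableReduceLROnPlateau, self).on_train_begin(logs)
        if self.previous_best is not None:
            self.best = self.previous_best
            self.wait = self.previous_wait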
Besides using the initial_epoch argument of fit, I rewrote the History callback:
from keras.callbacks import Callback

class History(Callback):
    """Callback that records events into a `History` object.

    This callback is automatically applied to every Keras model.
    The `History` object gets returned by the `fit` method of models.
    """

    def on_train_begin(self, logs=None):
        # Only initialize on the first run so that subsequent fit calls
        # append to the existing history instead of overwriting it.
        if not hasattr(self, 'epoch'):
            self.epoch = []
            self.history = {}

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epoch.append(epoch)
        for k, v in logs.items():
            self.history.setdefault(k, []).append(v)
This allows using the same callback, and it just appends to the end. @fchollet, should I post a pull request for this? It seems to me that this is more useful than the current behaviour of overwriting the logs in on_train_begin.
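Usage would then look roughly like this (a sketch; keep a single instance around and pass it to each fit call so the logs accumulate):

history_cb = History()
model.fit(x_train, y_train, epochs=5, callbacks=[history_cb])
# ...later, possibly after reloading weights; note that `epochs` counts up
# to the final epoch, so epochs=10 with initial_epoch=5 trains five more...
model.fit(x_train, y_train, epochs=10, initial_epoch=5, callbacks=[history_cb])
# history_cb.epoch and history_cb.history now cover both runs, appended in order
print(history_cb.epoch)
print(history_cb.history['loss'])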
@MartinThoma, one would probably need to replace this line with

if initial_epoch == 0:
    self.history = cbks.History()

to make your suggestion work, right? I've tried to make this stuff work, and eventually got the feeling that too many different things would have to change, see #6697. What do you think?
If you want to resume from epoch 101, simply use initial_epoch=101 in model.fit(). From the documentation:
initial_epoch: Epoch at which to start training (useful for resuming a previous training run)
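A minimal sketch of how I'd wire that up (TOTAL_EPOCHS and RESUME_EPOCH are placeholders, and the weights are assumed to have been reloaded already):

# Sketch: `initial_epoch` tells fit where the epoch counter continues from.
# `epochs` is the final epoch to reach, not the number of additional epochs.
model.fit(x_train, y_train,
          epochs=TOTAL_EPOCHS,           # e.g. 200
          initial_epoch=RESUME_EPOCH)    # e.g. 100 completed epochs so far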
It seems that TensorFlow Estimators also support resuming training: "Since the state of the model is persisted (in model_dir=PATH above), the model will improve the more iterations you train it, until it settles."
Related question: what happens to all the gradient computations that rely on a history of the gradients (when momentum is present, such as in Adam and most gradient-descent variants)? Does the checkpoint store these as well? Thanks!
@bupedroni: As far as I know, every time I loaded the existing model, all the hyperparameters were reset to default values.
The best way to resume is to write a custom callback that stores all the hyperparameters, and then start the training as mentioned by @MartinThoma.
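A rough sketch of such a callback (the class and file names are made up; it just dumps the epoch number and current learning rate to a JSON file next to the weights so a resumed run can restore them):

import json
from keras import backend as K
from keras.callbacks import Callback

class HyperparameterCheckpoint(Callback):
    """Sketch: persist the epoch number and current learning rate so a
    resumed run can restore them before calling fit again."""
    def __init__(self, path='training_state.json'):
        super(HyperparameterCheckpoint, self).__init__()
        self.path = path

    def on_epoch_end(self, epoch, logs=None):
        state = {
            'epoch': epoch,
            'lr': float(K.get_value(self.model.optimizer.lr)),
        }
        with open(self.path, 'w') as f:
            json.dump(state, f)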
@MartinThoma I'd like a pull request implementing that. Basically, I'm training a model, but if I notice that the metrics haven't diverged I'd like to train for another x epochs, and also be able to plot the overall history in an additive way.
For now I'm just accumulating histories like this: https://www.kaggle.com/morenoh149/keras-continue-training
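The accumulation itself is only a few lines; a sketch of the idea (merging each returned History.history dict into one running dict):

# Sketch: accumulate the history dicts returned by successive fit calls
# so the whole run can be plotted in one go.
full_history = {}
for _ in range(4):
    h = model.fit(x_train, y_train, epochs=5)
    for key, values in h.history.items():
        full_history.setdefault(key, []).extend(values)
# full_history['loss'] now covers all 20 epochs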
Still having this issue... any update on it?
anything new here?
just port your code to pytorch :laughing:
Ya. That actually worked for me. 2 years and counting.
I think it doesn't matter whether it SHOWS the training at epoch = 1 or epoch = 101.
As far as I know, the model file itself doesn't store the epoch number.
If you have loaded the correct previous model (the model should have been saved with the epoch number), there should be no problem continuing your training process.
So does that mean that if I call
model.fit(epochs=20)
and
model.fit(epochs=5)
model.fit(epochs=5)
model.fit(epochs=5)
model.fit(epochs=5)
they are both the same?
Yes, they are equivalent. At least, that is what I found using the TensorFlow Keras API in TensorFlow 2.0.
How can I get the epoch at which the model was saved by ModelCheckpoint?
Save the epoch number in the name of the checkpoint file, and fetch that number with a regex when resuming training.
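For instance, a sketch of that approach (the filepath pattern and the regex are just one possible convention; NUM_EPOCHS and the training data are placeholders):

import re
import glob
from keras.callbacks import ModelCheckpoint

# Save the epoch number in the checkpoint filename...
checkpoint_cb = ModelCheckpoint(filepath='weights.{epoch:03d}.h5',
                                save_weights_only=True)

# ...and recover it with a regex when resuming.
checkpoints = sorted(glob.glob('weights.*.h5'))
if checkpoints:
    last = checkpoints[-1]
    initial_epoch = int(re.search(r'weights\.(\d+)\.h5', last).group(1))
    model.load_weights(last)
else:
    initial_epoch = 0

model.fit(x_train, y_train, epochs=NUM_EPOCHS,
          initial_epoch=initial_epoch, callbacks=[checkpoint_cb])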
I managed to do this with an optimizer whose learning rate depends on the number of iterations, e.g. Adam.
Here is the pseudo-code:
...
if os.path.isfile(checkpoint_path + ".index"):
    # This loads `(root).optimizer.iter` from the checkpoint
    model.load_weights(checkpoint_path)

# Recover the iterations from the optimizer and convert them to epochs
initial_epoch = model.optimizer.iterations.numpy() // STEPS_PER_EPOCH

callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path,
                                              save_weights_only=True)
model.fit(train_data, epochs=NUM_EPOCHS, initial_epoch=initial_epoch,
          callbacks=[callback])
Hope this helps :-)
I got tired of this so I ended up writing a Keras wrapper that autosaves and restores the epoch number, training history, and model weights:
pip install keras-buoy
Link to Github project
Let me know what you think. PRs more than welcome.
@dorukkarinca is this handled in TensorFlow 2? That's supposed to supersede Keras.
@morenoh149 not to the best of my knowledge. This wrapper wraps tensorflow.keras anyway.