I work at an institute where running a workstation overnight is not allowed, hence I had to split the training process across multiple days. I trained a model for 10 epochs, which took approximately 1 day, and saved the model + weights using the methods described in the Keras documentation, like this:
modelPath = './SegmentationModels/'
modelName = 'Arch_1_10'
sys.setrecursionlimit(10000)
json_string = model.to_json()
open(str(modelPath + modelName + '.json'), 'w').write(json_string)
model.save_weights(str(modelPath + modelName + '.h5'))
import cPickle as pickle
with open(str(modelPath + modelName + '_hist.pckl'), 'wb') as f:
    pickle.dump(history.history, f, -1)
and load the model the next day like this:
modelPath = './SegmentationModels/'
modelName = 'Arch_1_10'
model = model_from_json(open(str(modelPath + modelName + '.json')).read())
model.compile(loss='categorical_crossentropy', optimizer=optim_sgd)
model.load_weights(str(modelPath + modelName + '.h5'))
# import cPickle as pickle
# with open(str(modelPath + modelName + '_hist.pckl'), 'r') as f:
# history = pickle.load(f)
model.summary()
but when I restarted the training process, it initialized to the same training and validation loss I had gotten on the 1st epoch the previous day! It should have started with an accuracy of 60%, which was the last best accuracy I got the previous day, but it doesn't.
I have also tried to call model.compile() before and after load_weights, as well as leaving it out altogether, but that doesn't work either.
Please help me in this regard. Thanks in advance.
Does it work when you construct the model with the original code instead of loading it from json?
Nope. It doesn't. Still starts with 20% accuracy as it did on the 1st epoch.
Did the weights file already exist before you tried to save them?
It did, but now I have tried using the ModelCheckpoint callback, which saves a weights file for each epoch. In my case the last weights file, for epoch 70, was created (it was not present before). I tried loading it into the model loaded i) from JSON and ii) using the original code, but still no luck.
It did
That's it: save_weights() doesn't overwrite existing files unless you also pass overwrite=True. It should have asked for user input, though.
Actually, sorry for my last comment: all the architectures and all the weights I save have unique names, and yes, I know save_weights() asks for user input when overwriting a file, but in my case it doesn't since the files do not exist. So we can safely rule out the possibility that the file was not overwritten.
You can see the weights saved after every epoch. When I try to load these weights the training still restarts from where it started initially.
Here's my full loadModel() function:
# optimizers
optim_sgd = keras.optimizers.SGD(lr=0.01, momentum=0.9, decay=0.002, nesterov=True)
optim_adadelta = keras.optimizers.Adadelta()
optim_adagrad = keras.optimizers.Adagrad()
optim_adam = keras.optimizers.Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
imageSize = (19, 19)
img_rows, img_cols = imageSize[0], imageSize[1]
batch_size = 200
# number of convolutional filters to use
nb_filters = 32
# size of pooling area for max pooling
nb_pool = 2
# convolution kernel size
nb_conv = 3
nb_epoch = 1000
# callbacks
def scheduler(epoch):
    if epoch % 10 == 0 and epoch != 0:
        x = float(input("Enter a learning rate (Current: {}): ".format(model.optimizer.lr.get_value())))
        model.optimizer.lr.set_value(x)
        print("Changed learning rate to: {}".format(model.optimizer.lr.get_value()))
    return model.optimizer.lr.get_value()
change_lr = oc.LearningRateScheduler(scheduler)
early_stop = oc.StopEarly(10)
plot_history = oc.PlotHistory()
# # Load the model
modelPath = './SegmentationModels/'
modelName = 'Arch_1_40'
model = model_from_json(open(str(modelPath + modelName + '.json')).read())
model.compile(loss='categorical_crossentropy', optimizer=optim_sgd)
model.load_weights(str(modelPath + 'weights.70-0.74.hdf5'))
# import cPickle as pickle
# with open(str(modelPath + modelName + '_hist.pckl'), 'r') as f:
# history = pickle.load(f)
model.summary()
That's strange... Replace this line with if 1: (nope, don't) and try to load the weights again.
I found out that I was using an older version of Keras. I upgraded and found that model_summary() is no longer there; delving deeper, I found it has been changed to print_summary().
Anyway, I tried changing the line of code you asked about, but that didn't work either.
UPDATE: Came to the institute this morning, built the model using original code and loaded the model weights saved using ModelCheckpoint callback. Started training and it still restarts from the beginning; no memory of past metrics. The performance is actually even worse than it was earlier when it started training the first epoch. In my case, normally the network starts at 20% accuracy and goes to around 70% in 60 epochs. But when I restart the training process using loaded weights, the network starts at 20% on epoch 1 and keeps going lower and lower until 16% at epoch 5. I have no idea what's happening here.
UPDATE 2: When I try to evaluate the loaded model + weights on the same validation data, I get 60% accuracy, as intended. But if I do model.fit(), training starts from 20% and oscillates around it. So I can confirm that the weights are being loaded correctly, since the model can make predictions, but the model is not able to retrain.
Please help! @NasenSpray
So what model do you have precisely? Perhaps some weights aren't actually saved or loaded at all (like the states in a LSTM or something)? Or perhaps they are accidentally shuffled (flipped dimensions or whatever) somehow.
EDIT: Check https://github.com/fchollet/keras/issues/2378#issuecomment-211910392
Grasping at straws here, but some optimizers are stateful, right? Are you just using SGD? I'm not familiar with this part of Keras, but perhaps the optimizers should be saved as well, because otherwise, when you reinitiate learning and start a new epoch with pretrained weights instead of your original weight initialization, training might diverge due to high learning rates.
Run this plz
import numpy as np

model = make_model()
w1 = model.get_weights()
model.load_weights('your_saved_weights.h5')
w2 = model.get_weights()
for a, b in zip(w1, w2):
    if np.all(a == b):
        print "wtf is happening"
Does it print?
Doesn't print. The weights are loaded successfully, I suppose; it's the training procedure that's problematic. After running this script (it didn't print anything), I ran model.fit() and it started with a loss 10x higher than it originally was at epoch 1, and with 20% accuracy again _sigh_
Obviously something must be different as you're seeing different results. Perhaps get_weights() doesn't actually return everything it could.
I'm curious if you have the same problem just by restarting training in the same session, nevermind loading a model and its weights with Keras' builtins. If not, consider saving states with something like this instead.
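For illustration, a minimal sketch of that idea with the shelve module (filenames and keys are made up, and it assumes the model object is picklable with your backend):
import shelve

db = shelve.open('training_state.db')    # persist the whole (picklable) model between sessions
db['model'] = model                      # compiled model, optimizer state included
db['history'] = history.history
db.close()

# next day
db = shelve.open('training_state.db')
model = db['model']
db.close()
model.fit(X_train, y_train, nb_epoch=10)  # ideally continues where it stopped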
Thanks. When I restart an interrupted training process, the training continues from where it left off successfully. The problem is when I load the model and weights.
My main aim is to save "snapshots" or "states" of the model that can be loaded back and used as a starting point when training the next day. I'll have a look at the shelve module too, thanks! But I think the problem with Keras must be debugged as well.
Please guide me on how I can help you reproduce this issue so you can fix it sometime in the future. I would love to help.
In your loadModel(), hardcode the learning rate to 0. Does it make the loaded model better?
- and -
Instead of training, just evaluate the loaded model on the training set. Still worse than before?
I'll try your suggestion as soon as I get to the institute tomorrow.
I could not try validating on the training set for some reason, but I solved my problem by pickling the model after training it for the day. I restarted my IPython Notebook kernel, loaded the pickled model, and restarted the training process. Fortunately, it started from where it left off.
I will also try your suggestion and report back what I got.
Dang! That clearly means some states are not saved properly, whether they are weights or something else.
I assume the intended use case for having load and save functions in Keras has more to do with being able to share pretrained models like people do with Caffe, rather than it being for pausing your own training, in which pickling is probably safer.
I do wonder though if it wouldn't be easier to just scratch the manual parsing of states which is bug-prone and simply have everything rely on Python's builtin object serialization with pickle, shelve or similar. Keras' builtins are pretty meaty though so I'm probably missing something important in why they're needed.
I could do a pr with shelve for save_model(...), load_model(...), save_weights(...), load_weights(...) if it is of interest @fchollet.
From what @carlthome said here, you could try to take a snapshot of the optimizer too.
I have 2 functions working to serialize the model and the optimizer as in the pre-1.0 release. Note that I return a dictionary instead of a JSON dump. It's basically something really similar to the old functionality.
You could try them and let me know if it's working (I didn't have the time to really test them extensively):
import six

def get_function_name(o):
    """Utility function to return the model's name."""
    if isinstance(o, six.string_types):
        return o
    else:
        return o.__name__

def to_dict_w_opt(model):
    """Serialize a model and add the config of the optimizer and the loss."""
    config = dict()
    config_m = model.get_config()
    config['config'] = {
        'class_name': model.__class__.__name__,
        'config': config_m,
    }
    if hasattr(model, 'optimizer'):
        config['optimizer'] = model.optimizer.get_config()
    if hasattr(model, 'loss'):
        if isinstance(model.loss, dict):
            config['loss'] = dict([(k, get_function_name(v))
                                   for k, v in model.loss.items()])
        else:
            config['loss'] = get_function_name(model.loss)
    return config
def model_from_dict_w_opt(model_dict, custom_objects=None):
    """Builds a model from a model serialized using `to_dict_w_opt`."""
    if custom_objects is None:
        custom_objects = {}
    model = layer_from_config(model_dict['config'],
                              custom_objects=custom_objects)
    if 'optimizer' in model_dict:
        model_name = model_dict['config'].get('class_name')
        # if it has an optimizer, the model is assumed to be compiled
        loss = model_dict.get('loss')
        # if a custom loss function is passed, replace it in loss
        if model_name == "Graph":
            for l in loss:
                for c in custom_objects:
                    if loss[l] == c:
                        loss[l] = custom_objects[c]
        elif model_name == "Sequential" and loss in custom_objects:
            loss = custom_objects[loss]
        optimizer_params = dict([(k, v) for k, v in
                                 model_dict.get('optimizer').items()])
        optimizer_name = optimizer_params.pop('name')
        optimizer = optimizers.get(optimizer_name, optimizer_params)
        if model_name == "Sequential":
            sample_weight_mode = model_dict.get('sample_weight_mode')
            model.compile(loss=loss,
                          optimizer=optimizer,
                          sample_weight_mode=sample_weight_mode)
        elif model_name == "Graph":
            sample_weight_modes = model_dict.get('sample_weight_modes', None)
            loss_weights = model_dict.get('loss_weights', None)
            model.compile(loss=loss,
                          optimizer=optimizer,
                          sample_weight_modes=sample_weight_modes,
                          loss_weights=loss_weights)
    return model
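A rough usage sketch of these two helpers (untested; the filenames are placeholders, and the weights still need to be saved separately since the dict only holds configs):
import cPickle as pickle

config = to_dict_w_opt(model)                        # architecture + optimizer + loss config
with open('arch_optim.pckl', 'wb') as f:
    pickle.dump(config, f, -1)
model.save_weights('arch_optim_weights.h5', overwrite=True)

# later session
with open('arch_optim.pckl', 'rb') as f:
    config = pickle.load(f)
restored = model_from_dict_w_opt(config)             # comes back compiled
restored.load_weights('arch_optim_weights.h5')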
@carlthome, if this solution is OK, we could work on a PR that includes this functionality and the other relevant elements (weights, states, ...)?
It should be possible to include all of this in a HDF5 file.
@tboquet, cool! Sounds good to me! I'm no authority on Keras but I would probably have based loading/saving around object serialization of Model() and Sequential() just to be safe. In the future, new things will probably be stateful which will screw up things again. The slight additional overhead of saving too much is worth the extra stability and reduced code complexity, in my mind.
This is what I am using (took from keras docs) and it works without a problem on Keras 1.0:
def load_model():
    model = model_from_json(open('model.json').read())
    model.load_weights('weights.h5')
    model.compile(optimizer=rmsprop, loss='mse')
    return model

def save_model(model):
    json_string = model.to_json()
    open('model.json', 'w').write(json_string)
    model.save_weights('weights.h5', overwrite=True)
I had one example with, say, 10 epochs, and another example with save and load in a loop of 10 iterations of 1 epoch each, and the loss for both was similarly decreasing. Additionally, both resulting models were predicting fine.
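The second experiment was roughly this (a sketch with placeholder data variables, using the load_model/save_model helpers above):
save_model(model)                          # initial save of the freshly compiled model
for i in range(10):
    model = load_model()                   # rebuild from JSON + weights each iteration
    model.fit(X_train, y_train, nb_epoch=1, batch_size=32)
    save_model(model)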
Have you tried to call model.load_weights before model.compile?
Thank you for your suggestions, everyone. I will try them again and report back what I find. If the method described in the official Keras documentation works for everyone, it should for me too. I will dig a little deeper and find out if it's something I am doing wrong.
I ran into a similar problem today. It really seems like it could be the optimizer that needs to be saved/loaded too, aside from the weights.
Basically anything like this seems to go bonkers (in my case loss='mse' and optimizer='rmsprop'):
# Starting fresh, training for a while and saving the weights to file.
model = create_model()
model.compile(...)
model.fit(...)
model.save_weights(...)
# Creating the model again, but loading the previous weights and resuming training.
model = create_model()
model.compile(...)
model.load_weights(...)
model.fit(...) # Diverges!
The data is the same in both fit calls.
@carlthome Had the same problem. Didn't check recently for the current status but now I use vanilla cPickle to pickle my trained model. Loading the pickled model and resuming training seems to be working just as expected. However I'm not sure about the JSON + h5 weight saving/loading functionality. If you are having the same problem then there must be something wrong.
@carlthome: RMSprop makes _really_ shitty updates during the first couple of steps which easily wreck pre-trained models. Could you retry with plain SGD?
I also encountered this problem training a 2-layer LSTM with one dense layer at the end. Testing showed the following:
-Compiling two identical models in the same script, training the first model and then loading the weights in the second model via save_weights and load_weights worked as it should even if the two models had separate optimizer instances. If I did this and then started training with the second model its training loss was the same as the training loss of the first model when the weights were saved, as expected.
-However, once Python was closed and reopened loading weights saved in the previous instance resulted in, if anything, a -worse- loss at the start of training than the untrained model, though it quickly learned again.
-I'm not sure if the optimizer is at fault, because I've tried saving the weights from a model, reloading them and then testing predictions without any further training. If the two models were compiled in the same session it works fine, but if I close the session, start a new session, compile a new model and load the previous session's weights then its predictions are garbage.
Also, I'm using the Theano backend and training on Windows with CUDA, which is probably a weird use-case. Not sure what backend/OS the other people with this problem are using.
Wait, I've been -extremely- stupid, please ignore.
For the interested, I was making a character-prediction RNN with a one-hot character encoding, but instead of pickling the map of characters to one-hot indices I was generating it in the code each time from a set of allowed characters using enumerate(). This of course meant that the mapping generated by enumerate() was different every time, because sets have no guaranteed order, which explains why everything worked fine until I restarted the script (and so regenerated the mapping).
This is embarrassingly obvious in retrospect.
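For anyone hitting the same thing, a minimal sketch of the fix (filenames made up) is to sort the characters and/or persist the mapping once, then reload it in later sessions instead of regenerating it from a set:
import cPickle as pickle

char_to_index = {c: i for i, c in enumerate(sorted(set(text)))}  # sorted -> stable order
with open('char_map.pckl', 'wb') as f:
    pickle.dump(char_to_index, f, -1)

# in a new session
with open('char_map.pckl', 'rb') as f:
    char_to_index = pickle.load(f)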
I'm having this same issue using adagrad. After hours of training, when I load weights and resume, my MSE goes back up to where it started on the first epoch (the first epoch ever, where it was using the initial random weights).
What's the disadvantage of using vanilla cPickle instead of the save_weights and to_json (which don't seem to work unless you're using SGD)?
Heyho. I'm new to Deep Learning and Keras and ran into the same / a similar issue. I trained my model with SGD for some time and saved the weights after each epoch using the save_weights()
function. When I load weights from a particular epoch and I use SGD again, everything is fine (evaluation metrics are still good).
Additionally, I tried to use my already learned weights but use a different optimizer for further training. When choosing Adam, Adagrad or RMSprop the evaluation metrics dropped and it looked like as if the learning started from scratch.
How can this happen? Why is everything fine, when I use SGD again - _even with changed learning rate_ - but not when using a different Optimizer?
Thanks for your help!
EDIT:
@carlthome: RMSprop makes really shitty updates during the first couple of steps which easily wreck pre-trained models. Could you retry with plain SGD?
@NasenSpray Hmm. Could this be my problem? As far as I know all my chosen optimizers are related to RMSprop. Could they all 'destroy' my already learned weights and affect the performance in such negative way?
It is generally _not_ advisable to retrain a pretrained model with an altogether different optimizer than the one it was trained with. This just doesn't make any sense. My question is: do you have a valid reason for this setting, where you want to train a pretrained network using a different optimizer like RMSProp or Adam?
Does this also apply to partially pretrained models? For example if you have a network with 5 convolutional layers and you take the weights for the first 3 layers from a pretrained network (_transfer learning_) and set trainable=False
for those layers?
Concerning your question: As I wrote, I'm new to Keras and Deep Learning. I'm trying to get a feel for different techniques, so I'm playing around a bit and observing the resulting effects and trying to understand the behaviour.
Sorry if I'm bumping an old thread - is this resolved for you folk?
Just hit this myself. I think the confusion is model.load_weights() only loads the model weights, it does not load any of the intermediate state of the optimizer. If you want to completely reload a model the only option appears to be load_model().
I would propose closing this issue as works-as-designed and updating the FAQ to make this a little bit more clear: if you want to resume training, your best option is load_model().
load_model isn't documented here: https://keras.io/models/about-keras-models/
But it is documented here: https://keras.io/getting-started/faq/
I have the same problem on keras 1.2.0.
It was fixed on 1.2.1.
is this fixed?
I've been using the latest version of Keras. I can confirm this problem is not fixed, even with model.save.
Any update?
model.save_weights() saves only the weights of the model. Instead, try using model.save() and load_model() to save and reload the model respectively, which saves the entire model state.
model.save("model.h5", overwrite=True)
# ...
model = load_model("model.h5")  # when reloading the model
@shalabhsingh does not help
Can someone with this issue please provide a complete and minimal example that reproduces the issue? There are tests in place to check that this does not happen, so we need to understand what is different from those tests to nail it down. Try to use a dataset from Keras, so we can all easily reproduce it.
Thanks!
@Rocketknight1
Thanks, your posts made me aware I was doing the same thing. A lot of people might have this issue because the referenced code gets exactly this wrong. This code section in RNN_utils.py
data = open(DATA_DIR, 'r').read()
chars = list(set(data))
VOCAB_SIZE = len(chars)
should be something like
data = open(data_dir, 'r').read()
chars = list(set(data))
chars.sort() # SORT THE CHARS so mapping is the same even when restarting the script!
VOCAB_SIZE = len(chars)
instead so that the char mapping is always the same when reading the same file in a new session.
Ran into the same issue. Is it sorted folks?
I saved a model with model.save(mdl), and load it with the following code, which works great.
if mdl is None:
    model = Sequential()
    model.add(Dense(256, input_dim=n_input, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(n_classes, activation='softmax'))
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
else:
    model = load_model(mdl)

# Then train it in the usual way
model.fit_generator(...)
@lionlai1989 Can you verify saving it from one session and loading it into a different session? Or even loading it and making predictions on a different machine entirely? model.save and load_model do not work for me. If I load the model, the accuracy goes back to as if it had never been trained.
Got the same issue. Fuck it, it's not solved. Spent 18 hours training a DenseNet on AWS to get to 89% accuracy on CIFAR-10; the connection was interrupted, but I thought I was safe because I had my model saved every 30 epochs. The truth is that it works for model.test(), but when I try model.fit(), it breaks and reverts to 10% accuracy when it was at 89%. I've lost 1 day of work due to this shitty issue.
@EricAlcaide Would you mind providing your code? Because at least you can use it to test while my model can't do anything at all. Did you try using it to make predictions?
So as far as I understand it, it's still not possible to load the state of the optimizer other than by using load_model()?
Instead of using 'load_model', it would sometimes be nicer to define the nn-architecture, load the associated weights, load the saved optimizer and then compile again.
This would make it a lot easier to change nn-architectures in between epochs (e.g. add new layer, change dropout rate, etc.).
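A rough sketch of that workflow, for reference: optimizer weights can be pickled separately and pushed back in after compiling. The details vary between Keras versions, and _make_train_function is an internal method, so treat this as an assumption rather than a supported API; it also only works if the trainable weight shapes still match.
import pickle

# save
model.save_weights('model_weights.h5', overwrite=True)
with open('optimizer_state.pkl', 'wb') as f:
    pickle.dump(model.optimizer.get_weights(), f)

# later: rebuild the (possibly modified) architecture, compile, then restore
new_model.compile(loss='categorical_crossentropy', optimizer='adam')
new_model.load_weights('model_weights.h5', by_name=True)   # by_name tolerates added layers
new_model._make_train_function()                            # creates the optimizer's variables
with open('optimizer_state.pkl', 'rb') as f:
    new_model.optimizer.set_weights(pickle.load(f))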
Hi,
I think I found an answer in a different post. It's about the implementation of Adam, RMSProp, etc. in TensorFlow. When the model finds good weights on the first day it produces small losses, and Adam and other optimizers created with a given learning rate ignore the previously adapted (and probably smaller) learning rate and restart the learning with the issue described below (basically: low errors are handled with a small epsilon). So saving the adapted learning rate could help too.
https://github.com/ibab/tensorflow-wavenet/issues/143
Let me quote here:
Explanation
I've seen the behavior Zeta36 is describing in our test/test_model.py. When that test uses adam or rmsprop, I would see the loss drop and drop till a small number, then jump up to a large loss at some random time.
You can reproduce that problematic behavior if you change (what at the time of this writing in master) MomentumOptimizer to AdamOptimizer, make the learning rate 0.002 and delete the momentum parameter. Uncomment the statements that print loss.
If you run the test with
python test/test_model.py
every second or third time or so that you run the test you will see the loss will drop and then at some point jump up to a larger value, and sometimes cause the test to fail. I "worked around" that problem by futzing with the learn rate and number of training iterations we run in the test until it would reliably pass.
Anyway, I think I've found the cause. If you look at the tensorflow implementations of rmsprop and adam you will see they compute the change to a weight by dividing by a sinister lag-filtered rate-of-change or error magnitude. When the error or rate of change of error gets small, or even zero, near the bottom of the error basin, then the denominator gets close to zero. The only thing saving us from a NaN or Inf is they add an epsilon in the denominator. That epsilon defaults to 1e-10 for rmsprop and 1e-8 for adam. That's enough to make the change to our parameter a big number, presumably big enough give us a large loss.
So in PR 128 I specified a larger epsilon for rmsprop, and in PR 147 for adam. I found that these changes fix the problem of randomly increasing loss during the tests in test_model.py.
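In Keras terms, a hedged sketch of that workaround is to pass a larger epsilon than the default when building the optimizer for the resumed run (the values below are purely illustrative):
from keras import optimizers

opt = optimizers.Adam(lr=1e-4, epsilon=1e-4)        # epsilon much larger than the default
# opt = optimizers.RMSprop(lr=1e-4, epsilon=1e-4)
model.compile(loss='mse', optimizer=opt)
model.fit(x_train, y_train, epochs=5)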
I just read through this and I have the same issue (Adam optimizer). If you are having troubles reproducing it, you can pull this down: https://github.com/brainstain/CapsNet
Just change the epochs to 2 and run it twice. I'll try @guyko81 's advice and save off the learning rate in a bit. I'll post an update on whether the symptoms are improved.
Hello everybody, does this issue still hold with the latest Keras? I have an MLP network which I want to retrain once I have more data, i.e. incremental fitting. The model will be saved and loaded for prediction, and also loaded to be retrained with more data without losing previous performance, if possible. How can I do that? Please advise.
Thanks,
Nikos
Lowering the learning rate on the second instance helped me.
But you can save the whole model with model.save: https://keras.io/getting-started/faq/
"
How can I save a Keras model?
Saving/loading whole models (architecture + weights + optimizer state)
It is not recommended to use pickle or cPickle to save a Keras model.
You can use model.save(filepath) to save a Keras model into a single HDF5 file which will contain:
the architecture of the model, allowing to re-create the model
the weights of the model
the training configuration (loss, optimizer)
the state of the optimizer, allowing to resume training exactly where you left off.
"
I mean, with Adam we don't really change the learning rate; rather, we calculate the moving average of the gradient and of its square at the weight level, so you need to save everything!
I think this issue has been fixed. Use model.save() to save the model, and import the saved model using:
from keras.models import load_model
model = load_model('my_model.h5')
If you want to resume training it should work as this function saves all the optimiser information as well.
Nope, I don't think the issue has been solved. For me it's still the case that after training for a long time I reach a particular accuracy, save the model with ModelCheckpoint (which essentially uses model.save()), and then when I resume training I get a 20% difference from the actual accuracy at the point I first saved the model. Which means I still have to spend a whole lot of time and energy to resume training. Not good for the environment either. The optimizer here is Adam.
I encountered the same problem using load_weights.
But my situation is a little bit more complex: I have 2 models sharing some layers, so I can't use 'load_model' here. What can I do to resume training?
And I know it's rare, but what if someone wants to inherit from Model? Won't it be quite problematic to load a subclass of Model using 'load_model'?
Indeed not solved, same issue for me. I tried 2 things, both did not work:
1)
from keras.models import load_model
model = load_model('my_model.h5')
2)
def load_model():
    model = model_from_json(open('trainingmodel.json').read())
    model.load_weights('weights.h5')
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    return model

def save_model(model):
    json_string = model.to_json()
    open('trainingmodel.json', 'w').write(json_string)
    model.save_weights('weights.h5', overwrite=True)
tested with latest versions
-Keras 2.1.5
-Python 3.6.3
-Tensorflow 1.5.0
But Keras does give this warning, which points to the problem. In my network I use the output of 1 LSTM (the encoder) to initialize the input of the second (the decoder). Apparently this input cannot be serialized... So I'm stuck...
C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\topology.py:2368: UserWarning: Layer lstm_2 was passed non-serializable keyword arguments: {'initial_state': [<tf.Tensor 'lstm_1/while/Exit_2:0' shape=(?, 256) dtype=float32>, <tf.Tensor 'lstm_1/while/Exit_3:0' shape=(?, 256) dtype=float32>]}. They will not be included in the serialized model (and thus will be missing at deserialization time).
str(node.arguments) + '. They will not be included '
Same issue here, and also with LSTM. Keras 2.1.2, TF 1.5, Python 3.6. When I am loading using load_model(), training continues well. But, when I am loading only weights, all the following iterations never improve loss/val_loss.
model.fit(...)
model.save("lstm_model.h5")
if os.path.isfile("lstm_model.h5"):
    #1: model = load_model("lstm_model.h5")    # training continues well
    #2: model.load_weights("lstm_model.h5")    # losses never fall anymore
model.fit(...)
I'm having the same issue. Whenever I run the "fit" function of a saved model, all weights go bad and my predictions are wrong.
opt = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, decay=0.01)
rnn_model.compile(loss='binary_crossentropy', optimizer=opt, metrics=["accuracy"])
rnn_model.save('./models/my_model.h5')
#This predicts correctly
model = load_model('my_model.h5')
model.predict(x)
#This does NOT predict correctly
model=load_model('my_model.h5')
model.fit(X, Y, batch_size = 5, epochs=1)
model.predict(x)
It looks like it's not able to resume training. Anyone have any suggestions, please?
Update:
I haven't figured out the root of the problem, but it seems that the model I was loading was saved with Keras 2.0.6 and I am loading it with Keras 2.1.5. Something with the save_weights and load_weights functions was not working, so I had to load the weights layer by layer onto an architecture I rebuilt from scratch manually (loading the architecture from the saved model using JSON worked as well):
for layer_loaded, layer_built in zip(loaded_model, built_model):
    layer_built.set_weights(layer_loaded.get_weights())
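For reference, with plain Model objects this pattern usually iterates over the .layers lists (a sketch, assuming both models have the same layer order):
for layer_loaded, layer_built in zip(loaded_model.layers, built_model.layers):
    layer_built.set_weights(layer_loaded.get_weights())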
I am having the same problem, using model.save and load_model(), but I somehow lose the earlier training when I use ModelCheckpoint and a custom callback to compute AUROC on my loaded model. If I del model, load the model, and train it in the same instance without re-running my callbacks code, it continues to train from the previous run. Any ideas? Using Keras 2.1.4 and Python 3.6.4.
early = EarlyStopping(monitor="val_loss", mode="min", patience=3)
auroc = MultipleClassAUROC(sequence=validation_sequence,
                           class_names=class_names,
                           weights_path=output_weights_path,
                           stats=training_stats,
                           workers=8)
checkpoint = ModelCheckpoint(output_weights_path, verbose=1,
                             save_best_only=True, save_weights_only=True)
callbacks = [early, checkpoint, auroc,
             ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=patience_reduce_lr,
                               verbose=1, mode="min", min_lr=min_lr)]
and my model.fit_generator
history = model.fit_generator(generator=train_sequence,
                              steps_per_epoch=10,
                              epochs=1,
                              validation_data=validation_sequence,
                              validation_steps=validation_steps,
                              callbacks=callbacks,
                              class_weight=class_weights,
                              workers=8,
                              shuffle=False, verbose=1)
I got the same problem. Anyone has an idea?
I got the same problem!
I got the same problem too, any ideas?
My architecture is:
encoder_input_layer = Input(shape=(None,), name='encoder_Input')
encder_embedding_layer = Embedding(src_token_num, embedding_dim, name='encoder_Embedding')(encoder_input_layer)
encoder_lstm_1_layer = LSTM(embedding_dim, return_sequences=True, return_state=True, name='encoder_LSTM_1')(encder_embedding_layer)
encoder_lstm_2_layer = LSTM(embedding_dim, return_sequences=True, return_state=True, name='encoder_LSTM_2')(encoder_lstm_1_layer)
encoder_output, state_h, state_c = LSTM(embedding_dim, return_state=True, name='encoder_LSTM_Final')(encoder_lstm_2_layer)
encoder_states = [state_h, state_c]
decoder_input_layer = Input(shape=(None,), name='decoder_Input')
decoder_embedding_output = Embedding(trgt_token_num, embedding_dim, name='decoder_Embedding')(decoder_input_layer)
# Use encoder states to initialize decoder LSTM
decoder_lstm_1_layer = LSTM(embedding_dim, return_sequences=True, return_state=True, name='decoder_LSTM_1')
decoder_lstm_1_layer_output = decoder_lstm_1_layer(decoder_embedding_output, encoder_states)
decoder_lstm_2_layer = LSTM(embedding_dim, return_sequences=True, return_state=True, name='decoder_LSTM_2')
decoder_lstm_2_layer_output = decoder_lstm_2_layer(decoder_lstm_1_layer_output)
# State_h and state_c discarded.
decoder_lstm_final_layer = LSTM(embedding_dim, return_sequences=True, return_state=True, name='decoder_LSTM_Final')
decoder_lstm_output, _, _ = decoder_lstm_final_layer(decoder_lstm_2_layer_output)
# Classify words
decoder_dense_1_layer = Dense(embedding_dim, activation='relu', name='decoder_Dense_1_relu')
decoder_dense_1_output = decoder_dense_1_layer(decoder_lstm_output)
decoder_dense_final_layer = Dense(trgt_token_num, activation='softmax', name='decoder_Dense_Final')
decoder_dense_output = decoder_dense_final_layer(decoder_dense_1_output)
encoder_decoder_model = Model(inputs=[encoder_input_layer, decoder_input_layer], outputs=decoder_dense_output)
When I tried to save the model, I got a warning as follows:
When I load the saved model and run model.fit(), the training accuracy is close to 0!
Hello everyone, I faced the same problem. But I think it's solved in my case.
First save the model and weights as in the code below....
model_json = model.to_json()
mdl_save_path = 'model.json'
with open(mdl_save_path, "w") as json_file:
    json_file.write(model_json)
mdl_wght_save_path = 'model.h5'
model.save_weights(mdl_wght_save_path)
Then I started another session, completely closing all open Python files, for retraining. I also tried this by checkpointing the model while training. At the time of resuming training, I first loaded the model architecture from the .json file and then loaded the weights from the .h5 file using load_weights(). Then I compiled the model using model.compile() and fit it with model.fit().
N.B.: I used SGD both times, while training and while resuming training. It worked.
Though I did not check this with other optimizers. I saw that at the time of retraining, if I use an optimizer other than SGD (I used SGD in normal training), the issue persists. So I am pretty confident that using different optimizers during normal training and resumed training will cause you a problem.
Guys,
I fixed the problem by reducing the learning rate to 1e-5 (a small lr for Adam) when I fine-tune my pretrained model, which had been trained using Adadelta with a much higher starting lr. I think the issue is the starting lr for Adam that messes things up. For fine-tuning, just use a small lr for a new optimizer of your choice. Hope this helps.
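In code, that suggestion looks roughly like this (the filename, loss, and lr value are placeholders):
from keras.optimizers import Adam

model.load_weights('pretrained_weights.h5')
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=1e-5),              # deliberately small lr for fine-tuning
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)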
Thanks @peymanrah this is exaclty the solution I needed!
@peymanrah when you first trained the model with Adam, what learning rate did your training stop at. Trying to see by what factor I should decrease the lr for fine_tuning
@guyko81 @peymanrah I agree with you! The main reason is that when we save the model after several epochs, its learning rate is much smaller. Therefore, once we load the same model and continue training it, we have to set the learning rate to its last value instead of the default value (the default value is for the first epoch; it is very large).
Thank you very much. This problem is finally fixed.
It is sad that such a basic issue has still not been solved.
It is happening because model.save(filename.h5) does not save the state of the optimizer. So optimizers like Adam and RMSProp do not work, but SGD works, as mentioned in one of the previous comments (I verified this), since it is a stateless optimizer (the learning rate is fixed).
It is just sad that such a popular library has such basic/glaring/trivial bugs/problems :(
Guys,
I fixed the problem by reducing the learning rate to 1e-5 (a small lr for Adam) when I fine-tune my pretrained model, which had been trained using Adadelta with a much higher starting lr. I think the issue is the starting lr for Adam that messes things up. For fine-tuning, just use a small lr for a new optimizer of your choice. Hope this helps.
Reducing the learning rate solved my problem too. At first lr was 0.01, and then I reduced it to 0.001 on the second try. After one epoch it returned to its last state (in terms of acc and loss).
Note that I just saved and loaded the weights.
It is sad that such a basic issue has still not been solved.
It is happening because model.save(filename.h5) does not save the state of the optimizer. So optimizers like Adam and RMSProp do not work, but SGD works, as mentioned in one of the previous comments (I verified this), since it is a stateless optimizer (the learning rate is fixed).
It is just sad that such a popular library has such basic/glaring/trivial bugs/problems :(
@champnaman What states of Adam and RMSProp did you refer to? The weights (states) of optimizers such as RMSProp and Adam are saved, except for TensorFlow optimizers wrapped in TFOptimizer, if you look at save_model() in saving.py (mine is Keras 2.1, which differs from the current code, but even the old version saves the optimizer states), which is confirmed by @fchollet in this issue. And the tensorflow.keras save_model documentation also confirms this.
And the learning rate (lr) of that epoch, along with epsilon, rho, and the whole optimizer instance, are saved as well. load_model() in saving.py loads those hyperparameters successfully in my case. However, my loaded loss is very different from that of the saved model, which is a problem I'm still investigating. It could be related to a problem with multiple GPUs, which is beyond the scope of this issue.
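One way to double-check this (a sketch; 'model.h5' is a placeholder, and the group/attribute names are as I understand the Keras HDF5 format) is to look inside the file written by model.save():
import h5py

with h5py.File('model.h5', 'r') as f:
    print('optimizer_weights' in f)         # True if the optimizer state was saved
    print(f.attrs.get('training_config'))   # loss/optimizer config, if present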
@trane293 You mentioned that you used pickle to save and load the model, which worked very well here. Saving the model implies saving the weights of the model, the states of the optimizers, custom objects like layers, custom loss functions, custom accuracy metrics, and more.
Can you please share your code to save and load the model? I too am facing this issue and am in urgent need of finding an alternative for now. My model takes approximately 10 days to train properly, but because of an electricity cutoff, its training got interrupted. Please help.
Also, I have opened a new issue in Keras. Can you help me debug the problem in a step-by-step manner? It would be a great help to a newbie like me. Thank you.
Issue link - https://github.com/keras-team/keras/issues/12263
@trane293 You mentioned that the issue is fixed here, but I am still facing it. I am saving my model using the ModelCheckpoint callback in Keras. Is it that the issue is fixed in model.save() but not in ModelCheckpoint?
I am also using the function multi_gpu_model() for training. Is it interfering? Can you please help with my issue mentioned in the above comment?
I experienced this issue with Keras on both the MXNet and TensorFlow backends. My solution was to switch from keras to tensorflow.keras. This obviously only works with the TensorFlow backend. However, if you are already using the TensorFlow backend, it is just a matter of changing your import statements, as the functionality of tensorflow.keras is almost identical to keras.
Since switching I have not experienced this annoying bug.
Thanks, I'll try the solution this week.
This issue is not yet fixed. I'm experiencing it with TensorFlow as the backend. Any idea?
I have the same issue. Though I can't resume training after load_weights (the loss value is just like epoch 1), I can load the weights and predict well.
I also found that the loss drops more quickly after load_weights: in the normal case, the loss goes from 3 down to 1.5 in maybe 10 epochs, but after calling load_weights it takes only 3 or 4 epochs.
I have searched for a solution for a long time and haven't fixed it yet.
Fortunately, it predicts well.
After loading your weights, when you train your model, set the parameter initial_epoch to the last epoch you previously trained your model to. E.g., if you trained your model for 100 epochs, saved weights after each epoch via ModelCheckpoint, and want to resume training from the 101st epoch, you should do it in the following way:
model.load_weights('path_to_the_last_weights_file')
model.fit(initial_epoch=100)
Keep the other parameters the same.
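For example (filenames and epoch counts are placeholders; note that epochs is the total, so this trains 50 more epochs starting from epoch 101):
model.load_weights('weights.100.hdf5')
model.fit(x_train, y_train,
          initial_epoch=100, epochs=150,
          callbacks=[ModelCheckpoint('weights.{epoch:02d}.hdf5')])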
Also experiencing this, especially with multi_gpu_model of a multi-model (model of models), saving the original multi-model. When I load weights it's as if they've never been saved (although load isn't erroring).
I'm using an "altmodelcheckpoint" to save the weights of the original model. Not sure if it is working.
When checking val loss I get the exact same patterns, as if the weights have been reinitialized...
I think there might actually be a bug in here somewhere.
There is no longer an issue with saving/loading model weights in Keras. I know, as I am a Keras user. It is some error in your program which leads to such issues.
I'll try to find reproduction steps... But if this bug shows up only in rarer circumstances, then it's still a bug. That said, with 1.14 we'll no longer use multi_gpu_model, so it doesn't really matter.
@veqtor You can check my mini project. I faced same issues when built it but now everything works fine. I added support for multi gpu and that too is working. Check scripts here - https://github.com/ParikhKadam/bidaf-keras
model.fit(x_train, y_train, batch_size=batch_size, initial_epoch=30,
          epochs=epochs, validation_data=(x_test, y_test),
          callbacks=[modelcheckpoint, earlystopping])
model.fit(x_train, y_train, batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test),
          callbacks=[modelcheckpoint, earlystopping])
Add an initial_epoch argument indicating the number of epochs already trained. I had originally written only one fit call, and the program just produced output without continuing training; after I added another fit call, training continued.
My solution was to switch from keras to tensorflow.keras. This obviously only works with the TensorFlow backend.
Since switching I have not experienced this annoying bug
Here are my scripts:
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard, ReduceLROnPlateau, EarlyStopping

saver = ModelCheckpoint(os.path.join(save_path, model_name + '-{epoch:02d}-{val_loss:.2f}.hdf5'),
                        monitor='val_loss',
                        verbose=0,
                        save_best_only=False,
                        save_weights_only=False,
                        mode='auto',
                        save_freq=save_interval)
Still not working... And I have trained one epoch, but it doesn't seem to speed up convergence.
My solution seems to be working well. Here's my repository. You can try saving the weights only, then rebuilding the network and loading the weights.
As @peymanrah and @turb0bur said, I set initial_epoch=39, which is where my training paused according to TensorBoard, and kept lr=0.008 unchanged. I tried reducing lr, but I didn't give it enough time to train for a few epochs. Here's my TensorBoard visualization: we can see that after one epoch the network seems back on track.
But unfortunately, after load_weights the model still cannot predict.
(epoch_loss plot: the blue line is the validation trend, the orange line is the training trend)
great!