Keras: Saved model behaves differently on different machines

Created on 17 Aug 2017 · 27 comments · Source: keras-team/keras

After studying #439, #2228, #2743, #6737 and the new FAQ about reproducibility, I was able to get consistent, reproducible results on my development machines using Theano. If I run my code twice, I get the exact same results.

The problem is that the results are reproducible only on the same machine. In other words, if I

  • Train a model on machine A
  • Evaluate the model using predict
  • Save the model (using save_model, or model_to_json and save_weights)
  • Transfer the model to machine B and load it
  • Evaluate again the model on machine B using predict

The results of the two predicts are different. Using CPU or GPU makes no difference - after I copy the model file(s) from one machine to the other, the performance of predict changes dramatically.
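For reference, a minimal, self-contained sketch of the workflow described above (the tiny model, random inputs, and file name are all hypothetical); the surprise reported here is that the two predict calls disagree once the file crosses machines:

```python
import numpy as np
from keras.models import Sequential, load_model
from keras.layers import Dense

np.random.seed(0)
x_val = np.random.rand(8, 20).astype('float32')     # hypothetical evaluation inputs

# Machine A: build (or train) a model, predict, and save everything in one file
model = Sequential([Dense(16, activation='relu', input_shape=(20,)),
                    Dense(3, activation='softmax')])
model.compile(optimizer='adam', loss='categorical_crossentropy')
preds_a = model.predict(x_val)
model.save('model.h5')                               # architecture + weights + optimizer state

# Machine B: load the transferred file and run the same inputs
model_b = load_model('model.h5')
preds_b = model_b.predict(x_val)
print(np.abs(preds_a - preds_b).max())               # expected to be ~0 when nothing else differs
```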

The only difference on the two machines is the hardware (I use my laptop's 980M and a workstation with a Titan X Pascal) and the NVIDIA driver version, which is slightly older on the workstation. Both computers run Ubuntu 16.04 LTS and Cuda 8 with cuDNN. All libraries are on the same version on both machines, and the Python version is the same as well (3.6.1).

Is this behavior intended? I expect that running a pre-trained model with the same architecture and weights on two different machines yields the same results, but this doesn't seem to be the case.

On a side note, a suggestion: the FAQ on reproducibility should explicitly state that the development version of Theano is needed to get reproducible results.


All 27 comments

Same here with the TensorFlow backend. I trained my models on my local machine (Ubuntu 16.04 LTS). When I tested a model on an AWS EC2 instance I got different prediction numbers.

Have you solved this?

No, I'm still experiencing this problem.

You can check if any of the solutions posted in #4875 work for you. What versions of the libraries are you running?

I am running Keras 2.0.6.

In my experience the package versions are not the problem (at least for now). I did some experiments today and found that if I remove PCA from my modeling pipeline everything works fine. Did you use sklearn's PCA to reduce the dimension of your inputs? If so, you might want to try removing it. That solved my problem for now.

I don't know why this happens. This post says that non-determinism in the weights could cause problems, but that doesn't explain why the neural network behaves the same on the same machine. Inspired by that post, I guess differing results of the dimensionality reduction lead to this problem.
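If the mismatch comes from the preprocessing rather than the network, one workaround (a sketch, assuming a scikit-learn PCA fitted at training time; the data and file name are hypothetical) is to persist the fitted transformer next to the model and reuse it at prediction time, instead of re-fitting it on the other machine:

```python
import numpy as np
import joblib
from sklearn.decomposition import PCA

x_train = np.random.rand(100, 200)        # hypothetical training features
x_new = np.random.rand(5, 200)            # hypothetical features seen at prediction time

# Training machine: fit PCA once and persist the fitted transformer
pca = PCA(n_components=50).fit(x_train)
joblib.dump(pca, 'pca.joblib')

# Prediction machine: load the same fitted PCA; never re-fit it there
pca_loaded = joblib.load('pca.joblib')
x_new_reduced = pca_loaded.transform(x_new)
```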

Not sure if this will help at all, but I was dealing with this for a day and a half before I realized it was a difference in how the machines handled hashing words before I passed them to my Embedding layer.

@RunshengSong, I don't use sklearn. @rsmith49, are you talking about setting PYTHONHASHSEED?

I found the same thing as @rsmith49 - I had a word2vec model that would act as if the weights had been completely re-initialized when I loaded them from disk in a new session. After also saving/pickling the dicts that mapped words to ints and reloading them from disk whenever I started a new training session, the model behaved as expected.
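In other words (a sketch with hypothetical names): build the word-to-index mapping once, persist it, and reload it in every later session instead of rebuilding it, since any hash-based ID assignment is not stable across Python sessions unless PYTHONHASHSEED is fixed:

```python
import pickle

# Training session: build the mapping once from the training vocabulary
vocab = ['the', 'cat', 'sat']                          # hypothetical vocabulary
word2index = {w: i + 1 for i, w in enumerate(vocab)}   # 0 reserved for padding
with open('word2index.pkl', 'wb') as f:
    pickle.dump(word2index, f)

# Any later session / other machine: reload the mapping, do not rebuild it
with open('word2index.pkl', 'rb') as f:
    word2index = pickle.load(f)
tokens = [word2index.get(w, 0) for w in ['the', 'cat']]  # same ids as at training time
```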

@basaldella Yes, turns out my issue was more along the lines of #4875, and was inconsistent between different Python sessions, not just different machines.

@halhenke I'm also using a word2vec model, but I use GloVe's pre-trained weights following this tutorial, so I guess that shouldn't be the issue. Are you using pre-trained weights as well?

@basaldella Have you fixed this issue?
It seems that I have the same problem. I retrained a model by fine-tuning InceptionV3 on my own images on a GPU machine. After training, the accuracy went up to 91%, which I am happy with. During training the improved model was saved with callbacks, so I can load the best retrained model with model.load_model(model_path), and I tested it with one image. The predict results are always the same and correct (because I know which class this image belongs to).
The result looks like this:
[[ 0.00197385 0.01141251 0.02262068 0.9121536 0.00810914 0.01657074 0.00370198 0.00617629 0.00972648 0.00531203 0.00224261]]

Now, when I copy the retrained model (HDF5 file) to my laptop, load the model again, and test it with the same image, I get a totally different result:
[[ 0.00373867 0.22160383 0.10066977 0.35440436 0.02839879 0.17799987 0.01744748 0.02645957 0.0299265 0.03026218 0.00908909]]

The Python environment is the same on the two machines, with Keras 2.0.8:
The results are always the same on the same machine.
The weights are the same after I load the model file.
......I checked many things.

Why are the results different on the two machines? Does anybody know what's going on?
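One way to make the "weights are the same" check concrete (a sketch; the model path is hypothetical) is to fingerprint every weight array after loading on each machine and diff the printed digests. If the digests match but predictions still differ, the divergence is in the compute path rather than in the saved file:

```python
import hashlib
import numpy as np
from keras.models import load_model

model = load_model('retrained_inception_v3.h5')   # hypothetical path to the transferred HDF5 file

# Print a short digest per weight tensor; run this on both machines and compare the output
for layer in model.layers:
    for i, w in enumerate(layer.get_weights()):
        digest = hashlib.md5(np.ascontiguousarray(w).tobytes()).hexdigest()[:12]
        print(layer.name, i, w.shape, digest)
```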

@wangchenouc no, I was absolutely not able to fix this issue. If you have any news, please tag me in your issue as well. I'm actually thinking of switching to a lower-level framework because I'm not able to solve this problem.

@basaldella Please look at this #8149

@wangchenouc thanks, but this does not solve my issue. In my case the versions of Keras are the same on both machines.

@basaldella Just comparing the version of Keras is not enough. Maybe you need to compare all the function code that you used. Try a very simple case like what I did; then it's easy to compare your differences step by step. I spent 3 whole days debugging the code step by step, and finally solved my problem.

Good luck to you!

@wangchenouc I know. But I'm cloning the same repo on 2 machines, installing the same Python and library versions with a script, and I still have no luck getting the same results.

Thanks for encouraging me though :)

Any news on this issue? I'm running into the same problem.
Two instances - identical, because it's the same hardware setup and the second has been installed from an image of the first one.
When I model.save() my model in the first instance and load_model() in the second, the results seem to be random when evaluating in the second instance. Accuracy also drops to unreasonable values (from .97 to .52).

What are the possible causes other than differences in code/setup/hardware? I've been searching for solutions for the last 3 days and nothing seems to work.

I've looked at the several potential solutions reported here and in related threads, but no luck either, @Philipduerholt. In my case my last layer is a softmax, and when I predict on the same training data (not even test data), I get equal probabilities across my classes, i.e. the model is completely random.

It worked, finally. And in hindsight it looks simple. I'll try to include all relevant points:
I'll be talking about a training instance and a production instance.
I use TensorFlow backend.
Python version 3.5.4.
Keras version 2.0.5.

  • I pickle everything I use as input or as ID-map (like word_id_dictionaries).
  • e.g. at the end I have word2index.dict, label2index.dict (most people would use .pkl).
  • For evaluation I also pickle X_test and y_test.
  • Build and train your model in the training instance.
  • I use a Sequential model with an Embedding layer (some had problems with that).
  • I use ModelCheckpoint() and save it as a list (callbacks_list); file names end in .hdf5.
  • model.fit has callbacks = callbacks_list.
  • After training, I choose the most promising saved model.
  • I can load_model('models/most_promising.hdf5') in the same instance and evaluate.
  • This works as expected.
  • I transfer the .hdf5 file and all pickled files to the production instance.
  • In production I make sure all package versions match the training instance.
  • Best to use something like conda env.
  • I import: from keras import backend as K
  • Immediately after the imports I set the learning phase: K.set_learning_phase(0)
  • I initialize/load all the things:

    • model = load_model(model_path)
    • with open(word2index_path, "rb") as f:
      word2index = pickle.load(f)
    • etc.
  • Evaluation works as expected.
  • Predict works as expected.

I hope it helps.
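Pulled together, the production-instance side of that checklist looks roughly like this (a sketch; the model and dictionary file names are the ones mentioned above, the test-set file names are hypothetical):

```python
import pickle
from keras import backend as K
from keras.models import load_model

K.set_learning_phase(0)                      # set immediately after the imports

model = load_model('models/most_promising.hdf5')

with open('word2index.dict', 'rb') as f:     # pickled ID maps from the training instance
    word2index = pickle.load(f)
with open('label2index.dict', 'rb') as f:
    label2index = pickle.load(f)

with open('X_test.pkl', 'rb') as f:          # hypothetical names for the pickled test set
    X_test = pickle.load(f)
with open('y_test.pkl', 'rb') as f:
    y_test = pickle.load(f)

print(model.evaluate(X_test, y_test, verbose=0))
```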

I had the same problem...
upgrading Keras on both machines to version 2.1.5 solved the problem for me.

I had the same problem...
upgrading Keras on both machines to version 2.1.5 solved the problem for me.

Amazing! The solution is that each machine should have the same Keras version! The same inputs on different versions will give different outputs....
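A quick way to confirm that both machines really are on the same stack (a minimal sketch; run it as-is on each machine and diff the output) is to print the relevant versions in one place:

```python
import sys
import numpy
import tensorflow
import keras

# Run on both machines and compare the output line by line
print('python    ', sys.version.split()[0])
print('numpy     ', numpy.__version__)
print('tensorflow', tensorflow.__version__)
print('keras     ', keras.__version__)
```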

Running into the same problem and looking for a solution.

It worked, finally. And in hindsight it looks simple. (Full training/production-instance checklist quoted from the comment above.)

this is not working for me :(

tf.set_random_seed(0) worked for me

tf.set_random_seed(0) worked for me

where should this line be placed? Before sess = tf.Session(config=config)?
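In graph-mode TensorFlow the graph-level seed must be set before the operations it should affect are created, so the usual ordering (a sketch, assuming the TF 1.x session API mentioned in the question) is seeds first, then graph construction, then the session:

```python
import random
import numpy as np
import tensorflow as tf

# Fix all seeds before any graph ops are created
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)          # graph-level seed (tf.compat.v1.set_random_seed in TF 2.x)

# ... build the Keras / TensorFlow graph here ...

config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)
sess = tf.Session(config=config)   # created after the seed has been set
```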

I am facing the same problem in Golang; here is my approach:

  1. Train a model on Ubuntu 18.04 (using Python, TensorFlow and Keras)
  2. Optimize, freeze and save the model to be used with the TensorFlow Go API
  3. LoadSavedModel on Ubuntu 18.04 using the TensorFlow Go API
  4. LoadSavedModel on Raspberry Pi 4 using the TensorFlow Go API

The weights for all layers are different when loaded on Ubuntu (step 3) and on the Raspberry Pi (step 4), which causes different softmax predictions.

Sample weights in the different environments:
These are just sample weights; however, all the weights in all layers are different.
TensorFlow API versions used to load the model: Go TensorFlow (r2.0), TensorFlow C (r2.0), Golang (1.13.6)

Loaded weights on Ubuntu:
[0.5031438 -0.062892914 -0.10482144 -0.04192853 0.7127869 0.46121502 -0.3983221 ....]

Loaded weights on the Raspberry Pi for the same layer and the same filter as above:
[0.49415612 -0.07188058 -0.11380911 -0.050916195 0.70379925 0.45222735 -0.40730977 ....]

How do I solve this?
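One thing worth ruling out first (a sketch in Python with a hypothetical export directory) is that the exported model bytes themselves differ between the two machines; hashing the files of the exported model on Ubuntu and on the Raspberry Pi and comparing the digests separates a transfer or packaging problem from a runtime or precision problem:

```python
import hashlib
import os

def file_digests(export_dir):
    """Return a {relative_path: sha256} map for every file under an exported model directory."""
    digests = {}
    for root, _, files in os.walk(export_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, 'rb') as f:
                digests[os.path.relpath(path, export_dir)] = hashlib.sha256(f.read()).hexdigest()
    return digests

# Run on both machines against the same exported model directory and diff the output
for rel_path, digest in sorted(file_digests('exported_model').items()):
    print(digest[:16], rel_path)
```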

I am facing the same issue, but in a bit of a weird fashion.
Long story short, I had set up 3 pipelines:

1) Training pipeline -- train on Azure ML with TF 2.0 - NC6s V2 (cloud VM) --- training OK
2) Testing pipeline -- testing on a local machine with TF 2.3, RTX 2070 --- predict results OK
3) Deployment pipeline -- NC6s V2 for inference with TF 2.3 (cloud VM) --- erratic behavior of model.predict

For pipelines 2 and 3 the environment is kept the same, with the same code. The only difference is the hardware and GPU.
What baffles me is that the prediction results on the local machine were as expected, but when deployed on the cloud VM it sometimes works and sometimes doesn't. What is even more weird is that if I run inference on a few images in sequence -- say [image1, image2, image3] -- image1 and image3 predict OK, but image2 does not get a complete prediction. For image2, most of the prediction works except for the last few tiles of the image.

I am at a loss here because I don't know where to start debugging, and I can't just spin up my VM to test as it costs money. I am not sure if it is related to some memory issue or weight initialization, etc. Anyone have any pointers?

@alankongfq: Not even going that far, I found out after a whole day of debugging that my TF 2.1 model gives different predictions when run on CPU vs GPU, keeping EVERYTHING else the same (same machine, same OS, fixed saved weights, no randomness anywhere). I knew there are precision differences between the two devices; I didn't realize they could be so significant. I think it has to do with the particular NN architecture as well. With lots of parameters, sometimes a little error in each parameter accumulates into a BIG error in the final predictions. The first point of the first answer to this SO question makes the same point: https://stackoverflow.com/questions/43221730/tensorflow-same-code-but-get-different-result-from-cpu-device-to-gpu-device
That answer also links some closed TensorFlow GitHub issues which conclude that this is expected behavior and not a bug. Hope this helps.
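To see how big that CPU/GPU gap actually is on a given model (a sketch in TF 2.x style; the model path and input batch are hypothetical, and a visible GPU is assumed), one can run the same batch through the same loaded weights on each device and look at the largest elementwise difference:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('model.h5')          # hypothetical saved model
x = np.random.rand(8, 224, 224, 3).astype('float32')    # hypothetical input batch

with tf.device('/CPU:0'):
    preds_cpu = model(x, training=False).numpy()
with tf.device('/GPU:0'):                                # assumes a GPU is visible
    preds_gpu = model(x, training=False).numpy()

print('max abs diff:', np.abs(preds_cpu - preds_gpu).max())
```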

Hi @tu-curious, thanks for the pointers, I will take a closer look when I have the time.
