If I run python mnist_cnn.py twice, I get different results. I have no idea what might be going on here. Here are two example outputs:
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is enabled with initial size: 75.0% of memory, CuDNN 4007)
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 25s - loss: 0.2538 - acc: 0.9227 - val_loss: 0.0534 - val_acc: 0.9832
Epoch 2/12
60000/60000 [==============================] - 24s - loss: 0.0945 - acc: 0.9714 - val_loss: 0.0378 - val_acc: 0.9876
Epoch 3/12
60000/60000 [==============================] - 24s - loss: 0.0704 - acc: 0.9787 - val_loss: 0.0355 - val_acc: 0.9883
Epoch 4/12
60000/60000 [==============================] - 24s - loss: 0.0584 - acc: 0.9830 - val_loss: 0.0331 - val_acc: 0.9893
Epoch 5/12
60000/60000 [==============================] - 24s - loss: 0.0489 - acc: 0.9848 - val_loss: 0.0305 - val_acc: 0.9897
Epoch 6/12
60000/60000 [==============================] - 24s - loss: 0.0428 - acc: 0.9870 - val_loss: 0.0315 - val_acc: 0.9901
Epoch 7/12
60000/60000 [==============================] - 24s - loss: 0.0383 - acc: 0.9880 - val_loss: 0.0305 - val_acc: 0.9910
Epoch 8/12
60000/60000 [==============================] - 24s - loss: 0.0373 - acc: 0.9881 - val_loss: 0.0298 - val_acc: 0.9903
Epoch 9/12
60000/60000 [==============================] - 24s - loss: 0.0320 - acc: 0.9901 - val_loss: 0.0286 - val_acc: 0.9911
Epoch 10/12
60000/60000 [==============================] - 24s - loss: 0.0311 - acc: 0.9902 - val_loss: 0.0284 - val_acc: 0.9913
Epoch 11/12
60000/60000 [==============================] - 24s - loss: 0.0282 - acc: 0.9910 - val_loss: 0.0290 - val_acc: 0.9910
Epoch 12/12
60000/60000 [==============================] - 24s - loss: 0.0264 - acc: 0.9916 - val_loss: 0.0296 - val_acc: 0.9910
Test score: 0.0296402487505
Test accuracy: 0.991
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is enabled with initial size: 75.0% of memory, CuDNN 4007)
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 25s - loss: 0.2543 - acc: 0.9227 - val_loss: 0.0574 - val_acc: 0.9819
Epoch 2/12
60000/60000 [==============================] - 24s - loss: 0.0939 - acc: 0.9719 - val_loss: 0.0403 - val_acc: 0.9869
Epoch 3/12
60000/60000 [==============================] - 24s - loss: 0.0709 - acc: 0.9789 - val_loss: 0.0371 - val_acc: 0.9870
Epoch 4/12
60000/60000 [==============================] - 24s - loss: 0.0584 - acc: 0.9828 - val_loss: 0.0318 - val_acc: 0.9888
Epoch 5/12
60000/60000 [==============================] - 24s - loss: 0.0492 - acc: 0.9850 - val_loss: 0.0292 - val_acc: 0.9900
Epoch 6/12
60000/60000 [==============================] - 24s - loss: 0.0420 - acc: 0.9867 - val_loss: 0.0313 - val_acc: 0.9897
Epoch 7/12
60000/60000 [==============================] - 24s - loss: 0.0393 - acc: 0.9875 - val_loss: 0.0303 - val_acc: 0.9905
Epoch 8/12
60000/60000 [==============================] - 24s - loss: 0.0372 - acc: 0.9883 - val_loss: 0.0293 - val_acc: 0.9914
Epoch 9/12
60000/60000 [==============================] - 24s - loss: 0.0311 - acc: 0.9907 - val_loss: 0.0279 - val_acc: 0.9909
Epoch 10/12
60000/60000 [==============================] - 24s - loss: 0.0319 - acc: 0.9900 - val_loss: 0.0269 - val_acc: 0.9920
Epoch 11/12
60000/60000 [==============================] - 24s - loss: 0.0282 - acc: 0.9914 - val_loss: 0.0283 - val_acc: 0.9913
Epoch 12/12
60000/60000 [==============================] - 24s - loss: 0.0270 - acc: 0.9916 - val_loss: 0.0312 - val_acc: 0.9907
Test score: 0.0312398415336
Test accuracy: 0.9907
I'm on terrible internet right now and can't double-check that Keras and Theano are 100% up to date, but pip says I'm at Keras 1.0.1, Theano 0.8.1, numpy 1.11.0.
Running this script twice produces identical output, as expected:
import numpy
numpy.random.seed(1337)
print(numpy.random.random(3))
Can't reproduce; mnist_cnn is deterministic on my machine with Theano (OSX).
The only sources of randomness in Keras are Numpy's random module (used for weight initializations) and Theano's RNGs (used e.g. for dropout), and those are all seeded from Numpy's random state. Seeding the Numpy RNG should work. Sorry I can't help you further, but this seems specific to your system...
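For example, a minimal sketch of what that seeding looks like (the seed value is arbitrary; what matters is seeding before Keras builds any layers):
import numpy as np
np.random.seed(1337)  # seed numpy before any Keras layers are built

from keras.models import Sequential
from keras.layers import Dense

# with the seed fixed, the initial weights below are identical across runs
model = Sequential()
model.add(Dense(10, input_dim=20))
print(model.get_weights()[0][:2])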
cuDNN's backward pass is by default non-deterministic. See http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html
The cuDNN documentation states that reproducibility is not guaranteed with the default implementation of the following two operations: cudnnConvolutionBackwardFilter and cudnnConvolutionBackwardData. These correspond to the gradient with respect to the weights and the gradient with respect to the input of the convolution. They are also sometimes used in the forward pass, when they give a speedup.
The Theano flag dnn.conv.algo_bwd can be used to force a slower but deterministic convolution implementation.
Maybe that's the reason?
@NasenSpray thanks! That was the issue. Back on good internet, I double-checked that numpy, Theano and Keras are 100% up to date, and verified the options to fix this:
THEANO_FLAGS="optimizer_excluding=conv_dnn" python mnist_cnn.py, which takes twice as long at around 60s/epoch instead of 25s/epoch.THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py (the flag mentioned in the quoted post has been deprecated in favor of two flags). This actually runs in the same time or slightly faster on my machine, at 24s/epoch consistently.Going to close this as it's not really a "bug" with Keras itself, so much as different behavior between different libraries. Though it might make sense to remove the np.random.seed(1337) at the top of many of the examples, now that it's clear this line doesn't guarantee anything.
@kylemcdonald that doesn't fix random seeding in the cifar10 example for me.
@giorgiop to be clear:
np.random.seed(1337) near the top of the file. Is that correct?
Actually, that works! I was testing some other code with convolutional layers, which stayed non-deterministic until I applied the trick you explained above. Then I went back to the cifar10 example, which still had the problem. But now I notice that this script is one of the few examples in the folder that doesn't fix a random seed. Thanks
Hi, this might seem trivial, but how would I declare
THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py
in my .theanorc file?
When I do the following
[dnn]
enabled = True
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64
conv.algo_bwd_filter=deterministic
conv.algo_bwd_data=deterministic
the code is non-deterministic, but when I run
THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py
the code is deterministic. It must be my syntax that is wrong, but I get no errors. Does anyone know the proper syntax? I would rather have it in .theanorc than set it in my .bashrc.
Found the solution to my own problem through trial and error. For those wondering, this is it:
[dnn]
enabled = True
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64
[dnn.conv]
algo_bwd_data=deterministic
algo_bwd_filter=deterministic
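If it helps, you can also verify the options were actually picked up by printing them from theano.config (a quick check; I'm assuming your Theano version exposes the two split flags, as the deprecation note above implies):
import theano
print(theano.config.dnn.conv.algo_bwd_data)    # should print 'deterministic'
print(theano.config.dnn.conv.algo_bwd_filter)  # should print 'deterministic'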
It seems that I solved this problem in this way:
http://blog.csdn.net/qq_33039859/article/details/75452813
Step 1: fix the numpy random seed at the top of the code.
Step 2: make sure you call model.fit(..., shuffle=False).
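For reference, a minimal runnable sketch of those two steps on a toy model (using Keras 2 argument names such as epochs; the layer sizes and data here are arbitrary). Note that on GPU this alone may still not be sufficient, as discussed above:
import numpy as np
np.random.seed(1337)  # step 1: fix the numpy seed before building the model

from keras.models import Sequential
from keras.layers import Dense

# toy data, generated after seeding so it is identical across runs
X = np.random.random((100, 20))
y = np.random.randint(2, size=(100, 1))

model = Sequential()
model.add(Dense(8, input_dim=20, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# step 2: disable batch shuffling so samples are visited in the same order each run
model.fit(X, y, epochs=3, batch_size=16, shuffle=False)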
What @GuokaiLiu proposes doesn't seem to solve the randomness in my case, and neither does what is proposed in https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development . Any clues?
@kylemcdonald How do I set these flags for the TensorFlow backend?
Hi,
https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development also doesn't solve my problem. I run the model on GPU, and on top of the inconsistency there is a sizeable difference between the results (measured with my own metric), e.g. run 1: 0.878, run 2: 0.858, run 3: 0.861.
My data set is fixed, so I'm sure I give the same training and test sets to the model. I can't understand how randomness could affect the performance this much. Or is there something wrong with my model?
@hkmztrk Did you use cuDNN?
Is there any option to switch off cuDNN in Keras/TensorFlow?
@mrgloom how do we tell whether cuDNN was used or not? I was assuming tensorflow-gpu uses cuDNN.
Here is my startup log; it shows libcudnn.so.5 being opened, so cuDNN is indeed in use:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0)
Also, after updating my code as follows, I'm getting an "Exit 139" error (which I interpret as a segmentation fault) in two of my three runs:
import numpy as np
import tensorflow as tf
import random as rn
import os

# note: PYTHONHASHSEED only disables hash randomization if it is set before the
# interpreter starts, so exporting it in the shell is more reliable than this
os.environ['PYTHONHASHSEED'] = '0'

# seed the NumPy and Python RNGs
np.random.seed(1)
rn.seed(1)

# force single-threaded execution; thread scheduling is a source of non-determinism
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)

import keras
from keras import backend as K

# seed the TensorFlow graph-level RNG and install the single-threaded session
tf.set_random_seed(1234)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)