If I run python mnist_cnn.py twice, I get different results. I have no idea what might be going on here. Here are two example outputs:
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is enabled with initial size: 75.0% of memory, CuDNN 4007)
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 25s - loss: 0.2538 - acc: 0.9227 - val_loss: 0.0534 - val_acc: 0.9832
Epoch 2/12
60000/60000 [==============================] - 24s - loss: 0.0945 - acc: 0.9714 - val_loss: 0.0378 - val_acc: 0.9876
Epoch 3/12
60000/60000 [==============================] - 24s - loss: 0.0704 - acc: 0.9787 - val_loss: 0.0355 - val_acc: 0.9883
Epoch 4/12
60000/60000 [==============================] - 24s - loss: 0.0584 - acc: 0.9830 - val_loss: 0.0331 - val_acc: 0.9893
Epoch 5/12
60000/60000 [==============================] - 24s - loss: 0.0489 - acc: 0.9848 - val_loss: 0.0305 - val_acc: 0.9897
Epoch 6/12
60000/60000 [==============================] - 24s - loss: 0.0428 - acc: 0.9870 - val_loss: 0.0315 - val_acc: 0.9901
Epoch 7/12
60000/60000 [==============================] - 24s - loss: 0.0383 - acc: 0.9880 - val_loss: 0.0305 - val_acc: 0.9910
Epoch 8/12
60000/60000 [==============================] - 24s - loss: 0.0373 - acc: 0.9881 - val_loss: 0.0298 - val_acc: 0.9903
Epoch 9/12
60000/60000 [==============================] - 24s - loss: 0.0320 - acc: 0.9901 - val_loss: 0.0286 - val_acc: 0.9911
Epoch 10/12
60000/60000 [==============================] - 24s - loss: 0.0311 - acc: 0.9902 - val_loss: 0.0284 - val_acc: 0.9913
Epoch 11/12
60000/60000 [==============================] - 24s - loss: 0.0282 - acc: 0.9910 - val_loss: 0.0290 - val_acc: 0.9910
Epoch 12/12
60000/60000 [==============================] - 24s - loss: 0.0264 - acc: 0.9916 - val_loss: 0.0296 - val_acc: 0.9910
Test score: 0.0296402487505
Test accuracy: 0.991
Using Theano backend.
Using gpu device 0: GeForce GT 750M (CNMeM is enabled with initial size: 75.0% of memory, CuDNN 4007)
X_train shape: (60000, 1, 28, 28)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
60000/60000 [==============================] - 25s - loss: 0.2543 - acc: 0.9227 - val_loss: 0.0574 - val_acc: 0.9819
Epoch 2/12
60000/60000 [==============================] - 24s - loss: 0.0939 - acc: 0.9719 - val_loss: 0.0403 - val_acc: 0.9869
Epoch 3/12
60000/60000 [==============================] - 24s - loss: 0.0709 - acc: 0.9789 - val_loss: 0.0371 - val_acc: 0.9870
Epoch 4/12
60000/60000 [==============================] - 24s - loss: 0.0584 - acc: 0.9828 - val_loss: 0.0318 - val_acc: 0.9888
Epoch 5/12
60000/60000 [==============================] - 24s - loss: 0.0492 - acc: 0.9850 - val_loss: 0.0292 - val_acc: 0.9900
Epoch 6/12
60000/60000 [==============================] - 24s - loss: 0.0420 - acc: 0.9867 - val_loss: 0.0313 - val_acc: 0.9897
Epoch 7/12
60000/60000 [==============================] - 24s - loss: 0.0393 - acc: 0.9875 - val_loss: 0.0303 - val_acc: 0.9905
Epoch 8/12
60000/60000 [==============================] - 24s - loss: 0.0372 - acc: 0.9883 - val_loss: 0.0293 - val_acc: 0.9914
Epoch 9/12
60000/60000 [==============================] - 24s - loss: 0.0311 - acc: 0.9907 - val_loss: 0.0279 - val_acc: 0.9909
Epoch 10/12
60000/60000 [==============================] - 24s - loss: 0.0319 - acc: 0.9900 - val_loss: 0.0269 - val_acc: 0.9920
Epoch 11/12
60000/60000 [==============================] - 24s - loss: 0.0282 - acc: 0.9914 - val_loss: 0.0283 - val_acc: 0.9913
Epoch 12/12
60000/60000 [==============================] - 24s - loss: 0.0270 - acc: 0.9916 - val_loss: 0.0312 - val_acc: 0.9907
Test score: 0.0312398415336
Test accuracy: 0.9907
I'm on terrible internet right now and can't double-check that Keras and Theano are 100% up to date, but pip says I'm at Keras 1.0.1, Theano 0.8.1, numpy 1.11.0.
Running this script twice produces identical output, as expected:
import numpy
numpy.random.seed(1337)
print(numpy.random.random(3))
Can't reproduce; mnist_cnn is deterministic on my machine with Theano (OSX).
The only sources of randomness in Keras are Numpy's random module (used for weight initializations) and Theano's RNGs (used e.g. for dropout), and those are all seeded from Numpy's random state. Seeding the Numpy RNG should work. Sorry I can't help you further, but this seems specific to your system...
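For example, a minimal sketch of what that seeding looks like (the seed value is arbitrary; what matters is seeding before Keras builds any layers):
import numpy as np
np.random.seed(1337)  # seed numpy before any Keras layers are built

from keras.models import Sequential
from keras.layers import Dense

# with the seed fixed, the initial weights below are identical across runs
model = Sequential()
model.add(Dense(10, input_dim=20))
print(model.get_weights()[0][:2])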
cuDNN's backward pass is by default non-deterministic. See http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html
The cuDNN documentation states that reproducibility is not guaranteed with the default implementation of the following two operations: cudnnConvolutionBackwardFilter and cudnnConvolutionBackwardData. These correspond to the gradient with respect to the weights and the gradient with respect to the input of the convolution. They are also sometimes used in the forward pass, when they give a speedup.
The Theano flag dnn.conv.algo_bwd can be used to force a slower but deterministic convolution implementation.
Maybe that's the reason?
@NasenSpray thanks! That was the issue. Back on good internet, I double-checked that numpy, Theano and Keras are 100% up to date, and verified the options to fix this:
THEANO_FLAGS="optimizer_excluding=conv_dnn" python mnist_cnn.py, which takes twice as long at around 60s/epoch instead of 25s/epoch.THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py (the flag mentioned in the quoted post has been deprecated in favor of two flags). This actually runs in the same time or slightly faster on my machine, at 24s/epoch consistently.Going to close this as it's not really a "bug" with Keras itself, so much as different behavior between different libraries. Though it might make sense to remove the np.random.seed(1337) at the top of many of the examples, now that it's clear this line doesn't guarantee anything.
@kylemcdonald that doesn't fix random seeding in the cifar10 example for me.
@giorgiop to be clear:
np.random.seed(1337) near the top of the file. Is that correct?
Actually, that works! I was testing some other code with convolutional layers, which stayed non-deterministic until I applied the trick you explained above. Then I went back to the cifar10 example, which still had the problem. But now I notice that this script is one of the few examples in the folder that doesn't fix a random seed. Thanks
Hi, this might seem trivial, but how would I declare
THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py
in my .theanorc file?
When I do the following
[dnn]
enabled = True
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64
conv.algo_bwd_filter=deterministic
conv.algo_bwd_data=deterministic
the code is non-deterministic, but when I run
THEANO_FLAGS="dnn.conv.algo_bwd_filter=deterministic,dnn.conv.algo_bwd_data=deterministic" python mnist_cnn.py
the code is deterministic. It must be my syntax that is wrong, but I get no errors. Does anyone know the proper syntax? I would rather have it in .theanorc than set it in my .bashrc.
Found the solution to my own problem through trial and error. For those wondering, this is it:
[dnn]
enabled = True
include_path = /usr/local/cuda/include
library_path = /usr/local/cuda/lib64
[dnn.conv]
algo_bwd_data=deterministic
algo_bwd_filter=deterministic
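If it helps, you can also verify the options were actually picked up by printing them from theano.config (a quick check; I'm assuming your Theano version exposes the two split flags, as the deprecation note above implies):
import theano
print(theano.config.dnn.conv.algo_bwd_data)    # should print 'deterministic'
print(theano.config.dnn.conv.algo_bwd_filter)  # should print 'deterministic'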
It seems that I solved this problem in this way:
http://blog.csdn.net/qq_33039859/article/details/75452813
Step 1: fix the numpy random seed at the top of the code.
Step 2: make sure you call model.fit(..., shuffle=False).
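For reference, a minimal runnable sketch of those two steps on a toy model (using Keras 2 argument names such as epochs; the layer sizes and data here are arbitrary). Note that on GPU this alone may still not be sufficient, as discussed above:
import numpy as np
np.random.seed(1337)  # step 1: fix the numpy seed before building the model

from keras.models import Sequential
from keras.layers import Dense

# toy data, generated after seeding so it is identical across runs
X = np.random.random((100, 20))
y = np.random.randint(2, size=(100, 1))

model = Sequential()
model.add(Dense(8, input_dim=20, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# step 2: disable batch shuffling so samples are visited in the same order each run
model.fit(X, y, epochs=3, batch_size=16, shuffle=False)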
What @GuokaiLiu proposes doesn't seem to solve the randomness in my case, and neither does what is proposed in https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development . Any clues?
@kylemcdonald How do I set these flags for the TensorFlow backend?
Hi,
https://keras.io/getting-started/faq/#how-can-i-obtain-reproducible-results-using-keras-during-development also doesn't solve my problem. I run the model on GPU, and on top of the inconsistency there is a sizeable difference between the results (measured with my own metric), e.g. run 1: 0.878, run 2: 0.858, run 3: 0.861.
My data set is fixed, so I'm sure I give the same training and test sets to the model. I can't understand how randomness could affect the performance this much. Or is there something wrong with my model?
@hkmztrk Did you use cuDNN?
Is there any option to switch off cuDNN in Keras/TensorFlow?
@mrgloom how do we tell whether cuDNN was used or not? I was assuming tensorflow-gpu uses cuDNN.
Here is my startup log; it shows libcudnn.so.5 being opened, so cuDNN is indeed in use:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.683
pciBusID 0000:02:00.0
Total memory: 7.92GiB
Free memory: 7.83GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0)
Also, after updating my code as follows, I'm getting an "Exit 139" error (which I interpret as a segmentation fault) in two of my three runs:
import numpy as np
import tensorflow as tf
import random as rn
import os

# note: PYTHONHASHSEED only disables hash randomization if it is set before the
# interpreter starts, so exporting it in the shell is more reliable than this
os.environ['PYTHONHASHSEED'] = '0'

# seed the NumPy and Python RNGs
np.random.seed(1)
rn.seed(1)

# force single-threaded execution; thread scheduling is a source of non-determinism
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)

import keras
from keras import backend as K

# seed the TensorFlow graph-level RNG and install the single-threaded session
tf.set_random_seed(1234)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
K.set_session(sess)