I tried to use random_seed in the solver to produce deterministic results, but to no avail.
Here is a very simple solver that demonstrates the problem:
test_iter: 200
test_interval: 500
base_lr: 0.1
display: 100
max_iter: 40000
lr_policy: "multistep"
gamma: 0.1
momentum: 0.9
weight_decay: 0.001
snapshot: 500
snapshot_prefix: "examples/cifar10/cif"
solver_mode: GPU
device_id: 0
random_seed: 1201
net: "examples/cifar10/cifar10_full_relu_train_test_bn.prototxt"
delta: 0.001
stepvalue: 4000
stepvalue: 6500
stepvalue: 9500
stepvalue: 12000
stepvalue: 15000
stepvalue: 18000
type: "AdaDelta"
And this is how the results differ between runs:

log1:
I0520 22:56:57.964998 8952 solver.cpp:280] Learning Rate Policy: multistep
I0520 22:56:57.965998 8952 solver.cpp:337] Iteration 0, Testing net (#0)
I0520 22:56:58.506774 8952 blocking_queue.cpp:50] Data layer prefetch queue empty
I0520 22:56:58.529290 10292 blocking_queue.cpp:50] Waiting for data
I0520 22:56:58.765458 8952 solver.cpp:404] Test net output #0: accuracy = 0.0994001
I0520 22:56:58.765458 8952 solver.cpp:404] Test net output #1: loss = 78.6553 (* 1 = 78.6553 loss)
I0520 22:56:58.817082 8952 solver.cpp:228] Iteration 0, loss = 2.37327
I0520 22:56:58.817082 8952 solver.cpp:244] Train net output #0: loss = 2.37327 (* 1 = 2.37327 loss)
I0520 22:56:58.817082 8952 sgd_solver.cpp:106] Iteration 0, lr = 0.1
I0520 22:57:01.425292 8952 solver.cpp:228] Iteration 100, loss = 1.84951
I0520 22:57:01.425292 8952 solver.cpp:244] Train net output #0: loss = 1.84951 (* 1 = 1.84951 loss)
I0520 22:57:01.425292 8952 sgd_solver.cpp:106] Iteration 100, lr = 0.1
I0520 22:57:04.037649 8952 solver.cpp:228] Iteration 200, loss = 1.83825
I0520 22:57:04.037649 8952 solver.cpp:244] Train net output #0: loss = 1.83825 (* 1 = 1.83825 loss)
I0520 22:57:04.037649 8952 sgd_solver.cpp:106] Iteration 200, lr = 0.1
I0520 22:57:06.640002 8952 solver.cpp:228] Iteration 300, loss = 1.65452
I0520 22:57:06.640002 8952 solver.cpp:244] Train net output #0: loss = 1.65452 (* 1 = 1.65452 loss)
I0520 22:57:06.640002 8952 sgd_solver.cpp:106] Iteration 300, lr = 0.1
I0520 22:57:09.276876 8952 solver.cpp:228] Iteration 400, loss = 1.49125
I0520 22:57:09.276876 8952 solver.cpp:244] Train net output #0: loss = 1.49125 (* 1 = 1.49125 loss)
I0520 22:57:09.276876 8952 sgd_solver.cpp:106] Iteration 400, lr = 0.1
I0520 22:57:11.879228 8952 solver.cpp:454] Snapshotting to binary proto file examples/cifar10/cif_iter_500.caffemodel
I0520 22:57:11.891237 8952 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/cifar10/cif_iter_500.solverstate
I0520 22:57:11.892738 8952 solver.cpp:337] Iteration 500, Testing net (#0)
I0520 22:57:12.583729 8952 solver.cpp:404] Test net output #0: accuracy = 0.4375
I0520 22:57:12.583729 8952 solver.cpp:404] Test net output #1: loss = 1.51292 (* 1 = 1.51292 loss)
I0520 22:57:12.594238 8952 solver.cpp:228] Iteration 500, loss = 1.59907
I0520 22:57:12.594238 8952 solver.cpp:244] Train net output #0: loss = 1.59907 (* 1 = 1.59907 loss)
I0520 22:57:12.594238 8952 sgd_solver.cpp:106] Iteration 500, lr = 0.1
I0520 22:57:15.189584 8952 solver.cpp:228] Iteration 600, loss = 1.40691
I0520 22:57:15.189584 8952 solver.cpp:244] Train net output #0: loss = 1.40691 (* 1 = 1.40691 loss)
I0520 22:57:15.189584 8952 sgd_solver.cpp:106] Iteration 600, lr = 0.1
I0520 22:57:17.790434 8952 solver.cpp:228] Iteration 700, loss = 1.48115
I0520 22:57:17.790434 8952 solver.cpp:244] Train net output #0: loss = 1.48115 (* 1 = 1.48115 loss)
I0520 22:57:17.790434 8952 sgd_solver.cpp:106] Iteration 700, lr = 0.1
I0520 22:57:20.370769 8952 solver.cpp:228] Iteration 800, loss = 1.35165
I0520 22:57:20.370769 8952 solver.cpp:244] Train net output #0: loss = 1.35165 (* 1 = 1.35165 loss)
I0520 22:57:20.370769 8952 sgd_solver.cpp:106] Iteration 800, lr = 0.1
I0520 22:57:22.956110 8952 solver.cpp:228] Iteration 900, loss = 1.20441
I0520 22:57:22.956110 8952 solver.cpp:244] Train net output #0: loss = 1.20441 (* 1 = 1.20441 loss)
I0520 22:57:22.956110 8952 sgd_solver.cpp:106] Iteration 900, lr = 0.1
I0520 22:57:25.603274 8952 solver.cpp:454] Snapshotting to binary proto file examples/cifar10/cif_iter_1000.caffemodel
I0520 22:57:25.611498 8952 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/cifar10/cif_iter_1000.solverstate
I0520 22:57:25.612999 8952 solver.cpp:337] Iteration 1000, Testing net (#0)
I0520 22:57:26.383548 8952 solver.cpp:404] Test net output #0: accuracy = 0.5235
I0520 22:57:26.383548 8952 solver.cpp:404] Test net output #1: loss = 1.30647 (* 1 = 1.30647 loss)
I0520 22:57:26.395056 8952 solver.cpp:228] Iteration 1000, loss = 1.24928
I0520 22:57:26.395056 8952 solver.cpp:244] Train net output #0: loss = 1.24928 (* 1 = 1.24928 loss)
I0520 22:57:26.395056 8952 sgd_solver.cpp:106] Iteration 1000, lr = 0.1
I0520 22:57:29.094476 8952 solver.cpp:228] Iteration 1100, loss = 1.15243
I0520 22:57:29.094976 8952 solver.cpp:244] Train net output #0: loss = 1.15243 (* 1 = 1.15243 loss)
log2:
I0520 23:00:23.213491 9848 solver.cpp:280] Learning Rate Policy: multistep
I0520 23:00:23.213491 9848 solver.cpp:337] Iteration 0, Testing net (#0)
I0520 23:00:23.616646 9848 blocking_queue.cpp:50] Data layer prefetch queue empty
I0520 23:00:23.618648 10420 blocking_queue.cpp:50] Waiting for data
I0520 23:00:23.970897 9848 solver.cpp:404] Test net output #0: accuracy = 0.0994001
I0520 23:00:23.970897 9848 solver.cpp:404] Test net output #1: loss = 78.6553 (* 1 = 78.6553 loss)
I0520 23:00:24.019619 9848 solver.cpp:228] Iteration 0, loss = 2.37327
I0520 23:00:24.019619 9848 solver.cpp:244] Train net output #0: loss = 2.37327 (* 1 = 2.37327 loss)
I0520 23:00:24.019619 9848 sgd_solver.cpp:106] Iteration 0, lr = 0.1
I0520 23:00:26.558173 9848 solver.cpp:228] Iteration 100, loss = 1.79718
I0520 23:00:26.558173 9848 solver.cpp:244] Train net output #0: loss = 1.79718 (* 1 = 1.79718 loss)
I0520 23:00:26.558173 9848 sgd_solver.cpp:106] Iteration 100, lr = 0.1
I0520 23:00:29.119212 9848 solver.cpp:228] Iteration 200, loss = 1.79569
I0520 23:00:29.119212 9848 solver.cpp:244] Train net output #0: loss = 1.79569 (* 1 = 1.79569 loss)
I0520 23:00:29.119212 9848 sgd_solver.cpp:106] Iteration 200, lr = 0.1
I0520 23:00:31.674723 9848 solver.cpp:228] Iteration 300, loss = 1.62069
I0520 23:00:31.674723 9848 solver.cpp:244] Train net output #0: loss = 1.62069 (* 1 = 1.62069 loss)
I0520 23:00:31.674723 9848 sgd_solver.cpp:106] Iteration 300, lr = 0.1
I0520 23:00:34.230991 9848 solver.cpp:228] Iteration 400, loss = 1.42211
I0520 23:00:34.230991 9848 solver.cpp:244] Train net output #0: loss = 1.42211 (* 1 = 1.42211 loss)
I0520 23:00:34.230991 9848 sgd_solver.cpp:106] Iteration 400, lr = 0.1
I0520 23:00:36.764794 9848 solver.cpp:454] Snapshotting to binary proto file examples/cifar10/cif_iter_500.caffemodel
I0520 23:00:36.780421 9848 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/cifar10/cif_iter_500.solverstate
I0520 23:00:36.780421 9848 solver.cpp:337] Iteration 500, Testing net (#0)
I0520 23:00:37.483541 9848 solver.cpp:404] Test net output #0: accuracy = 0.4521
I0520 23:00:37.483541 9848 solver.cpp:404] Test net output #1: loss = 1.48877 (* 1 = 1.48877 loss)
I0520 23:00:37.499166 9848 solver.cpp:228] Iteration 500, loss = 1.59037
I0520 23:00:37.499166 9848 solver.cpp:244] Train net output #0: loss = 1.59037 (* 1 = 1.59037 loss)
I0520 23:00:37.499166 9848 sgd_solver.cpp:106] Iteration 500, lr = 0.1
I0520 23:00:40.057332 9848 solver.cpp:228] Iteration 600, loss = 1.29694
I0520 23:00:40.057332 9848 solver.cpp:244] Train net output #0: loss = 1.29694 (* 1 = 1.29694 loss)
I0520 23:00:40.057332 9848 sgd_solver.cpp:106] Iteration 600, lr = 0.1
I0520 23:00:42.600935 9848 solver.cpp:228] Iteration 700, loss = 1.42686
I0520 23:00:42.600935 9848 solver.cpp:244] Train net output #0: loss = 1.42686 (* 1 = 1.42686 loss)
I0520 23:00:42.600935 9848 sgd_solver.cpp:106] Iteration 700, lr = 0.1
I0520 23:00:45.152452 9848 solver.cpp:228] Iteration 800, loss = 1.38337
I0520 23:00:45.152452 9848 solver.cpp:244] Train net output #0: loss = 1.38337 (* 1 = 1.38337 loss)
I0520 23:00:45.152452 9848 sgd_solver.cpp:106] Iteration 800, lr = 0.1
I0520 23:00:47.699352 9848 solver.cpp:228] Iteration 900, loss = 1.06408
I0520 23:00:47.699352 9848 solver.cpp:244] Train net output #0: loss = 1.06408 (* 1 = 1.06408 loss)
I0520 23:00:47.699352 9848 sgd_solver.cpp:106] Iteration 900, lr = 0.1
I0520 23:00:50.236433 9848 solver.cpp:454] Snapshotting to binary proto file examples/cifar10/cif_iter_1000.caffemodel
I0520 23:00:50.252058 9848 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/cifar10/cif_iter_1000.solverstate
I0520 23:00:50.252058 9848 solver.cpp:337] Iteration 1000, Testing net (#0)
I0520 23:00:50.955193 9848 solver.cpp:404] Test net output #0: accuracy = 0.5476
I0520 23:00:50.955193 9848 solver.cpp:404] Test net output #1: loss = 1.23861 (* 1 = 1.23861 loss)
I0520 23:00:50.955193 9848 solver.cpp:228] Iteration 1000, loss = 1.27772
I0520 23:00:50.955193 9848 solver.cpp:244] Train net output #0: loss = 1.27772 (* 1 = 1.27772 loss)
I0520 23:00:50.955193 9848 sgd_solver.cpp:106] Iteration 1000, lr = 0.1
Are you using cuDNN? If so, that library may not be deterministic (someone correct me if I'm wrong).
Yes, I'm using cuDNN.
So would changing the engine type do the trick, e.g. changing the engine to CAFFE? If so, should I change the engine for all layers, or only for the convolution and pooling layers?
CuDNN max pooling and convolutions are nondeterministic (IIRC algo_t = 1 for fwd/bwd_data/bwd_filter is guaranteed deterministic). This is documented in the PDF documentation distributed with cuDNN. You should switch to engine: CAFFE for all layers where you need determinism.
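For reference, forcing a layer onto the Caffe engine is done per layer in the net prototxt. This is only a sketch — the layer name, blob names, and parameter values below are hypothetical; the relevant line is the engine field:

```
layer {
  name: "conv1"          # hypothetical layer name
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 32
    kernel_size: 5
    engine: CAFFE        # use the Caffe implementation instead of cuDNN
  }
}
```

Pooling layers take the same field inside pooling_param { engine: CAFFE }.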
@ajtulloch Thank you.
How does the performance change if we switch to the Caffe engine? Will it degrade as badly as CPU performance?
@Coderx7 I think it will be around 25-50% slower, depending on which model you are using. It will still be much faster than CPU performance.
You can also try setting fwd_algo[i] = 1; bwd_*_algo[i] = 1; in CuDNNConvolutionLayer::Reshape to get a deterministic variant of CuDNN which is a bit faster than engine: CAFFE.
Thanks a lot, everyone :)
@Coderx7 Hi, when you set engine 'caffe' or modify fwd_algo, Does it hurt accuracy?
@ujsyehao Hi, I'm not using Caffe anymore (that was 2 years ago!); I'm using PyTorch now and highly recommend it!
Hi, sorry that I'm late to the conversation.
If you are setting random_seed in the Caffe solver, then you also need to add this to your Python script:
import numpy as np; np.random.seed(seed)
Here, the value of seed should be the same as the seed you used in your Caffe solver.
If you are coding in C++ rather than Python, you need the C++ equivalent of np.random.seed(seed) in your script. Then you should see the same accuracy and loss every time. I think the C++ call is something like 'google.seed(seed)', but I have forgotten the exact command.
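A minimal sketch of the Python-side seeding described above. The seed value here mirrors the random_seed from the solver prototxt; the caffe.set_random_seed call is commented out because whether your pycaffe build exposes it is an assumption:

```python
import numpy as np

SEED = 1201  # same value as random_seed in the solver prototxt

# Seed the Python-side RNG in addition to the solver's random_seed.
np.random.seed(SEED)

# If your pycaffe build exposes it, seed Caffe's own RNG from Python too
# (assumption about your build -- uncomment if available):
# import caffe
# caffe.set_random_seed(SEED)

# With the NumPy seed fixed, Python-side draws are reproducible:
a = np.random.rand(3)
np.random.seed(SEED)
b = np.random.rand(3)
assert np.allclose(a, b)
```

Note that seeding alone is not sufficient while nondeterministic cuDNN kernels are in use; it only removes the Python-side source of randomness.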