Hi,
I'm running the MNIST example of tutorial 2 but my CE and error rates do not go down. I can let it run for longer, but the error doesn't seem to change no matter how long I let it train.
I'm running the latest binaries from the website (v1.5). The config file that comes with this distribution is a bit different from the config file on the website. However, it doesn't matter which one I use: the CE just won't go down. I tried increasing the learning rate and even setting it to zero, but nothing changes. I also tried setting makeMode=false, but that doesn't make a difference either.
Any idea on what I'm doing wrong or how I can debug this?
Here is an example of the output I get:
Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30255688 * 16000; errTop1 = 0.89431250 * 16000; err = 0.89431250 * 16000; time = 1.9488s; samplesPerSecond = 8210.3
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30221021 * 16000; errTop1 = 0.89225000 * 16000; err = 0.89225000 * 16000; time = 0.6881s; samplesPerSecond = 23252.3
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30195801 * 16000; errTop1 = 0.88543750 * 16000; err = 0.88543750 * 16000; time = 0.6977s; samplesPerSecond = 22932.7
Finished Epoch[ 1 of 30]: [Training] ce = 2.30196224 * 60000; errTop1 = 0.89016667 * 60000; err = 0.89016667 * 60000; totalSamplesSeen = 60000; learningRatePerSample = 0.003125; epochTime=3.9701s
SGD: Saving checkpoint model './Output/Models/01_OneHidden.1'
... and these numbers remain the same no matter how many epochs are completed:
Starting minibatch loop.
Epoch[28 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30140747 * 16000; errTop1 = 0.88781250 * 16000; err = 0.88781250 * 16000; time = 2.3025s; samplesPerSecond = 6948.8
Epoch[28 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30252905 * 16000; errTop1 = 0.88993750 * 16000; err = 0.88993750 * 16000; time = 0.8579s; samplesPerSecond = 18649.1
Epoch[28 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30222266 * 16000; errTop1 = 0.88981250 * 16000; err = 0.88981250 * 16000; time = 0.8466s; samplesPerSecond = 18900.1
Finished Epoch[28 of 30]: [Training] ce = 2.30194245 * 60000; errTop1 = 0.88906667 * 60000; err = 0.88906667 * 60000; totalSamplesSeen = 1680000; learningRatePerSample = 0.003125; epochTime=4.77177s
SGD: Saving checkpoint model './Output/Models/01_OneHidden.28'
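A quick sanity check on the numbers in the logs above: a ce of about 2.3026 is ln(10), which is the cross-entropy of a model that assigns equal probability to all 10 digit classes, and an error close to 0.9 is what guessing among 10 classes gives. In other words, the network is staying at chance level rather than learning slowly. The sketch below only reproduces that arithmetic; the function name is made up for illustration.

```python
import math

NUM_CLASSES = 10  # MNIST has the digits 0-9


def chance_level(num_classes):
    """Return (cross-entropy, error rate) of a uniform guess over num_classes."""
    ce = math.log(num_classes)        # -ln(1 / num_classes)
    err = 1.0 - 1.0 / num_classes     # probability that a random guess is wrong
    return ce, err


ce, err = chance_level(NUM_CLASSES)
print(f"chance-level ce  = {ce:.4f}")
print(f"chance-level err = {err:.4f}")
```

If the logged ce sits at this value from the first minibatch to the last epoch, the weights are effectively not being updated in a useful way, which points to the setup or the math library rather than to training time.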
Same problem here.
The funny thing is that on my laptop everything works fine and I get low error rates. However, when I switch to my PC (both in CPU-only mode), I get very high error rates like yours :-)
Since I have "special/different" user rights on my PC, my first guess is that there might be some read-only permission issue. Other than that, I have no idea whatsoever.
I've got the same issue. It works fine with the GPU version, but with the CPU version of the compiled binary the error doesn't go down.
Built time: Jun 6 2016 13:12:33
Last modified date: Sat Jun 4 21:28:41 2016
Build type: Release
Build target: CPU-only
With 1bit-SGD: no
Build Branch: HEAD
Build SHA1: b7ed8dc9e5cd8ab35f4badae86dd42e93e9f2564
Built by svcphil on LIANA-09-w
Build Path: c:\jenkins\workspace\CNTK-Build-Windows\Source\CNTK
It was working before (with a previous version, perhaps?). I have installed Visual Studio 2013 on this machine since then. Note that this exact version works fine on my other machine.
Try reducing the learning rate (e.g., by a factor of 3) and see if it helps. AFAIR, the CNTK team has observed this behavior before on CPU-only runs.
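If you want to try this, here is a rough sketch of the arithmetic, using only numbers from the logs in this thread: the per-sample rate of 0.003125 and a minibatch size of 32 (16000 samples reported per 500 minibatches). The function and variable names are made up for illustration, and you would still have to put the resulting value into your own config file.

```python
SAMPLES_PER_REPORT = 16000       # from "* 16000" in the log lines
MINIBATCHES_PER_REPORT = 500     # from "Minibatch[ 1- 500]"
LR_PER_SAMPLE = 0.003125         # from "learningRatePerSample = 0.003125"


def reduced_rates(lr_per_sample, minibatch_size, factor):
    """Divide the per-sample rate by factor; also return the per-minibatch rate."""
    new_per_sample = lr_per_sample / factor
    return new_per_sample, new_per_sample * minibatch_size


minibatch_size = SAMPLES_PER_REPORT // MINIBATCHES_PER_REPORT
per_sample, per_minibatch = reduced_rates(LR_PER_SAMPLE, minibatch_size, 3)

print(f"minibatch size      = {minibatch_size}")
print(f"new rate per sample = {per_sample:.8f}")
print(f"new rate per MB     = {per_minibatch:.6f}")
```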
I'm running the MNIST 01_OneHidden example.
Starting Epoch 1: learning rate per sample = 0.003125 effective momentum = 0.000000 momentum as time constant = 0.0 samples
BlockRandomizer::StartEpoch: epoch 0: frames 0..60000, data subset 0 of 1
Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30255688 * 16000; top5Errs = 48.775% * 16000; errs = 89.431% * 16000; time = 1.7538s; samplesPerSecond = 9123.2
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30221021 * 16000; top5Errs = 49.019% * 16000; errs = 89.225% * 16000; time = 0.6833s; samplesPerSecond = 23415.8
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30195801 * 16000; top5Errs = 48.681% * 16000; errs = 88.544% * 16000; time = 0.6961s; samplesPerSecond = 22985.3
and basically stays there forever.
Starting Epoch 1: learning rate per sample = 0.003125 effective momentum = 0.000000 momentum as time constant = 0.0 samples
BlockRandomizer::StartEpoch: epoch 0: frames 0..60000, data subset 0 of 1
Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 1.28410046 * 16000; top5Errs = 9.194% * 16000; errs = 37.681% * 16000; time = 1.7513s; samplesPerSecond = 9135.9
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 0.49985193 * 16000; top5Errs = 1.063% * 16000; errs = 13.387% * 16000; time = 0.5487s; samplesPerSecond = 29159.6
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 0.40356787 * 16000; top5Errs = 0.831% * 16000; errs = 11.613% * 16000; time = 0.5327s; samplesPerSecond = 30036.7
Finished Epoch[ 1 of 30]: [Training] ce = 0.65467487 * 60000; top5Errs = 3.083% * 60000; errs = 18.725% * 60000; totalSamplesSeen = 60000; learningRatePerSample = 0.003125; epochTime=3.40137s
SGD: Saving checkpoint model '../Output/Models/01_OneHidden.1'
Progressing well.
Whilst I will be using the GPU predominantly, I have written a very basic front-end with a performance tracking DB for model runs and I noticed this today while testing my code.
Make sure you
setx ACML_FMA 0
Check https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows
It's an ACML bug.
Ah, and so it is, and fixed itself after a reboot. I thought I was immune as the CPU version had behaved perfectly up until now during 3 weeks of playing around. Beware of the ACML bug!
Thanks heaps. I hope it also helps TheRamones.
Worked for me too. Thanks.
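For anyone who wants to confirm that the workaround is actually in effect before starting a CPU run, here is a minimal sketch that checks the variable from Python. The helper name is made up for illustration. Keep in mind that setx only affects processes started after it was run, so the check has to be done from a new command prompt (or after logging in again or rebooting).

```python
import os


def acml_fma_disabled(env=None):
    """Return True if ACML_FMA is set to 0 in the given environment mapping."""
    env = os.environ if env is None else env
    return env.get("ACML_FMA", "").strip() == "0"


if acml_fma_disabled():
    print("ACML_FMA=0 is set for this process.")
else:
    print("ACML_FMA is not 0 here; run 'setx ACML_FMA 0' and open a new prompt.")
```

Passing a dictionary instead of relying on os.environ makes the helper easy to try out, e.g. acml_fma_disabled({"ACML_FMA": "0"}).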