Cntk: Tutorial 2 - CE not going down (no matter what I do)

Created on 30 Jun 2016 · 7 Comments · Source: microsoft/CNTK

Hi,

I'm running the MNIST example of tutorial 2 but my CE and error rates do not go down. I can let it run for longer, but the error doesn't seem to change no matter how long I let it train.

I'm running the latest binaries from the website (v1.5). The config file that comes with this distribution is a bit different from the config file on the website. However, it doesn't matter which one I use: the CE just won't go down. I tried increasing the learning rate and even setting it to zero, but nothing seems to change the result. I tried setting makeMode=false, but that doesn't make a difference either.

Any idea on what I'm doing wrong or how I can debug this?

Examples of the output I get.

Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30255688 * 16000; errTop1 = 0.89431250 * 16000; err = 0.89431250 * 16000; time = 1.9488s; samplesPerSecond = 8210.3
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30221021 * 16000; errTop1 = 0.89225000 * 16000; err = 0.89225000 * 16000; time = 0.6881s; samplesPerSecond = 23252.3
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30195801 * 16000; errTop1 = 0.88543750 * 16000; err = 0.88543750 * 16000; time = 0.6977s; samplesPerSecond = 22932.7
Finished Epoch[ 1 of 30]: [Training] ce = 2.30196224 * 60000; errTop1 = 0.89016667 * 60000; err = 0.89016667 * 60000; totalSamplesSeen = 60000; learningRatePerSample = 0.003125; epochTime=3.9701s
SGD: Saving checkpoint model './Output/Models/01_OneHidden.1'

... and these numbers remain the same no matter how many epochs are completed:

Starting minibatch loop.
Epoch[28 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30140747 * 16000; errTop1 = 0.88781250 * 16000; err = 0.88781250 * 16000; time = 2.3025s; samplesPerSecond = 6948.8
Epoch[28 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30252905 * 16000; errTop1 = 0.88993750 * 16000; err = 0.88993750 * 16000; time = 0.8579s; samplesPerSecond = 18649.1
Epoch[28 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30222266 * 16000; errTop1 = 0.88981250 * 16000; err = 0.88981250 * 16000; time = 0.8466s; samplesPerSecond = 18900.1
Finished Epoch[28 of 30]: [Training] ce = 2.30194245 * 60000; errTop1 = 0.88906667 * 60000; err = 0.88906667 * 60000; totalSamplesSeen = 1680000; learningRatePerSample = 0.003125; epochTime=4.77177s
SGD: Saving checkpoint model './Output/Models/01_OneHidden.28'
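
A note on reading these numbers: a cross-entropy of about 2.3026 is what a 10-class classifier would score if it assigned equal probability to every digit, and an error near 0.9 is roughly what guessing gives. The small sketch below only checks that arithmetic against the epoch 28 values in the log; the variable names are made up for illustration, and it does not say anything about why the network is stuck.

```python
import math

NUM_CLASSES = 10  # MNIST digits 0-9

# Cross-entropy of a model that outputs probability 1/10 for every class.
chance_ce = math.log(NUM_CLASSES)

# Error rate of uniform guessing, assuming perfectly balanced classes
# (MNIST is only approximately balanced, so this is a rough reference).
chance_err = 1.0 - 1.0 / NUM_CLASSES

# Values copied from the "Finished Epoch[28 of 30]" line above.
reported_ce = 2.30194245
reported_err = 0.88906667

print(f"chance-level ce  = {chance_ce:.8f}, reported = {reported_ce:.8f}")
print(f"chance-level err = {chance_err:.4f}, reported = {reported_err:.4f}")
print("ce within 0.001 of chance level:", abs(reported_ce - chance_ce) < 1e-3)
```

If the check prints True, the model is effectively not learning at all, which points at something broken in the computation rather than at a badly tuned hyperparameter.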

Source

TheRamones


All 7 comments

Same problem here.

The funny thing is that on my laptop everything works fine and I get low error rates. However, when I switch to my PC (both in CPU-only mode), I get very high error rates like you :-)

Since I have "special/different" user rights on my PC, my first guess is that there might be some read-only permission issue. Other than that, I have no idea whatsoever.

Flav1u on 4 Jul 2016

😕1

I've got the same issue. It works fine with the GPU version, but with the CPU version of the compiled binary the error does not go down.

Built time: Jun 6 2016 13:12:33
Last modified date: Sat Jun 4 21:28:41 2016
Build type: Release
Build target: CPU-only
With 1bit-SGD: no
Build Branch: HEAD
Build SHA1: b7ed8dc9e5cd8ab35f4badae86dd42e93e9f2564
Built by svcphil on LIANA-09-w
Build Path: c:\jenkins\workspace\CNTK-Build-Windows\Source\CNTK

It was working before (on a previous version, perhaps?). I have installed Visual Studio 2013 on this machine since then. Note that this exact version works fine on my other machine.

PeterGriffioen on 6 Jul 2016

Try reducing the learning rate (e.g. by a factor of 3) and see if it helps. AFAIR, the CNTK team has observed this behavior previously on CPU-only runs.
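
To make the suggestion concrete, here is a rough sketch of the arithmetic using only numbers from the log in the question: each progress line covers 500 minibatches and 16000 samples, and the reported learningRatePerSample is 0.003125. The variable names are illustrative and are not CNTK config keys; how the rate is actually specified depends on the config file in use.

```python
# Numbers taken from the progress lines in the question.
samples_per_report = 16000       # "* 16000" on each progress line
minibatches_per_report = 500     # "Minibatch[ 1- 500, ...]"
lr_per_sample = 0.003125         # "learningRatePerSample = 0.003125"

# Samples per minibatch implied by the log.
minibatch_size = samples_per_report // minibatches_per_report

# Equivalent per-minibatch learning rate.
lr_per_minibatch = lr_per_sample * minibatch_size

# The suggested reduction by a factor of 3.
reduction_factor = 3
reduced_lr_per_minibatch = lr_per_minibatch / reduction_factor
reduced_lr_per_sample = lr_per_sample / reduction_factor

print("minibatch size:", minibatch_size)
print(f"current rate: {lr_per_minibatch:.4f} per minibatch, {lr_per_sample:.6f} per sample")
print(f"reduced rate: {reduced_lr_per_minibatch:.4f} per minibatch, {reduced_lr_per_sample:.6f} per sample")
```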

Alexey-Kamenev on 6 Jul 2016

I'm running this on the MNIST one-hidden-layer example (01_OneHidden).

cntk_cpu

Starting Epoch 1: learning rate per sample = 0.003125 effective momentum = 0.000000 momentum as time constant = 0.0 samples
BlockRandomizer::StartEpoch: epoch 0: frames 0..60000, data subset 0 of 1

Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30255688 * 16000; top5Errs = 48.775% * 16000; errs = 89.431% * 16000; time = 1.7538s; samplesPerSecond = 9123.2
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30221021 * 16000; top5Errs = 49.019% * 16000; errs = 89.225% * 16000; time = 0.6833s; samplesPerSecond = 23415.8
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30195801 * 16000; top5Errs = 48.681% * 16000; errs = 88.544% * 16000; time = 0.6961s; samplesPerSecond = 22985.3

and basically stays there forever.

cntk_gpu

Starting Epoch 1: learning rate per sample = 0.003125 effective momentum = 0.000000 momentum as time constant = 0.0 samples
BlockRandomizer::StartEpoch: epoch 0: frames 0..60000, data subset 0 of 1

Starting minibatch loop.
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 1.28410046 * 16000; top5Errs = 9.194% * 16000; errs = 37.681% * 16000; time = 1.7513s; samplesPerSecond = 9135.9
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 0.49985193 * 16000; top5Errs = 1.063% * 16000; errs = 13.387% * 16000; time = 0.5487s; samplesPerSecond = 29159.6
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 0.40356787 * 16000; top5Errs = 0.831% * 16000; errs = 11.613% * 16000; time = 0.5327s; samplesPerSecond = 30036.7
Finished Epoch[ 1 of 30]: [Training] ce = 0.65467487 * 60000; top5Errs = 3.083% * 60000; errs = 18.725% * 60000; totalSamplesSeen = 60000; learningRatePerSample = 0.003125; epochTime=3.40137s
SGD: Saving checkpoint model '../Output/Models/01_OneHidden.1'

Progressing well.
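
For anyone who wants to compare two runs like this without eyeballing the logs, a minimal sketch is below. It pulls the ce values out of progress lines with a regular expression and flags a run whose ce never drops noticeably below its first value. The function names and the 5% threshold are assumptions chosen for illustration, and the log lines are shortened copies of the ones above.

```python
import re

# Matches the "ce = <number>" field in a CNTK progress line.
CE_PATTERN = re.compile(r"ce = ([0-9]+\.[0-9]+)")

def ce_values(log_text):
    """Return every ce value found in the log text, in order of appearance."""
    return [float(m.group(1)) for m in CE_PATTERN.finditer(log_text)]

def looks_stalled(values, min_relative_drop=0.05):
    """True if ce never falls at least min_relative_drop below its first value.

    The 5% default is an arbitrary illustration threshold, not a CNTK setting.
    """
    if len(values) < 2:
        return False
    return min(values) > values[0] * (1.0 - min_relative_drop)

# Shortened copies of the CPU and GPU progress lines from this comment.
cpu_log = """
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 2.30255688 * 16000; errs = 89.431% * 16000
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 2.30221021 * 16000; errs = 89.225% * 16000
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 2.30195801 * 16000; errs = 88.544% * 16000
"""
gpu_log = """
Epoch[ 1 of 30]-Minibatch[ 1- 500, 26.67%]: ce = 1.28410046 * 16000; errs = 37.681% * 16000
Epoch[ 1 of 30]-Minibatch[ 501-1000, 53.33%]: ce = 0.49985193 * 16000; errs = 13.387% * 16000
Epoch[ 1 of 30]-Minibatch[1001-1500, 80.00%]: ce = 0.40356787 * 16000; errs = 11.613% * 16000
"""

print("CPU run stalled:", looks_stalled(ce_values(cpu_log)))
print("GPU run stalled:", looks_stalled(ce_values(gpu_log)))
```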

Whilst I will be using the GPU predominantly, I have written a very basic front-end with a performance tracking DB for model runs and I noticed this today while testing my code.

PeterGriffioen on 6 Jul 2016

Make sure you

setx ACML_FMA 0

Check https://github.com/Microsoft/CNTK/wiki/Setup-CNTK-on-Windows

It’s an ACML bug.
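
One thing to keep in mind: setx stores the variable for future processes and does not change command prompts that are already open, so CNTK needs to be started from a new prompt (or after logging in again) to see it. A small sketch for checking what a freshly started process actually sees is below; the helper name is made up for illustration, and the script only reads the environment, it does not change anything.

```python
import os

def acml_fma_disabled(env=None):
    """Return True if ACML_FMA is set to "0" in the given environment mapping.

    With no argument, the environment of the current process is checked.
    """
    env = os.environ if env is None else env
    return env.get("ACML_FMA", "").strip() == "0"

if __name__ == "__main__":
    if acml_fma_disabled():
        print("ACML_FMA=0 is visible to this process.")
    else:
        print("ACML_FMA is not set to 0 in this process; "
              "open a new command prompt after running setx and try again.")
```

Run it from the same kind of prompt that is used to launch cntk, since that is the environment the CPU build will inherit.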

dongyu888 on 6 Jul 2016

👍2

Ah, so it is, and it was fixed after a reboot. I thought I was immune, as the CPU version had behaved perfectly up until now during 3 weeks of playing around. Beware of the ACML bug!

Thanks heaps. I hope it also helps TheRamones.

PeterGriffioen on 6 Jul 2016

Worked for me too. Thanks.

Flav1u on 6 Jul 2016
