Keras: Loss Increases after some epochs

Created on 11 Aug 2017 · 12 Comments · Source: keras-team/keras

I have tried different convolutional neural network codes and I keep running into the same issue. The network starts out training well and the loss decreases, but after some time the loss just starts to increase. An example is shown below:
Epoch 15/800
1562/1562 [==============================] - 49s - loss: 0.9050 - acc: 0.6827 - val_loss: 0.7667 - val_acc: 0.7323
Epoch 16/800
1562/1562 [==============================] - 49s - loss: 0.8906 - acc: 0.6864 - val_loss: 0.7404 - val_acc: 0.7434
Epoch 380/800
1562/1562 [==============================] - 49s - loss: 1.5519 - acc: 0.4880 - val_loss: 1.4250 - val_acc: 0.5233
Epoch 381/800
1562/1562 [==============================] - 48s - loss: 1.5416 - acc: 0.4897 - val_loss: 1.5032 - val_acc: 0.4868
Epoch 800/800
1562/1562 [==============================] - 49s - loss: 1.8483 - acc: 0.3402 - val_loss: 1.9454 - val_acc: 0.2398

I have tried this on different CIFAR-10 architectures I have found on GitHub. I am training on a Titan X Pascal GPU. This only happens when I train the network in batches with data augmentation. I have changed the optimizer, the initial learning rate, etc. I just want a CIFAR-10 model with good enough accuracy for my tests, so any help will be appreciated. The code is from this:
https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py

Most helpful comment

Look: when using raw SGD, you take the gradient of the loss function w.r.t. the parameters (the direction in which the function value increases) and move a little bit in the opposite direction (in order to minimize the loss function).
Different optimizers are built on top of SGD and use extra ideas (momentum, learning rate decay, etc.) to make convergence faster.
If you look at how momentum works, you'll see where the problem is. In the beginning, the optimizer may move in the same (correct) direction for a long time, which builds up a very large momentum term. Later, the negative gradient may no longer agree with that momentum, causing the optimizer to "climb hills" (reach higher loss values) for a while, but it may eventually fix itself.
(I encourage you to look at how momentum works.)
https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum

All 12 comments

I believe you have tried different optimizers, but please try raw SGD with a smaller initial learning rate.
Most likely the optimizer builds up a large momentum and, from some point on, keeps moving in the wrong direction.

So something like this?

from keras.optimizers import SGD

lrate = 0.001
decay = lrate / epochs  # epochs as defined for training, e.g. 800
sgd = SGD(lr=lrate, momentum=0.90, decay=decay, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

No, without any momentum or decay, just raw SGD.

model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
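For reference, the string 'SGD' just picks the optimizer with its default settings; a minimal sketch of the explicit equivalent, assuming Keras 2's defaults (learning rate 0.01, no momentum, no decay, no Nesterov):

from keras.optimizers import SGD

# Explicit equivalent of optimizer='SGD' (assuming Keras 2 defaults).
sgd = SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])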

Thanks, that works. I was wondering if you know why that is?

Look: when using raw SGD, you take the gradient of the loss function w.r.t. the parameters (the direction in which the function value increases) and move a little bit in the opposite direction (in order to minimize the loss function).
Different optimizers are built on top of SGD and use extra ideas (momentum, learning rate decay, etc.) to make convergence faster.
If you look at how momentum works, you'll see where the problem is. In the beginning, the optimizer may move in the same (correct) direction for a long time, which builds up a very large momentum term. Later, the negative gradient may no longer agree with that momentum, causing the optimizer to "climb hills" (reach higher loss values) for a while, but it may eventually fix itself.
(I encourage you to look at how momentum works.)
https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Momentum
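To make the mechanism concrete, here is a minimal sketch of the SGD-with-momentum update rule (plain Python, illustrative only; the function name is mine, not a Keras API). The velocity term accumulates past gradients and can keep pushing the parameters in an old direction even after the gradient flips sign:

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    # The velocity is a decaying sum of past gradient steps. If many past
    # gradients pointed the same way, it grows large and can dominate the
    # current gradient, temporarily pushing the loss uphill.
    velocity = momentum * velocity - lr * grad
    w = w + velocity
    return w, velocity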

Ok, I will definitely keep this in mind in the future. Thanks for the help.

Hello,
I'm using a CNN for regression and the MAE metric to evaluate the model's performance. But I noticed that the loss, val_loss, mean_absolute_error and val_mean_absolute_error stop changing after some epochs.

My loss was at 0.05, but after some epochs it went up to 15, even with raw SGD. Training for many epochs did not have this effect with Adam, only with the SGD optimizer.
Please help.

@mahnerak
Hi, thank you for your explanation. I experienced a similar problem.

BTW, I have a question about _"but it may eventually fix itself"_.
Does it mean the loss can start going down again after many more epochs, even with momentum, at least in theory?

Thanks in advance.

Hi @kouohhashi,
I suggest reading this Distill publication: https://distill.pub/2017/momentum/

The authors mention: "It is possible, however, to construct very specific counterexamples where momentum does not converge, even on convex functions."
Please also take a look at https://arxiv.org/abs/1408.3595 for more details.
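As a toy illustration (my own numbers, not from the papers above): gradient descent with a large momentum coefficient on the convex quadratic f(w) = 0.5 * w**2 first shrinks the loss, then overshoots so the loss climbs for several steps, and then oscillates back down:

# Toy example: momentum overshoot on f(w) = 0.5 * w**2 (illustrative values).
w, v = 5.0, 0.0
lr, momentum = 0.1, 0.95
for step in range(30):
    grad = w                       # derivative of 0.5 * w**2
    v = momentum * v - lr * grad   # velocity keeps part of the old direction
    w = w + v
    print(step, round(0.5 * w ** 2, 4))  # loss dips, climbs back up, then oscillates down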

Are you suggesting that momentum be removed altogether, or only for troubleshooting? If you mean the latter, how should one use momentum after debugging?
Thanks.

Increase the batch size, and keep an eye on memory usage.
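For reference, a minimal sketch of where the batch size enters a data-augmentation training loop, assuming the datagen, x_train, y_train, x_test, y_test and epochs names from the linked cifar10_cnn.py example; the value below is illustrative, and larger batches need more GPU memory:

batch_size = 128  # illustrative; larger values consume more GPU memory

model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                    steps_per_epoch=len(x_train) // batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test))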
