Dear all,
I'm new to Keras! I just played with the provided example script mnist_mlp.py (https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py).
The accuracy is very good, but I found an interesting thing: although val_acc keeps improving, val_loss doesn't converge; worse, after 20 epochs the val_loss is larger than the val_loss at the first epoch. I have attached my training log, please check! Could you tell me why this happens?
Thanks in advance!

It's due to over-fitting. You can observe that train_loss decreases all the way while val_loss starts increasing after epoch 7, which is a symptom of over-fitting.
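(As an illustration, not from the original comment: Keras's fit() returns a History object, so the divergence between the two curves can be printed directly. X_train/Y_train/X_test/Y_test are assumed to be the usual MNIST arrays.)

history = model.fit(X_train, Y_train, batch_size=128, nb_epoch=20,  # `epochs` in Keras 2
                    verbose=1, validation_data=(X_test, Y_test))
# Over-fitting shows up as diverging curves: training loss keeps falling
# while validation loss turns upward after some epoch.
for epoch, (tl, vl) in enumerate(zip(history.history['loss'],
                                     history.history['val_loss']), 1):
    print('epoch %2d: train_loss=%.4f  val_loss=%.4f' % (epoch, tl, vl))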
I DON'T believe you are over-fitting.
Your validation accuracy is improving.
That is a feature of online learning.
http://stackoverflow.com/questions/40910857/how-to-interpret-increase-in-both-loss-and-accuracy
You are doing a good job with regularization.
What happens is that you are using a log-loss, and the model is trying to reduce the average loss. There are two main ways of reducing an average: either you reduce the worst examples a lot (that's what mean squared error does), or you reduce every example a little (that's what log-loss does). By doing that, and by ensuring through sufficient regularization that you don't have enough parameters to overfit (i.e. memorise) your training examples, you learn to generalize well, and thus your model's performance on unseen data should keep improving.
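To make that concrete, here is a small numeric sketch (my own illustration, not from the thread): fixing one borderline mistake while becoming very confidently wrong on another example raises accuracy and the average log-loss at the same time.

import numpy as np

def avg_log_loss(y, p):
    # average binary cross-entropy; clip to avoid log(0)
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 1, 1])                    # four positive examples
p_early = np.array([0.6, 0.6, 0.4, 0.4])      # two borderline mistakes
p_late = np.array([0.95, 0.95, 0.95, 0.01])   # one very confident mistake

for name, p in [('early', p_early), ('late', p_late)]:
    acc = np.mean((p > 0.5) == y)
    print('%s: accuracy=%.2f  avg log-loss=%.3f' % (name, acc, avg_log_loss(y, p)))
# early: accuracy=0.50  avg log-loss=0.714
# late:  accuracy=0.75  avg log-loss=1.190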
I think I saw this explanation in one of those online courses, but I can't remember exactly where and couldn't find it again with a quick search.
In any case, when in doubt about when to stop, use a heuristic like early stopping; that way you will be able to compare your models more robustly.
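(For reference, a minimal early-stopping sketch in Keras; the patience value is an illustrative choice, and X_train/Y_train/X_test/Y_test are assumed to be the usual MNIST arrays:)

from keras.callbacks import EarlyStopping

# Halt training once val_loss has failed to improve for 5 consecutive epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=1)
model.fit(X_train, Y_train, batch_size=128, nb_epoch=100,  # `epochs` in Keras 2
          validation_data=(X_test, Y_test), callbacks=[early_stop])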
@unrealwill The objective is validation cross-entropy; you cannot tell whether the model is over-fitting from validation accuracy alone. Dropout of 0.2 is relatively weak for this model, so I don't think the regularization is strong enough.
Here is an experiment with the dropout rates set to 0.4 and 0.5 and the optimizer switched to Adam:
Epoch 1/20 loss: 0.3128 - acc: 0.9034 - val_loss: 0.1194 - val_acc: 0.9613
Epoch 2/20 loss: 0.1403 - acc: 0.9572 - val_loss: 0.0878 - val_acc: 0.9726
Epoch 3/20 loss: 0.1069 - acc: 0.9669 - val_loss: 0.0773 - val_acc: 0.9745
Epoch 4/20 loss: 0.0908 - acc: 0.9719 - val_loss: 0.0711 - val_acc: 0.9767
Epoch 5/20 loss: 0.0790 - acc: 0.9748 - val_loss: 0.0735 - val_acc: 0.9786
Epoch 6/20 loss: 0.0712 - acc: 0.9776 - val_loss: 0.0585 - val_acc: 0.9818
Epoch 7/20 loss: 0.0634 - acc: 0.9805 - val_loss: 0.0648 - val_acc: 0.9808
Epoch 8/20 loss: 0.0597 - acc: 0.9806 - val_loss: 0.0593 - val_acc: 0.9825
Epoch 9/20 loss: 0.0504 - acc: 0.9837 - val_loss: 0.0637 - val_acc: 0.9812
Epoch 10/20 loss: 0.0494 - acc: 0.9842 - val_loss: 0.0613 - val_acc: 0.9833
Epoch 11/20 loss: 0.0466 - acc: 0.9846 - val_loss: 0.0694 - val_acc: 0.9814
Epoch 12/20 loss: 0.0458 - acc: 0.9851 - val_loss: 0.0677 - val_acc: 0.9811
Epoch 13/20 loss: 0.0434 - acc: 0.9861 - val_loss: 0.0640 - val_acc: 0.9822
Epoch 14/20 loss: 0.0422 - acc: 0.9862 - val_loss: 0.0604 - val_acc: 0.9835
Epoch 15/20 loss: 0.0380 - acc: 0.9875 - val_loss: 0.0710 - val_acc: 0.9827
Epoch 16/20 loss: 0.0378 - acc: 0.9874 - val_loss: 0.0629 - val_acc: 0.9838
Epoch 17/20 loss: 0.0386 - acc: 0.9879 - val_loss: 0.0632 - val_acc: 0.9842
Epoch 18/20 loss: 0.0372 - acc: 0.9886 - val_loss: 0.0652 - val_acc: 0.9834
Epoch 19/20 loss: 0.0326 - acc: 0.9898 - val_loss: 0.0689 - val_acc: 0.9830
Epoch 20/20 loss: 0.0350 - acc: 0.9889 - val_loss: 0.0618 - val_acc: 0.9854
Test score: 0.061833061361
Test accuracy: 0.9854
@joelthchao I had always thought the goal for MNIST was to improve validation accuracy (i.e., the ultimate goal was to predict a single digit, rather than the distribution of uncertainty over digits). You are indeed correct: if your goal is to decrease validation loss (validation cross-entropy), then you need to regularize more.
@unrealwill @joelthchao Thanks for your kind replies. As joelthchao suggested, I modified the original script as follows:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

model = Sequential()
model.add(Dense(512, input_shape=(784,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))  # raised from 0.2
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))  # raised from 0.2
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer='adam',  # switched to Adam as suggested
              metrics=['accuracy'])
But I can still observe that val_loss first decreases and then increases after some epochs; I cannot reproduce @joelthchao's reported result.
@unrealwill @joelthchao One more question: after modifying the dropout parameters and optimizer options, if I can still observe over-fitting, can I use a callback to save the best model? Namely, I just run 100 epochs and then select the smallest val_loss or the biggest val_acc using one of the following:
ModelCheckpoint(kfold_weights_path, monitor='val_loss', save_best_only=True, verbose=0),
or
ModelCheckpoint(kfold_weights_path, monitor='val_acc', save_best_only=True, verbose=0),
@GatechLW
Over-fitting is inevitable once your model is too powerful; it is a trade-off between fitting the training data and generalizing.
And yes, you can do that. Early stopping is another way to prevent over-fitting.
@joelthchao OK, your explanation is clear, thanks! One thing about ModelCheckpoint still confuses me: I think early stopping is different from ModelCheckpoint(kfold_weights_path, monitor='val_loss', save_best_only=True, verbose=0).
For example, if you set the total number of epochs to 100, early stopping might halt once the loss has increased beyond your tolerance; e.g., it might run only 50 epochs and keep the last epoch's parameters as the final network parameters.
ModelCheckpoint, however, will run all 100 epochs and automatically save the parameters that gave the best loss during those 100 epochs.
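(A minimal sketch contrasting the two, assuming the same kfold_weights_path as above; the patience value is illustrative, and the two callbacks can also be combined in one list:)

from keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Halts training early when val_loss stops improving.
    EarlyStopping(monitor='val_loss', patience=10, verbose=1),
    # Keeps training, but overwrites the file only when val_loss improves.
    ModelCheckpoint(kfold_weights_path, monitor='val_loss',
                    save_best_only=True, verbose=0),
]
model.fit(X_train, Y_train, batch_size=128, nb_epoch=100,  # `epochs` in Keras 2
          validation_data=(X_test, Y_test), callbacks=callbacks)
# Reload the best weights instead of the last-epoch weights:
model.load_weights(kfold_weights_path)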
You can check out this gist:
https://gist.github.com/dusenberrymw/89bc12a8f9a9afaacdb91668abe4065d
I think it over-fits, so the predicted probabilities end up close to the critical value (maybe 0.5 is your critical value); the loss gets higher while the accuracy stays flat.