Incubator-mxnet: How to implement early stopping?

Created on 6 Nov 2016 · 10 Comments · Source: apache/incubator-mxnet

I have seen this question asked before, but no answer has emerged.
By early stopping I mean stopping the training when the validation error has not decreased for a given number of epochs.
I ask because the validation set is meant to control overfitting, and without early stopping this is not possible.
In the attachment you can see the training curve in blue and the validation curve in red. The model has clearly overfit: the training error goes down while the validation error rises.
The plot shows 1 - acc, to make the overfitting easier to see.

attachment

ghost

👍2

All 10 comments

To be clear, I want to implement it in Python.
It was fairly easy to implement a callback function that logs the training history, but I can't stop the training from Python.

I tried:

def early_stopping_callback(period, early_stop_epochs):
    def _callback(epoch, symbol, arg_params, aux_params):
        return False
    return _callback

I wanted to see whether the training would stop after one epoch, but it did not.
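
For reference, the patience rule from the question (stop once the validation error has not improved for a given number of epochs) can be sketched independently of MXNet. This is only an illustration under assumptions: `get_val_error` is a hypothetical callable that returns the latest validation error, and the loop at the bottom is a stand-in for a training loop that honors the callback's return value, which the stock fit loop in this thread does not.

```python
def early_stopping_callback(get_val_error, early_stop_epochs):
    """Build an epoch-end callback that returns False once the validation
    error has not improved for `early_stop_epochs` consecutive epochs.

    `get_val_error` is a hypothetical zero-argument callable (not an MXNet
    API) that returns the most recent validation error.
    """
    state = {"best": float("inf"), "bad_epochs": 0}

    def _callback(epoch, symbol, arg_params, aux_params):
        err = get_val_error()
        if err < state["best"]:
            # New best validation error: reset the patience counter.
            state["best"] = err
            state["bad_epochs"] = 0
        else:
            state["bad_epochs"] += 1
        # True means "keep training", False means "stop".
        return state["bad_epochs"] < early_stop_epochs

    return _callback


# Simulated run: the validation error improves, then rises (overfitting).
val_errors = [0.40, 0.30, 0.25, 0.26, 0.27, 0.28, 0.29]
history = iter(val_errors)
callback = early_stopping_callback(lambda: next(history), early_stop_epochs=3)

stopped_at = None
for epoch in range(len(val_errors)):
    # Stand-in for a training loop that checks the callback's return value.
    if not callback(epoch, None, None, None):
        stopped_at = epoch
        break
```

In this simulated run the best error is reached at epoch 2, and the callback asks to stop at epoch 5, after three epochs without improvement.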

ghost on 6 Nov 2016

Use a monitor and checkpointing. Monitor the parameter and, once it fulfills some condition, save the model.
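
A rough sketch of that idea, assuming the monitored quantity is the validation error and the condition is "better than anything seen so far". Both `get_val_error` and `save_fn` are hypothetical placeholders, not MXNet functions; in practice `save_fn` would wrap whatever checkpointing call you use. Note that this only keeps the best model, it does not stop the training.

```python
def save_best_callback(get_val_error, save_fn):
    """Build an epoch-end callback that saves the model only when the
    validation error improves on the best value seen so far.

    `get_val_error` and `save_fn` are hypothetical callables supplied by
    the caller; neither is part of MXNet.
    """
    best = {"error": float("inf"), "epoch": None}

    def _callback(epoch, symbol, arg_params, aux_params):
        err = get_val_error()
        if err < best["error"]:
            best["error"] = err
            best["epoch"] = epoch
            save_fn(epoch, symbol, arg_params, aux_params)

    # Expose the tracked state so the caller can see which epoch was kept.
    _callback.best = best
    return _callback


# Simulated run: record which epochs would have been checkpointed.
saved_epochs = []
errors = iter([0.40, 0.30, 0.25, 0.26, 0.27])
checkpoint_cb = save_best_callback(
    lambda: next(errors),
    lambda epoch, *rest: saved_epochs.append(epoch),
)
for epoch in range(5):
    checkpoint_cb(epoch, None, None, None)
```

After the simulated run, the last saved checkpoint corresponds to the epoch with the lowest validation error, so the overfit epochs that follow never overwrite it.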

VoVAllen on 6 Nov 2016

👍4

Thanks for the reply.
I could do that, but it does not stop the training, and I would have to wait for the training to finish even knowing it will be useless.
I would also like to do something similar to stop when the training error has converged.
Take the example above: the algorithm converged very quickly and I had to wait for the training to finish.

ghost on 6 Nov 2016

In the R code this is pretty simple and is already implemented.
I would just like to know why we don't have something similar in Python.
If the worry is backward compatibility, the condition could just check for None. Or is there some reason it is the way it is?

model <- mx.model.extract.model(symbol, train.execs)

epoch_continue <- TRUE
if (!is.null(epoch.end.callback)) {
  epoch_continue <- epoch.end.callback(iteration, 0, environment(), verbose = verbose)
}

if (!epoch_continue) {
  break
}

ghost on 7 Nov 2016

I've read through some of the code in the repo. It seems you can stop the training by returning False from the epoch_end_callback function, but I'm not sure whether this works.

VoVAllen on 8 Nov 2016

No, it does not work. The relevant code is in the base_module class, in the file of the same name, so I made a simple modification there so that the training can actually be stopped.

The original code is:

for callback in _as_list(epoch_end_callback):
    callback(epoch, self.symbol, arg_params, aux_params)

For now I did this simple modification (against my better judgement):

for callback in _as_list(epoch_end_callback):
    if not callback(epoch, self.symbol, arg_params, aux_params):
        return

It does the job, but I would like to know from one of the developers whether this can be implemented in the official repo (it is literally two lines of code). It is never a good idea to change code in third-party packages.
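
One caveat with the two-line change above: `if not callback(...)` also stops the training when a callback returns nothing (None), which is what a plain logging or checkpointing callback does. A backward-compatible variant, along the lines of the "check for None" idea mentioned earlier, might look like the sketch below. It is a standalone illustration, not the code in the repo: `run_epoch_end_callbacks` is a hypothetical helper, and `_as_list` here is a simplified stand-in for the helper used in base_module.

```python
def _as_list(obj):
    """Simplified stand-in for the helper used in base_module:
    wrap a single callback in a list."""
    return list(obj) if isinstance(obj, (list, tuple)) else [obj]


def run_epoch_end_callbacks(epoch_end_callback, epoch, symbol,
                            arg_params, aux_params):
    """Run every epoch-end callback and report whether to keep training.

    Only an explicit False requests a stop, so callbacks that return
    None (the usual case) keep their current behavior.
    """
    keep_training = True
    for callback in _as_list(epoch_end_callback):
        if callback(epoch, symbol, arg_params, aux_params) is False:
            keep_training = False
    return keep_training


calls = []

def log_only(epoch, symbol, arg_params, aux_params):
    # Typical existing callback: does its work and returns None.
    calls.append(("log", epoch))

def stop_now(epoch, symbol, arg_params, aux_params):
    # Callback that explicitly asks for the training to stop.
    calls.append(("stop", epoch))
    return False


continue_with_logger = run_epoch_end_callbacks(log_only, 0, None, {}, {})
continue_with_stopper = run_epoch_end_callbacks([stop_now, log_only], 1, None, {}, {})
```

The sketch deliberately runs all callbacks before stopping, so a checkpoint callback listed after the stopping one still gets to save the final parameters; the fit loop would then return when the helper reports False.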

ghost on 8 Nov 2016

@piiswrong

VoVAllen on 8 Nov 2016

No comment on this issue? @piiswrong

ghost on 17 Nov 2016

I have a tentative solution that works but, as you said, you have to modify base_module.py in mxnet. The repo has the test app, and the Gist (see the readme.md) has the base_module changes. It's for mxnet version 0.9.4.
https://github.com/kperkins411/mxnet_demo_earlystopping