Provide the ability to automatically stop training when performance on validation data deteriorates or ceases to improve (for a given number of consecutive epochs or steps).
Discussed in https://github.com/mozilla/DeepSpeech/issues/534 .
Hey @tilmankamp, can I (try to) help you with this?
Sure. Thanks for helping out! One thing: Tomorrow I'll put up a PR that will provide a routine for stopping a running training. It turned out to be much harder to do in a graceful way than I thought. You can use it for the stopping. I'll link it here.
Unfortunately there is no routine for ending training in my PR (#616).
However, stopping has to happen from within the training coordinator state machine anyway. This is quite a beast, I know.
@tilmankamp: There is a TF monitor that facilitates early stopping, "ValidationMonitor", but so far I have only seen a classifier's fit method take monitors as an argument, and our code has no such calls. Can you suggest any pointers or APIs?
Or do we need to implement our own version?
ValidationMonitor is part of tf.contrib.learn, TensorFlow's high-level learning API, which we don't use. So yes, we have to implement it ourselves. The training coordinator has to take this responsibility, as it receives the calculated losses per epoch. It would have to monitor them, and it also has the power to stop the training in a graceful way.
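As a rough illustration of the monitoring described above, here is a minimal sketch, not the actual coordinator code. The class name, the `patience` and `min_delta` parameters, and the helper function are all hypothetical; the sketch only assumes that something hands it one validation loss per epoch.

```python
class EarlyStoppingMonitor:
    """Tracks per-epoch validation losses and signals when to stop.

    Hypothetical helper for illustration; not part of DeepSpeech or TensorFlow.
    """

    def __init__(self, patience=3, min_delta=0.0):
        if patience < 1:
            raise ValueError("patience must be at least 1")
        self.patience = patience    # consecutive epochs without improvement tolerated
        self.min_delta = min_delta  # smallest decrease that counts as an improvement
        self.best_loss = float("inf")
        self.stale_epochs = 0

    def update(self, validation_loss):
        """Record one validation loss; return True if training should stop."""
        if validation_loss < self.best_loss - self.min_delta:
            # The loss improved: remember it and reset the counter.
            self.best_loss = validation_loss
            self.stale_epochs = 0
        else:
            # The loss stagnated or went up again.
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


def epoch_of_stop(losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which the monitor asks to stop, or None."""
    monitor = EarlyStoppingMonitor(patience, min_delta)
    for epoch, loss in enumerate(losses, start=1):
        if monitor.update(loss):
            return epoch
    return None
```

In such a design the coordinator would call `update` once per finished validation epoch and start its graceful shutdown when the call returns True. How that shutdown is triggered inside the state machine is not covered here.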
@tilmankamp: As I understand it, the early stopping criterion would be based on the loss on the validation dataset, analyzed over the past n_steps, right? Please correct me if I'm wrong.
PS: I tried aggregating the losses from the jobs with set_name "dev", but while training the model, the batches seem to run continuously over the TED training set: after more than 1000 batch sets, the debug logs still show only the train set being used.
If the above understanding is incorrect, then how would we get to know whether the model has crossed the generalization threshold and started overfitting?
Log Snippet:
D Finished batch step 1385.
D Sending Job (ID: 1084, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 1085, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
D Finished batch step 1386.
D Sending Job (ID: 1085, worker: 0, epoch: 0, set_name: train)...
D Computing Job (ID: 1086, worker: 0, epoch: 0, set_name: train)...
D Starting batch...
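To check whether any "dev" jobs show up at all, the log lines can be tallied per epoch and set_name. This is only a sketch: the regular expression is derived from the "Sending Job" lines in the snippet above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the "Sending Job (...)" debug lines shown above.
JOB_LINE = re.compile(
    r"Sending Job \(ID: (?P<id>\d+), worker: (?P<worker>\d+), "
    r"epoch: (?P<epoch>\d+), set_name: (?P<set_name>\w+)\)"
)


def count_jobs_by_set(log_lines):
    """Count sent jobs per (epoch, set_name) pair; hypothetical helper."""
    counts = Counter()
    for line in log_lines:
        match = JOB_LINE.search(line)
        if match:
            key = (int(match.group("epoch")), match.group("set_name"))
            counts[key] += 1
    return counts
```

If the count for every `(epoch, "dev")` pair stays at zero, no validation jobs have been sent so far.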
Yes, early stopping works like that: it monitors the validation loss and stops training if the loss goes up again.
To be able to monitor validation loss, you need validation steps configured in your parameters. Otherwise there will be no validation steps, and automatic early stopping cannot work.
Your code should probably log a warning if the user specifies early stopping but no validation step.
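Such a check could look roughly like the sketch below. The parameter names `early_stop` and `validation_step` are hypothetical stand-ins for whatever the real flags are called, and the sketch assumes that a validation step of zero or less means "no validation".

```python
import logging

log = logging.getLogger("early_stopping_sketch")


def early_stopping_enabled(early_stop, validation_step):
    """Return True only if early stopping can actually work.

    Both parameter names are hypothetical. validation_step <= 0 is
    assumed to mean that no validation steps are configured.
    """
    if not early_stop:
        return False
    if validation_step <= 0:
        # Without validation steps there is no validation loss to monitor.
        log.warning(
            "Early stopping requested, but no validation step is configured; "
            "early stopping is disabled."
        )
        return False
    return True
```

Returning False rather than raising keeps the training run going without early stopping, which matches the idea of a warning instead of a hard error.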