I want to be able to stop the trainer at any point based on arbitrary logic.
provide a hook to stop training at any point -- not just at epoch end, and not just when a metric stops improving
The current EarlyStoppingCallback only lets you stop at the end of an epoch, and only when a metric stops improving by some amount. Other than that, the only way I can see to stop training is to throw an exception and then call training teardown.
See https://pytorch-lightning.slack.com/archives/CQXV8BRH9/p1586281898171700
I want a function in the trainer like "end training" that I can call from anywhere in lightning which will cause training to stop, say, at the end of the current batch.
Throw an exception? Force the batch to return -1?
Hi! Thanks for your contribution, great first issue!
Use a callback for that, and end training in on_batch_end.
https://pytorch-lightning.readthedocs.io/en/0.7.1/callbacks.html
thanks for the quick reply!
However... what should I do inside my callback? As far as I can tell there is no way to get training to actually stop. The way EarlyStoppingCallback does it is hard-coded into the training loop, in a way we couldn't replicate from on_batch_end.
Simplest way to fix it, I think, would be to add a boolean flag on the trainer, something like "stop_requested": you check it after every batch and return from train_epoch if it's set, then check it again after each epoch.
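Roughly what I have in mind, as a purely illustrative sketch (MiniTrainer and all the names here are made up, not the real Trainer internals):

# Illustrative sketch of the proposed flag -- not the actual Trainer code.
class MiniTrainer:
    def __init__(self, max_epochs=3, num_batches=5):
        self.max_epochs = max_epochs
        self.num_batches = num_batches
        self.stop_requested = False  # anything (callback, hook, user code) can flip this to True

    def run_training_batch(self, batch_idx):
        print(f"training batch {batch_idx}")  # stand-in for the real optimization step

    def train_epoch(self):
        for batch_idx in range(self.num_batches):
            self.run_training_batch(batch_idx)
            if self.stop_requested:  # checked after every batch
                return

    def fit(self):
        for epoch in range(self.max_epochs):
            self.train_epoch()
            if self.stop_requested:  # checked again after each epoch
                break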
kind of duct tape but i'm happy to submit a PR
@PyTorchLightning/core-contributors
Oh, the other way would be to just throw a custom exception. That's what's done for ctrl-c currently, and it's what I'll be using for my own use case for now.
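For reference, a sketch of what that could look like from a callback (StopTrainingCallback is a name I made up, and I'm assuming the on_batch_end(trainer, pl_module) callback signature from the 0.7.x docs):

from pytorch_lightning import Callback

class StopTrainingCallback(Callback):
    def on_batch_end(self, trainer, pl_module):
        # stop_now is just an illustrative attribute; substitute any arbitrary condition
        if getattr(pl_module, "stop_now", False):
            # the trainer already catches KeyboardInterrupt (same path as ctrl-c) and runs teardown
            raise KeyboardInterrupt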
The docs for training_step state that you can return a -1 to stop the current loop. Could this be useful in your case?
Ah ha! I think you've led me to the solution here, although now I think the docs are wrong.
Relevant code
# ---------------
# RUN TRAIN STEP
# ---------------
output = self.run_training_batch(batch, batch_idx)
batch_result, grad_norm_dic, batch_step_metrics = output
# when returning -1 from train_step, we end epoch early
early_stop_epoch = batch_result == -1
So we need a -1 in the first output of run_training_batch. I don't think returning -1 from training_step does that. However, when I was looking into this before, I failed to notice that you CAN force it to return -1 by returning -1 from on_batch_start.
so I think that's the solution -- to keep returning -1 from on_batch_start (may have to be a bunch of times depending on how many epochs you have)
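For anyone else landing here, a rough sketch of what I mean (assuming the on_batch_start hook from the 0.7.x model hooks docs; stop_now is just an attribute I made up):

import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... your usual training_step / configure_optimizers / dataloaders go here ...

    def on_batch_start(self, batch):
        # returning -1 here makes run_training_batch return -1, which ends the epoch early;
        # keep the flag set so every subsequent epoch gets cut short right away too
        if getattr(self, "stop_now", False):
            return -1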
Can confirm this works. Might be nice to put that in the docs for early stopping as a third alternative, if no other changes will be made?
As mentioned on Slack, @david-alexander-white is very welcome to send a PR with the doc update :rabbit:
Although I agree that it would increase flexibility if there were such a flag that is checked frequently.
@justusschock what’s the benefit of that versus raising some sort of exception to stop training? right now if you raise KeyboardInterrupt it should gracefully exit training. we can always have a custom exception too
@jeremyjordan I simply don't like using exceptions for something that is somewhat expected (like a graceful shutdown or some kind of early stopping). IMO exceptions are there to handle errors...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I added a Trainer attribute should_stop in #1504 which you can set to True to stop training after the current batch. Once that PR is merged, you should be able to use it from anywhere (callbacks, model hooks, etc.).
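Once that's merged, usage from a callback could look roughly like this (ArbitraryStop and the condition are just placeholders):

from pytorch_lightning import Callback

class ArbitraryStop(Callback):
    def on_batch_end(self, trainer, pl_module):
        if getattr(pl_module, "please_stop", False):  # replace with any arbitrary logic
            trainer.should_stop = True  # trainer finishes the current batch, then stops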