Pytorch-lightning: provide a hook to stop training at any point

Created on 7 Apr 2020 · 18 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

I want to be able to stop the trainer at any point based on arbitrary logic.
Provide a hook to stop training at any point -- not just at epoch end, and not just when a metric stops improving.

Motivation

The current EarlyStoppingCallback only lets you stop at the end of an epoch, and only when a metric stops improving by some amount. Other than that, the only way I can see to stop training is to throw an exception and then call the training teardown.

See https://pytorch-lightning.slack.com/archives/CQXV8BRH9/p1586281898171700

Pitch

I want a function on the trainer, something like "end training", that I can call from anywhere in Lightning to make training stop, say, at the end of the current batch.

Alternatives

Throw an exception? Force the batch to return -1?

Additional context

Labels: enhancement, help wanted, won't fix

Most helpful comment

@justusschock what’s the benefit of that versus raising some sort of exception to stop training? right now if you raise KeyboardInterrupt it should gracefully exit training. we can always have a custom exception too

All 18 comments

Hi! Thanks for your contribution, great first issue!

Use a callback for that, and end training in on_batch_end.

https://pytorch-lightning.readthedocs.io/en/0.7.1/callbacks.html
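
For concreteness, a minimal skeleton of such a callback might look like the sketch below (hook signatures vary a bit across versions; here I assume on_batch_end receives the trainer and the LightningModule, and stop_condition is a hypothetical user-supplied callable). Note there is nothing obvious to call inside the hook to actually halt training, which is exactly the question raised next.

    from pytorch_lightning.callbacks import Callback

    class StopAnytimeCallback(Callback):
        """Hypothetical callback that decides after every batch whether training should end."""

        def __init__(self, stop_condition):
            # stop_condition: user-supplied callable returning True when training should end
            self.stop_condition = stop_condition

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                pass  # ...but what do we call here to actually stop the trainer?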

Thanks for the quick reply!

However... what should I do inside my callback? As far as I can tell, there is no way to get training to actually stop. The way EarlyStoppingCallback does it is hard-coded into the training loop, in a way we couldn't replicate in on_batch_end.

The simplest way to fix it, I think, would be to add a boolean flag to the trainer, something like "stop_requested": you check it after every batch, and if it's set you return from the training epoch early, and you check it again after each epoch.
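
A rough sketch of that proposal (plain pseudocode of a training loop, not actual Lightning internals; stop_requested is the hypothetical flag):

    # Sketch of the proposed flag check, not real Lightning code.
    class SketchTrainer:
        def __init__(self, train_dataloader, max_epochs):
            self.train_dataloader = train_dataloader
            self.max_epochs = max_epochs
            self.stop_requested = False  # hypothetical flag, settable from anywhere

        def run_training_batch(self, batch, batch_idx):
            ...  # forward, backward, optimizer step

        def train(self):
            for epoch in range(self.max_epochs):
                for batch_idx, batch in enumerate(self.train_dataloader):
                    self.run_training_batch(batch, batch_idx)
                    if self.stop_requested:  # checked after every batch
                        break                # end the current epoch early
                if self.stop_requested:      # checked again after each epoch
                    return                   # stop training entirely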

Kind of duct tape, but I'm happy to submit a PR.

@PyTorchLightning/core-contributors

Oh, the other way would be to just throw a custom exception. That's what's done for Ctrl-C currently, and what I'll be using for my own use case for now.
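
A sketch of that interim workaround (KeyboardInterrupt is what the trainer already handles gracefully for Ctrl-C, as noted later in this thread; the callback name and the stop_condition callable are made up for illustration):

    from pytorch_lightning.callbacks import Callback

    class ExceptionStopCallback(Callback):
        """Ends training by raising the same exception the trainer already handles for Ctrl-C."""

        def __init__(self, stop_condition):
            self.stop_condition = stop_condition  # user-supplied callable

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                # KeyboardInterrupt is caught by the trainer and training is torn down gracefully;
                # a custom exception would instead have to be caught around trainer.fit() yourself.
                raise KeyboardInterrupt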

The docs for training_step state that you can return a -1 to stop the current loop. Could this be useful in your case?

Ah ha! I think you've led me to the solution here, although now I think the docs are wrong.

Relevant code

            # ---------------
            # RUN TRAIN STEP
            # ---------------
            output = self.run_training_batch(batch, batch_idx)
            batch_result, grad_norm_dic, batch_step_metrics = output

            # when returning -1 from train_step, we end epoch early
            early_stop_epoch = batch_result == -1

So we need a -1 in the first output of run_training_batch. I don't think returning -1 from training_step does that. However, when I was looking into this before, I failed to notice that you CAN force it to return -1 by returning -1 from on_batch_start.

So I think that's the solution: keep returning -1 from on_batch_start (you may have to do it a bunch of times, depending on how many epochs you have).
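
A sketch of that trick, assuming the on_batch_start hook lives on the LightningModule as in the loop quoted above (the hook signature may differ slightly across versions), with a made-up stop_training flag you flip from your own code:

    import pytorch_lightning as pl

    class MyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.stop_training = False  # hypothetical flag; set it to True anywhere (e.g. in training_step)

        def on_batch_start(self, batch):
            # Returning -1 here makes run_training_batch return -1, which ends the current epoch early.
            # Keep returning -1 so every subsequent epoch is also cut short immediately.
            if self.stop_training:
                return -1

        # ... the usual training_step, configure_optimizers, train_dataloader, etc. ...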


Can confirm this works. Might be nice to put that in the docs for early stopping as a third alternative, if no other changes will be made?

As mentioned on slack, @david-alexander-white very welcome to send a PR with doc update :rabbit:

Although I agree that it would increase flexibility if there were such a flag that was checked frequently.

@justusschock what’s the benefit of that versus raising some sort of exception to stop training? right now if you raise KeyboardInterrupt it should gracefully exit training. we can always have a custom exception too

@jeremyjordan I simply don't like using exceptions for something that is somewhat expected (like a graceful shutdown or some kind of early stopping). IMO exceptions are just for handling errors correctly...

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I added a Trainer attribute should_stop in #1504, which you can set to True to stop training after the current batch. Once that PR is merged, you should be able to use it from anywhere (callbacks, model hooks, etc.).
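
Once that lands, a callback-based stop reduces to flipping that attribute. A sketch (StopOnCondition and stop_condition are made-up names; should_stop is the attribute described above):

    from pytorch_lightning.callbacks import Callback

    class StopOnCondition(Callback):
        """Sets trainer.should_stop so the trainer exits after the current batch."""

        def __init__(self, stop_condition):
            self.stop_condition = stop_condition  # user-supplied callable

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                trainer.should_stop = True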

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
