Pytorch-lightning: provide a hook to stop training at any point

Created on 7 Apr 2020 · 18 comments · Source: PyTorchLightning/pytorch-lightning

🚀 Feature

I want to be able to stop the trainer at any point based on arbitrary logic.
Provide a hook to stop training at any point -- not just at epoch end, and not just when a metric stops improving.

Motivation

The current EarlyStoppingCallback only lets you stop at the end of an epoch, and only when a metric stops improving by some amount. Other than that, the only way I can see to stop training is to throw an exception and then call the training teardown.

See https://pytorch-lightning.slack.com/archives/CQXV8BRH9/p1586281898171700

Pitch

I want a function on the trainer, something like "end training", that I can call from anywhere in Lightning to make training stop, say, at the end of the current batch.

Alternatives

Throw an exception? Force the batch to return -1?

Additional context

Labels: enhancement, help wanted, won't fix

Most helpful comment

@justusschock what’s the benefit of that versus raising some sort of exception to stop training? right now if you raise KeyboardInterrupt it should gracefully exit training. we can always have a custom exception too

All 18 comments

Hi! Thanks for your contribution, great first issue!

Use a callback for that, and end training in on_batch_end.

https://pytorch-lightning.readthedocs.io/en/0.7.1/callbacks.html
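
For concreteness, a minimal skeleton of such a callback might look like the sketch below (hook signatures vary a bit across versions; here I assume on_batch_end receives the trainer and the LightningModule, and stop_condition is a hypothetical user-supplied callable). Note there is nothing obvious to call inside the hook to actually halt training, which is exactly the question raised next.

    from pytorch_lightning.callbacks import Callback

    class StopAnytimeCallback(Callback):
        """Hypothetical callback that decides after every batch whether training should end."""

        def __init__(self, stop_condition):
            # stop_condition: user-supplied callable returning True when training should end
            self.stop_condition = stop_condition

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                pass  # ...but what do we call here to actually stop the trainer?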

Thanks for the quick reply!

However... what should I do inside my callback? As far as I can tell, there is no way to get training to actually stop. The way EarlyStoppingCallback does it is hard-coded into the training loop, in a way we couldn't replicate in on_batch_end.

The simplest way to fix it, I think, would be to add a boolean flag to the trainer, something like "stop_requested": you check it after every batch, and if it's set you return from the training epoch early, and you check it again after each epoch.
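
A rough sketch of that proposal (plain pseudocode of a training loop, not actual Lightning internals; stop_requested is the hypothetical flag):

    # Sketch of the proposed flag check, not real Lightning code.
    class SketchTrainer:
        def __init__(self, train_dataloader, max_epochs):
            self.train_dataloader = train_dataloader
            self.max_epochs = max_epochs
            self.stop_requested = False  # hypothetical flag, settable from anywhere

        def run_training_batch(self, batch, batch_idx):
            ...  # forward, backward, optimizer step

        def train(self):
            for epoch in range(self.max_epochs):
                for batch_idx, batch in enumerate(self.train_dataloader):
                    self.run_training_batch(batch, batch_idx)
                    if self.stop_requested:  # checked after every batch
                        break                # end the current epoch early
                if self.stop_requested:      # checked again after each epoch
                    return                   # stop training entirely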

Kind of duct tape, but I'm happy to submit a PR.

@PyTorchLightning/core-contributors

Oh, the other way would be to just throw a custom exception. That's what's done for Ctrl-C currently, and what I'll be using for my own use case for now.
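
A sketch of that interim workaround (KeyboardInterrupt is what the trainer already handles gracefully for Ctrl-C, as noted later in this thread; the callback name and the stop_condition callable are made up for illustration):

    from pytorch_lightning.callbacks import Callback

    class ExceptionStopCallback(Callback):
        """Ends training by raising the same exception the trainer already handles for Ctrl-C."""

        def __init__(self, stop_condition):
            self.stop_condition = stop_condition  # user-supplied callable

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                # KeyboardInterrupt is caught by the trainer and training is torn down gracefully;
                # a custom exception would instead have to be caught around trainer.fit() yourself.
                raise KeyboardInterrupt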

The docs for training_step state that you can return a -1 to stop the current loop. Could this be useful in your case?

Ah ha! I think you've led me to the solution here, although now I think the docs are wrong.

Relevant code

            # ---------------
            # RUN TRAIN STEP
            # ---------------
            output = self.run_training_batch(batch, batch_idx)
            batch_result, grad_norm_dic, batch_step_metrics = output

            # when returning -1 from train_step, we end epoch early
            early_stop_epoch = batch_result == -1

So we need a -1 in the first output of run_training_batch. I don't think returning -1 from training_step does that. However, when I was looking into this before, I failed to notice that you CAN force it to return -1 by returning -1 from on_batch_start.

So I think that's the solution: keep returning -1 from on_batch_start (you may have to do it a bunch of times, depending on how many epochs you have).
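
A sketch of that trick, assuming the on_batch_start hook lives on the LightningModule as in the loop quoted above (the hook signature may differ slightly across versions), with a made-up stop_training flag you flip from your own code:

    import pytorch_lightning as pl

    class MyModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.stop_training = False  # hypothetical flag; set it to True anywhere (e.g. in training_step)

        def on_batch_start(self, batch):
            # Returning -1 here makes run_training_batch return -1, which ends the current epoch early.
            # Keep returning -1 so every subsequent epoch is also cut short immediately.
            if self.stop_training:
                return -1

        # ... the usual training_step, configure_optimizers, train_dataloader, etc. ...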


Can confirm this works. Might be nice to put that in the docs for early stopping as a third alternative, if no other changes will be made?

As mentioned on slack, @david-alexander-white very welcome to send a PR with doc update :rabbit:

Although I agree that it would increase flexibility if there were such a flag that was checked frequently.

@justusschock what’s the benefit of that versus raising some sort of exception to stop training? right now if you raise KeyboardInterrupt it should gracefully exit training. we can always have a custom exception too

@jeremyjordan I simply don't like using exceptions for something that is somewhat expected (like a graceful shutdown or some kind of early stopping). IMO exceptions are just for handling errors correctly...

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I added a Trainer attribute should_stop in #1504, which you can set to True to stop training after the current batch. Once that PR is merged, you should be able to use it from anywhere (callbacks, model hooks, etc.).
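
Once that lands, a callback-based stop reduces to flipping that attribute. A sketch (StopOnCondition and stop_condition are made-up names; should_stop is the attribute described above):

    from pytorch_lightning.callbacks import Callback

    class StopOnCondition(Callback):
        """Sets trainer.should_stop so the trainer exits after the current batch."""

        def __init__(self, stop_condition):
            self.stop_condition = stop_condition  # user-supplied callable

        def on_batch_end(self, trainer, pl_module):
            if self.stop_condition(trainer, pl_module):
                trainer.should_stop = True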

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
