Right now, due to #795, on_training_end runs after a KeyboardInterrupt. This ends up running code that's meant to be run only after successful training completion. This feature should either be reverted, or an alternative should be provided, so as to run some code only after successful training.
I am training a model on SageMaker and have added notebook shutdown code within the on_training_end method. There were times when I had to manually cancel my model training because some parameters were incorrect. However, if I do that, the notebook shuts down immediately, because the on_training_end method runs even after a KeyboardInterrupt. I don't want my notebook shutting down after a keyboard interrupt, only after successful training completion.
Maybe add an on_training_completed method for code that's meant to be run after successful training.
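A minimal sketch of this setup (simplified; shutdown_notebook() is just a stand-in for my actual shutdown call, and the hook name is the one mentioned above):

import subprocess

import pytorch_lightning as pl


def shutdown_notebook():
    # stand-in helper: stop the notebook instance this training runs on
    subprocess.run(["sudo", "shutdown", "now"], check=False)


class MyModel(pl.LightningModule):
    ...

    def on_training_end(self):
        # problem: per #795 this hook also runs after a KeyboardInterrupt,
        # so manually cancelling a run shuts the notebook down too
        shutdown_notebook()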
Hi! Thanks for your contribution, great first issue!
Good point, it makes sense to skip the eval of an unfinished training run, e.g. when a user interrupts it because they don't want to wait another hour lol. Mind sending a PR? :robot:
Thanks for bringing this use-case to our attention @lezwon , definitely something we'd want to consider.
What if the Trainer object had a status property? Compute jobs typically have a status like {PENDING, RUNNING, COMPLETE, FAILED}, and we could apply the same idea to the training job. We could then differentiate the two cases with a status of INTERRUPTED vs COMPLETED, and your callback logic can check for the proper status before closing (e.g. shut down the notebook on FAILED/COMPLETED but don't shut down for INTERRUPTED).
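For example, your hook could look something like this (trainer.status is the attribute being proposed here, not an existing API, and shutdown_notebook() is the stand-in helper from the sketch above):

def on_training_end(self):
    # self.trainer.status is the proposed status attribute (hypothetical)
    if self.trainer.status in ("COMPLETED", "FAILED"):
        shutdown_notebook()
    # on "INTERRUPTED" we do nothing, so a manual cancel keeps the notebook alive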
What do you think?
@jeremyjordan that sounds great. We could definitely do that.
Not sure if we really need to add complexity to the existing callbacks...
Also, completing training is the standard behavior, so maybe rather add some cleanup for the interrupted case?
@jeremyjordan status sounds cool, in my opinion it is a much better signal than the 0/1 returned from e.g. .fit =)
@Borda are you suggesting that we add a new callback (on_training_completed that @lezwon mentioned) which runs conditionally according to the value returned from fit()? My only worry is that it might be confusing to know the difference between on_training_end (existing) and on_training_complete (proposed).
at the moment I was just thinking about adding a Trainer status...
not sure if we want to add complexity with a new callback...
Ok yes I agree, I think the Trainer status would be a simple solution that may also be useful in other situations :)
Cool, would anyone like to send a PR with the Trainer status? I'd suggest implementing it as an enum/numbering, similar to logging.INFO/DEBUG/...
Will this status also account for validation/test completed/interrupted?
yes we could do something like:
from enum import Enum

class TrainerStatus(Enum):
    PENDING = "Initializing Trainer object"
    TRAINING = "Optimizing model via train loop"
    VALIDATING = "Running model on validation set"
    TESTING = "Running model on test set"
    FAILED = "Trainer failed to complete a successful run"
    INTERRUPTED = "Training was interrupted by the user"
    COMPLETED = "Training completed successfully"
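To illustrate, the trainer could move through these states roughly like this (a sketch only; _run_train_loop is a placeholder, not an actual Trainer method):

def fit(self, model):
    self.status = TrainerStatus.TRAINING
    try:
        # the loop itself would switch to VALIDATING / TESTING as it goes
        self._run_train_loop(model)
        self.status = TrainerStatus.COMPLETED
    except KeyboardInterrupt:
        self.status = TrainerStatus.INTERRUPTED
    except Exception:
        self.status = TrainerStatus.FAILED
        raise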
I can work on getting this into a PR
Hey @jeremyjordan, what if we would like to execute different actions if the trainer failed during a validation task? Can we find out which task it failed at from the status? Just wondering if this would be flexible in such a scenario.
@lezwon do you have an example in mind of where you'd need that?
i'd prefer to keep the implementation simple and generic enough such that we don't keep adding new status types.
Don't really have an example yet. Was just considering a scenario like that though.
I guess we could go ahead with the current proposal you mentioned. That should solve my issue for sure 😊
probably want to consider building upon the Trainer state defined in #770
fyi @xingzhaolee - what do you think? (we don't need to include it in #770 just thinking about building off of that)
@jeremyjordan sounds like a good idea! also I was thinking that since we have TrainerStatus, can it be the main class while TrainerMode serves as a nested enum instead of putting them all under a single enum? something like:
class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()

class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    ...
    MODE = TrainerMode
then user can access the current status and mode through:
# Interrupted when validation is running
if ... is TrainerStatus.INTERRUPTED and ... is TrainerStatus.MODE.VALIDATING:
sure! that would be flexible enough to address @lezwon 's earlier comment. i had started a branch to work on TrainerStatus, once #770 is merged i can pick up that work with your new suggestion here :)
if I understand it correctly, .auto means the enum value can differ from run to run, right? Even though it seems to start from 1 and increase incrementally...
if so, I would recommend using exact numbers in case of exporting/importing :]
not too sure about that, but the order probably matters too when a new status or mode is added. So I guess we should use exact numbers like you suggested in case any import or export is involved! 🙂
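e.g. something like this, in the spirit of the logging levels (values just illustrative):

import enum

class TrainerStatus(enum.Enum):
    # explicit values so serialized statuses stay stable when members are added or reordered
    PENDING = 10
    FAILED = 20
    INTERRUPTED = 30
    COMPLETED = 40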
i went back and forth on whether the trainer status would be a valuable addition. ultimately, i decided to opt for a simple attribute denoting when a KeyboardInterrupt has been caught.
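roughly, the trainer catches the KeyboardInterrupt and records it, and user code can branch on that flag. a sketch only (attribute and method names here are illustrative, not the actual Trainer source):

class Trainer:
    def fit(self, model):
        self.interrupted = False
        try:
            self._train(model)        # placeholder for the training loop
        except KeyboardInterrupt:
            self.interrupted = True   # remember that the user cancelled


# user code can then guard its cleanup, e.g. the notebook-shutdown case above:
def on_training_end(self):
    if not self.trainer.interrupted:
        shutdown_notebook()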