Right now, due to #795, on_training_end runs after a KeyboardInterrupt. This ends up running code that's meant to be run only after successful training completion. This feature should either be reverted, or an alternative should be provided, so as to run some code only after successful training.
I am training a model on SageMaker and have added notebook shutdown code within the on_training_end method. There were times when I had to manually cancel my model training because some parameters were incorrect. However, if I do that, the notebook shuts down immediately, because the on_training_end method runs even after a KeyboardInterrupt. I don't want my notebook shutting down after a keyboard interrupt, only after successful training completion.
Maybe add an on_training_completed method for code that's meant to be run after successful training.
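A minimal sketch of this setup (simplified; shutdown_notebook() is just a stand-in for my actual shutdown call, and the hook name is the one mentioned above):

import subprocess

import pytorch_lightning as pl


def shutdown_notebook():
    # stand-in helper: stop the notebook instance this training runs on
    subprocess.run(["sudo", "shutdown", "now"], check=False)


class MyModel(pl.LightningModule):
    ...

    def on_training_end(self):
        # problem: per #795 this hook also runs after a KeyboardInterrupt,
        # so manually cancelling a run shuts the notebook down too
        shutdown_notebook()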
Hi! Thanks for your contribution, great first issue!
Good point, it makes sense to skip the eval of an unfinished training run, e.g. when a user interrupts it because they don't want to wait another hour lol. Mind sending a PR? :robot:
Thanks for bringing this use-case to our attention @lezwon , definitely something we'd want to consider.
What if the Trainer object had a status property? Compute jobs typically have a status like {PENDING, RUNNING, COMPLETE, FAILED}, and we could apply the same idea to the training job. We could then differentiate the two cases with a status of INTERRUPTED vs COMPLETED, and your callback logic can check for the proper status before closing (e.g. shut down the notebook on FAILED/COMPLETED but don't shut down for INTERRUPTED).
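For example, your hook could look something like this (trainer.status is the attribute being proposed here, not an existing API, and shutdown_notebook() is the stand-in helper from the sketch above):

def on_training_end(self):
    # self.trainer.status is the proposed status attribute (hypothetical)
    if self.trainer.status in ("COMPLETED", "FAILED"):
        shutdown_notebook()
    # on "INTERRUPTED" we do nothing, so a manual cancel keeps the notebook alive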
What do you think?
@jeremyjordan that sounds great. We could definitely do that.
Not sure if we really need to add complexity to the existing callbacks...
Also, completing training is the standard behavior, so maybe rather add some cleanup for the interrupted case?
@jeremyjordan status sounds cool, in my opinion it is a much better signal than the 0/1 returned from e.g. .fit =)
@Borda are you suggesting that we add a new callback (on_training_completed that @lezwon mentioned) which runs conditionally according to the value returned from fit()? My only worry is that it might be confusing to know the difference between on_training_end (existing) and on_training_complete (proposed).
at the moment I was just thinking about adding a Trainer status...
not sure if we want to add complexity with a new callback...
Ok yes I agree, I think the Trainer status would be a simple solution that may also be useful in other situations :)
Cool, would anyone like to send a PR with the Trainer status? I'd suggest implementing it as an enum/numbering, similar to logging.INFO/DEBUG/...
Will this status also account for validation/test completed/interrupted?
yes we could do something like:
from enum import Enum

class TrainerStatus(Enum):
    PENDING = "Initializing Trainer object"
    TRAINING = "Optimizing model via train loop"
    VALIDATING = "Running model on validation set"
    TESTING = "Running model on test set"
    FAILED = "Trainer failed to complete a successful run"
    INTERRUPTED = "Training was interrupted by the user"
    COMPLETED = "Training completed successfully"
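To illustrate, the trainer could move through these states roughly like this (a sketch only; _run_train_loop is a placeholder, not an actual Trainer method):

def fit(self, model):
    self.status = TrainerStatus.TRAINING
    try:
        # the loop itself would switch to VALIDATING / TESTING as it goes
        self._run_train_loop(model)
        self.status = TrainerStatus.COMPLETED
    except KeyboardInterrupt:
        self.status = TrainerStatus.INTERRUPTED
    except Exception:
        self.status = TrainerStatus.FAILED
        raise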
I can work on getting this into a PR
Hey @jeremyjordan, what if we would like to execute different actions if the trainer failed during a validation task? Can we find out which task it failed at from the status? Just wondering if this would be flexible in such a scenario.
@lezwon do you have an example in mind of where you'd need that?
i'd prefer to keep the implementation simple and generic enough such that we don't keep adding new status types.
Don't really have an example yet. Was just considering a scenario like that though.
I guess we could go ahead with the current proposal you mentioned. That should solve my issue for sure 😊
probably want to consider building upon the Trainer state defined in #770
fyi @xingzhaolee - what do you think? (we don't need to include it in #770 just thinking about building off of that)
@jeremyjordan sounds like a good idea! also I was thinking that since we have TrainerStatus, can it be the main class while TrainerMode serves as a nested enum instead of putting them all under a single enum? something like:
class TrainerMode(enum.Enum):
    TRAINING = enum.auto()
    VALIDATING = enum.auto()
    TESTING = enum.auto()

class TrainerStatus(enum.Enum):
    PENDING = enum.auto()
    FAILED = enum.auto()
    INTERRUPTED = enum.auto()
    ...
    MODE = TrainerMode
then user can access the current status and mode through:
# Interrupted when validation is running
if ... is TrainerStatus.INTERRUPTED and ... is TrainerStatus.MODE.VALIDATING:
sure! that would be flexible enough to address @lezwon 's earlier comment. i had started a branch to work on TrainerStatus, once #770 is merged i can pick up that work with your new suggestion here :)
if I understand it correctly, .auto means the enum value can differ from run to run, right? Even though it seems to start from 1 and increase incrementally...
if so, I would recommend using exact numbers in case of exporting/importing :]
not too sure about that, but the order probably matters too when a new status or mode is added. So I guess we should use exact numbers like you suggested in case any import or export is involved! 🙂
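e.g. something like this, in the spirit of the logging levels (values just illustrative):

import enum

class TrainerStatus(enum.Enum):
    # explicit values so serialized statuses stay stable when members are added or reordered
    PENDING = 10
    FAILED = 20
    INTERRUPTED = 30
    COMPLETED = 40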
i went back and forth on whether the trainer status would be a valuable addition. ultimately, i decided to opt for a simple attribute denoting when a KeyboardInterrupt has been caught.
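roughly, the trainer catches the KeyboardInterrupt and records it, and user code can branch on that flag. a sketch only (attribute and method names here are illustrative, not the actual Trainer source):

class Trainer:
    def fit(self, model):
        self.interrupted = False
        try:
            self._train(model)        # placeholder for the training loop
        except KeyboardInterrupt:
            self.interrupted = True   # remember that the user cancelled


# user code can then guard its cleanup, e.g. the notebook-shutdown case above:
def on_training_end(self):
    if not self.trainer.interrupted:
        shutdown_notebook()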