Is your feature request related to a problem? Please describe.
In the current version, there is no way to resume training from a specific checkpoint (rather than the last one).
Sometimes (very often in my case), one needs to experiment with training under different hyperparameters (e.g. dropout rate, augmentation) starting from a specific checkpoint.
Describe the solution you'd like
Add a resume_from_checkpoint argument to the Trainer class.
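A minimal sketch of how I imagine the proposed argument being used; the model class, checkpoint path, and hyperparameters below are purely illustrative:

import pytorch_lightning as pl

# Start a new experiment (e.g. a different dropout rate) from a specific
# checkpoint instead of the last one. MyLightningModel stands in for any
# LightningModule.
model = MyLightningModel(dropout=0.5)
trainer = pl.Trainer(
    max_epochs=100,
    resume_from_checkpoint='lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt',  # proposed argument
)
trainer.fit(model)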
Describe alternatives you've considered
I tried to use the restore method of the Trainer class, but was not successful because it is meant to be used only when .fit() is called.
Additional context
FYI, I made a PR to demonstrate my idea: https://github.com/williamFalcon/pytorch-lightning/pull/516
Thanks for this. But may I ask why we have two different mechanisms for restoring?
According to the docs, the snippet below should be the proper way to restore the whole session.
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version
)
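Presumably that logger is then handed to the Trainer so the session gets picked up again; a minimal sketch, assuming model is an existing LightningModule instance:

# The Trainer is expected to continue the run stored under restore_version.
trainer = Trainer(logger=logger)
trainer.fit(model)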
Also, there is another one documented for the test method:
LightningModule.load_from_checkpoint()
But now we also have a function inside the Trainer to restore the weights. A natural question is: which one wins in the end? Will the final weights come from the TestTubeLogger or from the Trainer? Or does the documentation wrongly describe the TestTubeLogger as restoring the hyperparameter settings but not the weights? This confusion forced me to read through the implementation.
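For reference, my reading of the weights-only path; the model class and checkpoint path are illustrative:

# Rebuilds a fresh model from the checkpoint's saved hyperparameters and weights.
# No training-session state (current epoch, optimizer, LR schedulers) is restored here.
model = MyLightningModel.load_from_checkpoint(
    'lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt'
)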
To be clearer, wouldn't it make more sense to put the weights restoring in one place, like below?

restore_version: int
restore_checkpoint: int or Path-like
mode: enum, one of 'session-only', 'weights-only', 'session-weights'
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version,
    checkpoint=restore_checkpoint,
    mode=mode
)
The Test-Tube logger is not mandatory for the Lightning project; having a Test-Tube-based restore is an alternative option...
@Borda Thanks for the clarification.
So for now, we have at least two mechanisms: the TestTubeLogger session restore and LightningModule.load_from_checkpoint.
As a user, I thought it was recommended to use TestTube, since it was documented here. Also, I did not find a similar doc in 6.0 explaining how we should restore from a checkpoint "properly".
Apart from the docs, I hope the API can be refactored so everything is aggregated in one place for simplicity. If TestTube is optional, I personally think we should move all checkpoint-restoring functionality into load_from_checkpoint.
I have been trying to figure out the right API for resuming from a checkpoint for half a day, and it is not in any of the examples. The current way of resuming is not well documented and not simple enough to use, in my opinion.
@stevenguh sorry, we just migrated the docs.
https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html
load_from_checkpoint
Thank you for replying. I am still confused by the differences between load_from_checkpoint in LightningModule and the resume_from_checkpoint parameter in Trainer. It would be nice if there were a working example of resuming. Also, the project advertises auto restore, but I have no idea how to activate it.
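Is my current understanding below correct? A rough sketch with placeholder paths and a placeholder MyLightningModel:

import pytorch_lightning as pl

# Weights/hparams only: a fresh model is built from the checkpoint and then
# trained as a brand-new run.
model = MyLightningModel.load_from_checkpoint('path/to/some.ckpt')

# Full resume: the Trainer also restores the epoch counter, global step,
# optimizer and scheduler states, then continues where the run left off.
model = MyLightningModel()  # architecture must match the checkpointed run
trainer = pl.Trainer(resume_from_checkpoint='path/to/some.ckpt')
trainer.fit(model)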
I can't even resume training from the last checkpoint with the new Trainer. Is this feature broken?
I can't find in the docs the correct way to set auto-resume / auto-load from the last best checkpoint either.
TestTubeLogger did the job before, but since I updated the libs, every training run starts from the beginning.
Can you post an example? Our tests for this are passing.