Is your feature request related to a problem? Please describe.
In the current version, there is no way to resume training from a specific checkpoint (rather than the last one).
Sometimes (very often in my case), one needs to experiment with training under different hyperparameters (e.g. dropout rate, augmentation) starting from a specific checkpoint.
Describe the solution you'd like
Add a resume_from_checkpoint argument to the Trainer class.
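A minimal sketch of how I imagine the proposed argument being used; the model class, checkpoint path, and hyperparameters below are purely illustrative:

import pytorch_lightning as pl

# Start a new experiment (e.g. a different dropout rate) from a specific
# checkpoint instead of the last one. MyLightningModel stands in for any
# LightningModule.
model = MyLightningModel(dropout=0.5)
trainer = pl.Trainer(
    max_epochs=100,
    resume_from_checkpoint='lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt',  # proposed argument
)
trainer.fit(model)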
Describe alternatives you've considered
I tried to use the restore method of the Trainer class, but was not successful because it is meant to be used only when .fit() is called.
Additional context
FYI, I made a PR to demonstrate my idea: https://github.com/williamFalcon/pytorch-lightning/pull/516
Thanks for this. But may I ask why we have two different mechanisms for restoring?
According to the docs, the snippet below should be the proper way to restore the whole session.
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version
)
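Presumably that logger is then handed to the Trainer so the session gets picked up again; a minimal sketch, assuming model is an existing LightningModule instance:

# The Trainer is expected to continue the run stored under restore_version.
trainer = Trainer(logger=logger)
trainer.fit(model)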
Also, there is another one documented for the test method:
LightningModule.load_from_checkpoint()
But now we also have a function inside the Trainer to restore the weights. A natural question is: which one wins in the end? Will the final weights come from the TestTubeLogger or from the Trainer? Or does the documentation wrongly describe the TestTubeLogger as restoring the hyperparameter settings but not the weights? This confusion forced me to read through the implementation.
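For reference, my reading of the weights-only path; the model class and checkpoint path are illustrative:

# Rebuilds a fresh model from the checkpoint's saved hyperparameters and weights.
# No training-session state (current epoch, optimizer, LR schedulers) is restored here.
model = MyLightningModel.load_from_checkpoint(
    'lightning_logs/version_2/checkpoints/_ckpt_epoch_7.ckpt'
)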
To be clearer, wouldn't it make more sense to put the weights restoring in one place, like below?

restore_version: int
restore_checkpoint: int or Path-like
mode: enum, one of 'session-only', 'weights-only', 'session-weights'
logger = TestTubeLogger(
    save_dir='./lightning_logs',
    version=restore_version,
    checkpoint=restore_checkpoint,
    mode=mode
)
The Test-Tube logger is not mandatory for the Lightning project; having a Test-Tube-based restore is an alternative option...
@Borda Thanks for the clarification.
So for now, we have at least two mechanisms: the TestTubeLogger session restore and LightningModule.load_from_checkpoint.
As a user, I thought it was recommended to use TestTube, since it was documented here. Also, I did not find a similar doc in 6.0 explaining how we should restore from a checkpoint "properly".
Apart from the docs, I hope the API can be refactored so everything is aggregated in one place for simplicity. If TestTube is optional, I personally think we should move all checkpoint-restoring functionality into load_from_checkpoint.
I have been trying to figure out the right API for resuming from a checkpoint for half a day, and it is not in any of the examples. The current way of resuming is not well documented and not simple enough to use, in my opinion.
@stevenguh sorry, we just migrated the docs.
https://pytorch-lightning.readthedocs.io/en/latest/lightning-module.html
load_from_checkpoint
Thank you for replying. I am still confused by the differences between load_from_checkpoint in LightningModule and the resume_from_checkpoint parameter in Trainer. It would be nice if there were a working example of resuming. Also, the project advertises auto restore, but I have no idea how to activate it.
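Is my current understanding below correct? A rough sketch with placeholder paths and a placeholder MyLightningModel:

import pytorch_lightning as pl

# Weights/hparams only: a fresh model is built from the checkpoint and then
# trained as a brand-new run.
model = MyLightningModel.load_from_checkpoint('path/to/some.ckpt')

# Full resume: the Trainer also restores the epoch counter, global step,
# optimizer and scheduler states, then continues where the run left off.
model = MyLightningModel()  # architecture must match the checkpointed run
trainer = pl.Trainer(resume_from_checkpoint='path/to/some.ckpt')
trainer.fit(model)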
I can't even resume training from the last checkpoint with the new Trainer. Is this feature broken?
I can't find in the docs the correct way to set auto-resume / auto-load from the last best checkpoint either.
TestTubeLogger did the job before, but since I updated the libs, every training run starts from the beginning.
Can you post an example? Our tests for this are passing.