Currently PyTorch Lightning uses the latest version of the model for testing. In research, we usually want to load the best checkpoint first and run testing from there. It would also be good to have the option of restarting from the best checkpoint after a learning-rate plateau.
We want the best model for training/testing. For NLP in particular, it is more natural to go back to the best checkpoint and restart with a decayed learning rate.
The current workaround is manual loading and checking of the validation value, which goes against Lightning principles.
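For context, the manual workaround looks roughly like the sketch below; the directory layout, the filename pattern with the metric embedded in it, and the helper name are assumptions for illustration, not the issue author's actual code.

import glob
import re

# Hypothetical manual workaround: scan the checkpoint directory and pick the
# file with the lowest validation loss encoded in its name, e.g.
# "epoch=3-val_loss=0.1234.ckpt" (filename pattern is an assumption here).
def find_best_checkpoint(ckpt_dir="checkpoints"):
    best_path, best_val = None, float("inf")
    for path in glob.glob(f"{ckpt_dir}/*.ckpt"):
        match = re.search(r"val_loss=(\d+(?:\.\d+)?)", path)
        if match and float(match.group(1)) < best_val:
            best_val, best_path = float(match.group(1)), path
    return best_path

# The returned path would then be passed to Model.load_from_checkpoint(...)
# before calling trainer.test(), as in the snippet discussed below.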
Hi! Thanks for your contribution! Great first issue!
This makes sense. How do you suggest doing it? Ideally you'd do this in Lightning:
model = Model.load_from_checkpoint(PATH)
trainer.test(model)
Why doesn't this fit your use case?
There are 2 issues here:
At first I thought it would be as easy as loading the checkpoint internally (during the training loop). But I realized that in a multi-GPU setting, the parameters need to be copied to the PyTorch module on every GPU.
In summary: loading the best checkpoint during training is simple on a single GPU, but with multiple GPUs the weights have to be restored on every replica (a rough sketch follows).
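A rough sketch of the "reload internally during the training loop" idea: the callback name, its parameters, and the plateau logic are assumptions rather than an existing Lightning feature. Each DDP process would run the hook and load the weights onto its own replica, which is the multi-GPU concern raised above.

import torch
import pytorch_lightning as pl

class ReloadBestOnPlateau(pl.Callback):
    """Hypothetical callback: if the monitored metric has not improved for
    `patience` validation runs, reload the weights of the best checkpoint
    saved so far by the ModelCheckpoint callback."""

    def __init__(self, checkpoint_callback, monitor="val_loss", patience=3):
        self.checkpoint_callback = checkpoint_callback
        self.monitor = monitor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def on_validation_end(self, trainer, pl_module):
        current = trainer.callback_metrics.get(self.monitor)
        if current is None:
            return
        if current < self.best:
            self.best = float(current)
            self.wait = 0
            return
        self.wait += 1
        if self.wait >= self.patience and self.checkpoint_callback.best_model_path:
            # Reload the best weights in place; each process loads onto its
            # own copy of the module, so this also applies under DDP.
            ckpt = torch.load(self.checkpoint_callback.best_model_path,
                              map_location=pl_module.device)
            pl_module.load_state_dict(ckpt["state_dict"])
            self.wait = 0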
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This will need the specific checkpoint path, right?
Is there a way to give it the dir path and have it load the best ckpt based on the ModelCheckpoint object's logic?
Yes, the path is stored in the checkpoint callback.
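A short sketch of that flow, assuming a ModelCheckpoint callback is attached to the Trainer; the monitor key, max_epochs, and the model names are placeholders, and exact argument names may differ between Lightning versions.

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

# The ModelCheckpoint callback remembers the best checkpoint it has written,
# so the test step can reuse it without a hard-coded path.
checkpoint_callback = ModelCheckpoint(monitor="val_loss", mode="min")
trainer = pl.Trainer(callbacks=[checkpoint_callback], max_epochs=10)
trainer.fit(model)  # `model` / `Model` as in the snippet above

best_model = Model.load_from_checkpoint(checkpoint_callback.best_model_path)
trainer.test(best_model)

If I remember correctly, newer Lightning releases also accept trainer.test(ckpt_path="best"), which resolves the path through the same callback.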
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!
already solved...