Pytorch-lightning: How to save the model after certain steps instead of epoch?

Created on 13 May 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to train an NN model on very large tabular data (about half a billion rows), and I am wondering whether I can save the model every certain number of steps (say, every million) within an epoch instead of once per epoch, because a full epoch simply takes too long. I don't know if this is possible in the PyTorch Lightning framework.

Code

What have you tried?

What's your environment?

  • OS: Linux
  • Packaging: conda
  • Version: 0.7.5
Labels: question, won't fix

All 9 comments

Hi! Thanks for your contribution, great first issue!

I don't know if PyTorch Lightning has built-in functionality for this, but you can still save the model within training_step if you want. training_step receives a batch index (batch_nb), which you can use to save the model on whatever schedule suits you, as in the sketch below.
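For example, a minimal sketch of step-based saving inside training_step (compute_loss, the checkpoint filename, and the save_every_n_steps hyperparameter are made up for illustration; newer versions name the counter batch_idx):

    import torch
    import pytorch_lightning as pl

    class BigTabularModel(pl.LightningModule):
        # other required hooks (configure_optimizers, dataloaders, ...) omitted
        def __init__(self, save_every_n_steps=1_000_000):
            super().__init__()
            # made-up hyperparameter: number of batches between saves
            self.save_every_n_steps = save_every_n_steps

        def training_step(self, batch, batch_nb):
            loss = self.compute_loss(batch)  # your usual loss computation
            # save mid-epoch instead of waiting for the epoch to end
            if batch_nb > 0 and batch_nb % self.save_every_n_steps == 0:
                torch.save(self.state_dict(), f"model_step_{batch_nb}.pt")
            return {'loss': loss}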

https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval

Run validation every set number of steps, and checkpoints will be saved.
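A minimal sketch of that flag, assuming a model object is already defined (an int means "every N training batches"; a float in [0.0, 1.0] means a fraction of the epoch):

    from pytorch_lightning import Trainer

    # run the validation loop (and therefore checkpointing) every
    # 10,000 training batches instead of once per epoch
    trainer = Trainer(val_check_interval=10000)
    trainer.fit(model)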

Awesome, @williamFalcon. Is there any way to just save the model without running validation during training, using val_check_interval or ModelCheckpoint?

https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval

Run validation every set number of steps, and checkpoints will be saved.

That is great. Thank you!

Same as #1758? So is it working?

It works, but it will disregard the save_top_k argument for checkpoints within an epoch in the ModelCheckpoint. If you want that to work, you need to set period to something negative, like -1.

https://github.com/PyTorchLightning/pytorch-lightning/blob/8c4c7b105e16fbe255e4715f54af2fa5d2a12fad/pytorch_lightning/callbacks/model_checkpoint.py#L214

My callback looks like:

    checkpoint_callback = ModelCheckpoint(
        save_top_k=10,
        verbose=True,
        monitor='val_loss',
        mode='min',
        period=-1,
    )
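For completeness, a sketch of wiring that callback into the Trainer together with a step-based val_check_interval, assuming a model is defined (the interval value is arbitrary; in the 0.7.x API the callback is passed via the checkpoint_callback argument):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        checkpoint_callback=checkpoint_callback,
        # validate (and therefore checkpoint) every 10,000 training batches
        val_check_interval=10000,
    )
    trainer.fit(model)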

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

It would be better if there were a warning message saying that val_check_interval won't work if one doesn't override the validation_step method of the LightningModule.
