Pytorch-lightning: How to save the model after certain steps instead of epoch?

Created on 13 May 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

โ“ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I am trying to train an NN model on very large tabular data (about half a billion rows), and I am wondering whether I can save the model every certain number of steps (say, every million) within an epoch instead of once per epoch, because a full epoch simply takes too long. I don't know if this is possible in the PyTorch Lightning framework.

Code

What have you tried?

What's your environment?

  • OS: Linux
  • Packaging: conda
  • Version: 0.7.5
Labels: question, won't fix

All 9 comments

Hi! Thanks for your contribution, great first issue!

I don't know if PyTorch Lightning has built-in functionality for this, but you can still save the model within training_step if you want. training_step receives a batch index (batch_nb), which you can use to save the model on whatever schedule suits you, as in the sketch below.
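For example, a minimal sketch of step-based saving inside training_step (compute_loss, the checkpoint filename, and the save_every_n_steps hyperparameter are made up for illustration; newer versions name the counter batch_idx):

    import torch
    import pytorch_lightning as pl

    class BigTabularModel(pl.LightningModule):
        # other required hooks (configure_optimizers, dataloaders, ...) omitted
        def __init__(self, save_every_n_steps=1_000_000):
            super().__init__()
            # made-up hyperparameter: number of batches between saves
            self.save_every_n_steps = save_every_n_steps

        def training_step(self, batch, batch_nb):
            loss = self.compute_loss(batch)  # your usual loss computation
            # save mid-epoch instead of waiting for the epoch to end
            if batch_nb > 0 and batch_nb % self.save_every_n_steps == 0:
                torch.save(self.state_dict(), f"model_step_{batch_nb}.pt")
            return {'loss': loss}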

https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval

Run validation every set number of steps, and checkpoints will be saved.
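A minimal sketch of that flag, assuming a model object is already defined (an int means "every N training batches"; a float in [0.0, 1.0] means a fraction of the epoch):

    from pytorch_lightning import Trainer

    # run the validation loop (and therefore checkpointing) every
    # 10,000 training batches instead of once per epoch
    trainer = Trainer(val_check_interval=10000)
    trainer.fit(model)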

Awesome, @williamFalcon. Is there any way to just save the model without running validation during training, using val_check_interval or ModelCheckpoint?

https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval

Run validation every set number of steps, and checkpoints will be saved.

That is great. Thank you!

Same as #1758? So is it working?

It works, but it will disregard the save_top_k argument for checkpoints within an epoch in the ModelCheckpoint. If you want that to work, you need to set period to something negative, like -1.

https://github.com/PyTorchLightning/pytorch-lightning/blob/8c4c7b105e16fbe255e4715f54af2fa5d2a12fad/pytorch_lightning/callbacks/model_checkpoint.py#L214

My callback looks like:

    checkpoint_callback = ModelCheckpoint(
        save_top_k=10,
        verbose=True,
        monitor='val_loss',
        mode='min',
        period=-1,
    )
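For completeness, a sketch of wiring that callback into the Trainer together with a step-based val_check_interval, assuming a model is defined (the interval value is arbitrary; in the 0.7.x API the callback is passed via the checkpoint_callback argument):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        checkpoint_callback=checkpoint_callback,
        # validate (and therefore checkpoint) every 10,000 training batches
        val_check_interval=10000,
    )
    trainer.fit(model)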

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

It would be better if there were a warning message saying that val_check_interval won't work if one doesn't override the validation_step method of the LightningModule.
