I am trying to train a NN model on very large tabular data (about half a billion rows), and I am wondering if I can save the model every fixed number of steps within an epoch (every million steps, for example) instead of once per epoch, because a full epoch takes too much time. I don't know if this is possible in the PyTorch Lightning framework.
Hi! Thanks for your contribution, great first issue!
I don't know if any built-in functionality exists in PyTorch Lightning to handle this, but you can still save the model within training_step if you want. training_step receives a batch_nb argument; you can use it to save the model on whatever schedule suits you.
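For example, a minimal sketch of that idea; the save frequency, file name, and model internals here are all illustrative, not part of any Lightning API:

import torch
from torch.nn import functional as F
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        # manually dump raw weights every million batches (threshold is arbitrary)
        if batch_nb > 0 and batch_nb % 1_000_000 == 0:
            torch.save(self.state_dict(), f'manual_step_{batch_nb}.pt')
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())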
https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval
Run validation every set number of steps via val_check_interval, and checkpoints will be saved at each validation.
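For instance, a sketch (the interval value is illustrative, and model is assumed to be your LightningModule); an int means a number of training batches, while a float in (0, 1] means a fraction of the epoch:

import pytorch_lightning as pl

# run the validation loop (and therefore checkpointing) every 50,000 training batches
trainer = pl.Trainer(val_check_interval=50000)
trainer.fit(model)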
Awesome, @williamFalcon. Is there any way to just save the model, without running validation during training, using val_check_interval or ModelCheckpoint?
https://pytorch-lightning.readthedocs.io/en/stable/trainer.html#val-check-interval
Run validation every set number of steps via val_check_interval, and checkpoints will be saved at each validation.
That is great. Thank you!
Same as #1758? So is it working?
It works, but checkpoints saved within an epoch will disregard the save_top_k argument of ModelCheckpoint. If you want that to work, you need to set period to something negative, like -1:
https://github.com/PyTorchLightning/pytorch-lightning/blob/8c4c7b105e16fbe255e4715f54af2fa5d2a12fad/pytorch_lightning/callbacks/model_checkpoint.py#L214
My callback looks like:

from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    save_top_k=10,
    verbose=True,
    monitor='val_loss',
    mode='min',
    period=-1,  # a negative period allows saving more than once per epoch
)
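For completeness, a sketch of wiring such a callback into the Trainer together with val_check_interval, so that top-k checkpoints can be written several times per epoch (the values here are illustrative):

import pytorch_lightning as pl

trainer = pl.Trainer(
    checkpoint_callback=checkpoint_callback,
    val_check_interval=0.25,  # validate, and thus checkpoint, four times per epoch
)
trainer.fit(model)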
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It would be better if there were a warning message saying that val_check_interval won't take effect if one doesn't override the validation_step method of the LightningModule.
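For reference, a minimal validation hook pair that makes val_check_interval take effect; this is a sketch against the pre-1.0 Lightning API, added to the LightningModule from the earlier sketch, returning the val_loss key that the ModelCheckpoint above monitors:

def validation_step(self, batch, batch_nb):
    x, y = batch
    return {'val_loss': F.mse_loss(self.layer(x), y)}

def validation_epoch_end(self, outputs):
    # aggregate per-batch losses so 'val_loss' is available for checkpoint monitoring
    avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
    return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}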