Do limit_train_batches=0.5 and val_check_interval=0.5 effectively do the same thing (apart from affecting the total number of epochs)? That is, if my data loader is shuffling and I use limit_train_batches, can I safely assume that after 2 epochs I will have gone through the whole dataset, or will I only go through the same 50% of the training data twice? The docs are not super clear on this point, and this issue is a bit confusing as to what is going on: https://github.com/PyTorchLightning/pytorch-lightning/issues/2928
Also, is this statement true or false depending on single GPU vs DDP?
Thank you!
Hi! Thanks for your contribution, great first issue!
Hey, I have the same question.
A follow-up question on this, exploring the different semantics of limit_train_batches and val_check_interval:
If we use "val_check_interval" =.5
And my learning rate scheduler patience is 2, does this mean my patience is two full train epochs? or 2 val checks?
Moreover, is the checkpointing (for save best k models) logic happen at every val check? or only at the end of a full train?
@jwohlwend hopefully this notebook answers your question.
So basically, what limit_train_batches does:
All of the limiting happens after the data loading part.
So if you provide shuffle=True, the dataloader shuffles the dataset and then feeds the 1st, 2nd, ... batches into training_step; the limit just cuts the iteration off early.
If the dataset is small, the shuffled dataloader will probably end up passing through all the data over a few epochs anyway.
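To make that ordering concrete, here is a minimal sketch in plain PyTorch (not Lightning internals, just an illustration of the same idea): the shuffle happens in the dataloader first, and the limit simply stops iteration early, so the kept half is a different random subset each epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 8 samples -> 4 batches of size 2.
dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

limit_train_batches = 0.5
num_batches = int(len(loader) * limit_train_batches)  # keep 2 of the 4 batches

for epoch in range(2):
    seen = []
    for i, (batch,) in enumerate(loader):
        if i >= num_batches:  # roughly what limit_train_batches does
            break
        seen.extend(batch.tolist())
    # Because shuffle=True reshuffles every epoch, the kept half differs per epoch.
    print(f"epoch {epoch}: samples {seen}")
```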
For DDP, pinging @awaelchli for a clearer explanation.
To answer your question: limit_train_batches and val_check_interval do not do the same thing.
In what way do you think they are the same?
What val_check_interval does is run the full validation set at every val_check_interval fraction of the training epoch.
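For concreteness, a minimal sketch of the two Trainer configurations being compared (the values are just examples):

```python
from pytorch_lightning import Trainer

# Run the full validation loop twice per training epoch, while still
# training on every batch of the (shuffled) training set.
trainer = Trainer(val_check_interval=0.5)

# Train on only half of the batches each epoch; validation still runs once,
# at the end of that shortened epoch.
trainer = Trainer(limit_train_batches=0.5)
```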
Hi @ydcjeff. Thanks for your response. So just to confirm: if I set limit_train_batches=0.5 and shuffle=True and run for 2 epochs, will each epoch go over a different subset of my full training data? I'm worried about basically cutting my training dataset in half. I'm particularly worried about this being the case with DDP because of seeding, so I just want to confirm.
The reason I ask whether they are the same: if I want to run validation more frequently (because my training dataset is super large), do you recommend limit_train_batches or val_check_interval? Seems like you could achieve this with either?
Hi @jwohlwend, for DDP @awaelchli could explain it better than me;
I have limited knowledge of DDP.
For single GPU, if you are worried about the model not seeing the full training dataset and you just want to run validation more often, I'd use val_check_interval. With limit_train_batches, it is possible that some of the same train batches come up again even after many epochs, so there is no guarantee of covering the whole dataset.
Hi @yala
If we use "val_check_interval" =.5
And my learning rate scheduler patience is 2, does this mean my patience is two full train epochs? or 2 val checks?
~It is 2 val checks, I think.~ See: https://github.com/PyTorchLightning/pytorch-lightning/issues/4288#issuecomment-713714579
Moreover, does the checkpointing logic (for saving the best k models) happen at every val check, or only at the end of a full training epoch?
It happens at the end of the train + val epoch (if validation is present).
@ydcjeff Re: learning rate scheduling. Are you sure this is true? It seems to me that step() (with the epoch interval set) is only called at the end of the epoch: https://github.com/PyTorchLightning/pytorch-lightning/blob/f37444fa3e82011e4a71b5ca8dc897eff9ba0fa3/pytorch_lightning/trainer/trainer.py#L491
So a patience of 2 would mean 2 epochs, not 2 val checks.
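For reference, a sketch of a typical scheduler config (the module and the monitored metric name val_loss are hypothetical): with the interval set to "epoch", Lightning steps the scheduler once per training epoch, so ReduceLROnPlateau's patience is counted in epochs.

```python
import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import ReduceLROnPlateau

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # patience counts scheduler.step() calls; with interval="epoch"
        # that is one call per training epoch.
        scheduler = ReduceLROnPlateau(optimizer, patience=2)
        return [optimizer], [
            {"scheduler": scheduler, "monitor": "val_loss", "interval": "epoch"}
        ]
```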
@jwohlwend Thank you for that. I confused it with the EarlyStopping patience.
Moreover, does the checkpointing logic (for saving the best k models) happen at every val check, or only at the end of a full training epoch?
This PR will allow checkpointing after every val check https://github.com/PyTorchLightning/pytorch-lightning/pull/3807
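For context, a sketch of the save-top-k checkpointing setup being discussed (val_loss is a hypothetical logged metric name; how the callback is passed to the Trainer may differ slightly across Lightning versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the 3 checkpoints with the lowest validation loss.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=3, mode="min")
trainer = Trainer(callbacks=[checkpoint_cb])
```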
(because my training dataset is super large), do you recommend limit_train_batches or val_check_interval? Seems like you could achieve this with either?
limit_train_batches just limits how many batches are drawn from the dataloader, so it is possible that your model never sees some data samples. Remember that limit_train_batches ends the epoch once the limit is reached, whereas val_check_interval continues training after validation within the same epoch.
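If the only goal is more frequent validation on a very large dataset, note that val_check_interval also accepts an integer, which runs validation every N training batches instead of at a fraction of the epoch, for example:

```python
from pytorch_lightning import Trainer

# Validate every 1000 training batches while still iterating over the
# full training set each epoch.
trainer = Trainer(val_check_interval=1000)
```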