Do limit_train_batches=0.5 and val_check_interval=0.5 effectively do the same thing (apart from affecting the total number of epochs)? That is, if my data loader is shuffling and I use limit_train_batches, can I safely assume that after 2 epochs I will have gone through the whole dataset, or will I only go through the same 50% of the training data twice? The docs are not super clear on this point, and this issue is a bit confusing as to what is going on: https://github.com/PyTorchLightning/pytorch-lightning/issues/2928
Also, is this statement true or false depending on single GPU vs DDP?
Thank you!
Hi! Thanks for your contribution, great first issue!
Hey, I have the same question.
A follow-up question on this, exploring the different semantics of limit_train_batches and val_check_interval:
If we use "val_check_interval" =.5
And my learning rate scheduler patience is 2, does this mean my patience is two full train epochs? or 2 val checks?
Moreover, is the checkpointing (for save best k models) logic happen at every val check? or only at the end of a full train?
@jwohlwend hopefully this notebook answers your question.
So basically, what limit_train_batches does:
All of the limiting happens after the data loading part.
So if you provide shuffle=True, the dataloader shuffles the dataset and then feeds the 1st, 2nd, ... batches into training_step; the limit just cuts the iteration off early.
If the dataset is small, the shuffled dataloader will probably end up passing through all the data over a few epochs anyway.
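To make that ordering concrete, here is a minimal sketch in plain PyTorch (not Lightning internals, just an illustration of the same idea): the shuffle happens in the dataloader first, and the limit simply stops iteration early, so the kept half is a different random subset each epoch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset of 8 samples -> 4 batches of size 2.
dataset = TensorDataset(torch.arange(8))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

limit_train_batches = 0.5
num_batches = int(len(loader) * limit_train_batches)  # keep 2 of the 4 batches

for epoch in range(2):
    seen = []
    for i, (batch,) in enumerate(loader):
        if i >= num_batches:  # roughly what limit_train_batches does
            break
        seen.extend(batch.tolist())
    # Because shuffle=True reshuffles every epoch, the kept half differs per epoch.
    print(f"epoch {epoch}: samples {seen}")
```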
For DDP, pinging @awaelchli for a clearer explanation.
To answer your question: limit_train_batches and val_check_interval do not do the same thing.
In what way do you think they are the same?
What val_check_interval does is run the full validation set at every val_check_interval fraction of the training epoch.
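For concreteness, a minimal sketch of the two Trainer configurations being compared (the values are just examples):

```python
from pytorch_lightning import Trainer

# Run the full validation loop twice per training epoch, while still
# training on every batch of the (shuffled) training set.
trainer = Trainer(val_check_interval=0.5)

# Train on only half of the batches each epoch; validation still runs once,
# at the end of that shortened epoch.
trainer = Trainer(limit_train_batches=0.5)
```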
Hi @ydcjeff. Thanks for your response. So just to confirm: if I set limit_train_batches=0.5 and shuffle=True and run for 2 epochs, will each epoch go over a different subset of my full training data? I'm worried about basically cutting my training dataset in half. I'm particularly worried about this being the case with DDP because of seeding, so I just want to confirm.
The reason I ask whether they are the same: if I want to run validation more frequently (because my training dataset is super large), do you recommend limit_train_batches or val_check_interval? Seems like you could achieve this with either?
Hi @jwohlwend, for DDP @awaelchli could explain it better than me;
I have limited knowledge of DDP.
For single GPU, if you are worried about the model not seeing the full training dataset and you just want to run validation more often, I'd use val_check_interval. With limit_train_batches, it is possible that some of the same train batches come up again even after many epochs, so there is no guarantee of covering the whole dataset.
Hi @yala
If we use "val_check_interval" =.5
And my learning rate scheduler patience is 2, does this mean my patience is two full train epochs? or 2 val checks?
~It is 2 val checks, I think.~ See: https://github.com/PyTorchLightning/pytorch-lightning/issues/4288#issuecomment-713714579
Moreover, does the checkpointing logic (for saving the best k models) happen at every val check, or only at the end of a full training epoch?
It happens at the end of the train + val epoch (if validation is present).
@ydcjeff Re: learning rate scheduling. Are you sure this is true? It seems to me that step() (with the epoch interval set) is only called at the end of the epoch: https://github.com/PyTorchLightning/pytorch-lightning/blob/f37444fa3e82011e4a71b5ca8dc897eff9ba0fa3/pytorch_lightning/trainer/trainer.py#L491
So a patience of 2 would mean 2 epochs, not 2 val checks.
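For reference, a sketch of a typical scheduler config (the module and the monitored metric name val_loss are hypothetical): with the interval set to "epoch", Lightning steps the scheduler once per training epoch, so ReduceLROnPlateau's patience is counted in epochs.

```python
import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import ReduceLROnPlateau

class MyModel(pl.LightningModule):
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # patience counts scheduler.step() calls; with interval="epoch"
        # that is one call per training epoch.
        scheduler = ReduceLROnPlateau(optimizer, patience=2)
        return [optimizer], [
            {"scheduler": scheduler, "monitor": "val_loss", "interval": "epoch"}
        ]
```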
@jwohlwend Thank you for that. I confused it with the EarlyStopping patience.
Moreover, does the checkpointing logic (for saving the best k models) happen at every val check, or only at the end of a full training epoch?
This PR will allow checkpointing after every val check https://github.com/PyTorchLightning/pytorch-lightning/pull/3807
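For context, a sketch of the save-top-k checkpointing setup being discussed (val_loss is a hypothetical logged metric name; how the callback is passed to the Trainer may differ slightly across Lightning versions):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the 3 checkpoints with the lowest validation loss.
checkpoint_cb = ModelCheckpoint(monitor="val_loss", save_top_k=3, mode="min")
trainer = Trainer(callbacks=[checkpoint_cb])
```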
(because my training dataset is super large), do you recommend limit_train_batches or val_check_interval? Seems like you could achieve this with either?
limit_train_batches just limits how many batches are drawn from the dataloader, so it is possible that your model never sees some data samples. Remember that limit_train_batches ends the epoch once the limit is reached, whereas val_check_interval continues training after validation within the same epoch.
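If the only goal is more frequent validation on a very large dataset, note that val_check_interval also accepts an integer, which runs validation every N training batches instead of at a fraction of the epoch, for example:

```python
from pytorch_lightning import Trainer

# Validate every 1000 training batches while still iterating over the
# full training set each epoch.
trainer = Trainer(val_check_interval=1000)
```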