When I try to use overfit_batches:
https://pytorch-lightning.readthedocs.io/en/latest/debugging.html#make-model-overfit-on-subset-of-data
trainer = Trainer(gpus=num_gpus, max_epochs=config.epochs, overfit_batches=0.01, logger=logger)
my code fails with:
trainer.fit(module)
File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
self.single_gpu_train(model)
File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
self.run_pretrain_routine(model)
File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1065, in run_pretrain_routine
self.reset_val_dataloader(ref_model)
File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 331, in reset_val_dataloader
self._reset_eval_dataloader(model, 'val')
File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 314, in _reset_eval_dataloader
f'you requested to check {limit_eval_batches} of the {mode} dataloader but'
pytorch_lightning.utilities.exceptions.MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091
P.S.: I also tried setting limit_val_batches=0.09090909090909091. Same error.
Did you check whether the length of your dataloader (how many iterations it has) is different from 0?
Not sure what you mean, but I am able to train the regular way... without the overfit_batches setting.
Does the validation step of your model run without any problems?
Oh yeah.
Not sure what you mean, but I am able to train the regular way... without the overfit_batches setting.
Can you check the value of len(valid_dataloader)?
I've also tried using overfit_batches on the MNIST dataset and it didn't work. The training continues for 1000 epochs and then stops (hitting max_epochs). I observed the loss fluctuating around 0.1 to 0.2 the whole time, whereas in actual training my model reached train_loss=0.02 in just 4 epochs.
When I try to use overfit_batches with trainer = Trainer(gpus=num_gpus, max_epochs=config.epochs, overfit_batches=0.01, logger=logger), my code fails with: MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091
P.S.: I also tried setting limit_val_batches=0.09090909090909091. Same error.
try using a slightly bigger overfit_batches (e.g. overfit_batches=0.1) or number of batches (e.g. overfit_batches=10)
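For reference, the two variants of that suggestion look like this as a minimal sketch (the remaining Trainer arguments from the original post are omitted):

from pytorch_lightning import Trainer

# float -> use this fraction of the training batches
trainer = Trainer(overfit_batches=0.1)
# int -> use exactly this many training batches
trainer = Trainer(overfit_batches=10)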
I've also tried using overfit_batches on the MNIST dataset and it didn't work. The training continues for 1000 epochs and then stops (hitting max_epochs). I observed the loss fluctuating around 0.1 to 0.2 the whole time, whereas in actual training my model reached train_loss=0.02 in just 4 epochs.
Probably the training gets stuck in a local optimum.
I've also tried using overfit_batches on the MNIST dataset and it didn't work. The training continues for 1000 epochs and then stops (hitting max_epochs). I observed the loss fluctuating around 0.1 to 0.2 the whole time, whereas in actual training my model reached train_loss=0.02 in just 4 epochs.
overfit_batches just reduces your num_batches so that it can overfit your model on a small number of batches, to check whether the model can fit your dataset or not. It will still run for n epochs no matter what value you set overfit_batches to.
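In other words, if you only want a short overfitting sanity run you still have to bound it yourself, e.g. with max_epochs (a minimal sketch, the numbers are arbitrary):

from pytorch_lightning import Trainer

# overfit_batches only shrinks the data seen per epoch;
# max_epochs is what actually bounds how long the run lasts
trainer = Trainer(overfit_batches=10, max_epochs=20)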
@Kshitij09 you also need to adjust your learning rate... otherwise it might get stuck in a local min (likely lower your lr)
I've also tried using overfit_batches on the MNIST dataset and it didn't work. The training continues for 1000 epochs and then stops (hitting max_epochs). I observed the loss fluctuating around 0.1 to 0.2 the whole time, whereas in actual training my model reached train_loss=0.02 in just 4 epochs.

overfit_batches just reduces your num_batches so that it can overfit your model on a small number of batches, to check whether the model can fit your dataset or not. It will still run for n epochs no matter what value you set overfit_batches to.
@rohitgr7 so do I need to couple it with early_stop_callback?
@Kshitij09 you also need to adjust your learning rate... otherwise it might get stuck in a local min (likely lower your lr)
@williamFalcon yes, I've also incorporated ReduceLROnPlateau with this, which dropped the lr down to 1e-8/1e-9, but it didn't stop training.
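One way to make such a run terminate on its own is to pair overfit_batches with early stopping. A sketch, assuming the early_stop_callback Trainer argument available in this Lightning version and that 'train_loss' is a metric you actually log (otherwise monitor 'val_loss'):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# stop once the logged 'train_loss' has not improved for 10 epochs
early_stop = EarlyStopping(monitor='train_loss', patience=10, mode='min')
trainer = Trainer(overfit_batches=10, max_epochs=1000, early_stop_callback=early_stop)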
import os
import time
import torch
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.i = 0

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, (y, idxs) = batch
        # Print the dataset indices in each batch so we can see which items are used.
        print(f"training step {self.i}, batch_idx {batch_idx}, items: {idxs.cpu().numpy()}")
        self.i += 1
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.TrainResult(loss)
        result.log('train_loss', loss, on_epoch=True)
        return result

    def validation_step(self, batch, batch_idx):
        x, (y, idxs) = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.EvalResult(checkpoint_on=loss)
        result.log('val_loss', loss)
        return result

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


class MNISTDataset(MNIST):
    # Also return the dataset index alongside the label.
    def __getitem__(self, item):
        x, y = super().__getitem__(item)
        return (x, (y, item))


# train!
dataset = MNISTDataset(os.getcwd(), download=True, transform=transforms.ToTensor())
train, val = random_split(dataset, [55000, 5000])

model = LitClassifier()
trainer = pl.Trainer(overfit_batches=1, gpus=1, progress_bar_refresh_rate=0)
trainer.fit(model, DataLoader(train, shuffle=False, batch_size=4, num_workers=0),
            DataLoader(val, batch_size=4, num_workers=0))
Produces
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)
  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7 K
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
training step 0, batch_idx 0, items: [13556 55560 49266 5079]
training step 1, batch_idx 0, items: [13556 55560 49266 5079]
training step 2, batch_idx 0, items: [13556 55560 49266 5079]
training step 3, batch_idx 0, items: [13556 55560 49266 5079]
training step 4, batch_idx 0, items: [13556 55560 49266 5079]
training step 5, batch_idx 0, items: [13556 55560 49266 5079]
training step 6, batch_idx 0, items: [13556 55560 49266 5079]
training step 7, batch_idx 0, items: [13556 55560 49266 5079]
training step 8, batch_idx 0, items: [13556 55560 49266 5079]
training step 9, batch_idx 0, items: [13556 55560 49266 5079]
training step 10, batch_idx 0, items: [13556 55560 49266 5079]
training step 11, batch_idx 0, items: [13556 55560 49266 5079]
training step 12, batch_idx 0, items: [13556 55560 49266 5079]
training step 13, batch_idx 0, items: [13556 55560 49266 5079]
training step 14, batch_idx 0, items: [13556 55560 49266 5079]
training step 15, batch_idx 0, items: [13556 55560 49266 5079]
training step 16, batch_idx 0, items: [13556 55560 49266 5079]
training step 17, batch_idx 0, items: [13556 55560 49266 5079]
training step 18, batch_idx 0, items: [13556 55560 49266 5079]
training step 19, batch_idx 0, items: [13556 55560 49266 5079]
training step 20, batch_idx 0, items: [13556 55560 49266 5079]
training step 21, batch_idx 0, items: [13556 55560 49266 5079]
training step 22, batch_idx 0, items: [13556 55560 49266 5079]
training step 23, batch_idx 0, items: [13556 55560 49266 5079]
training step 24, batch_idx 0, items: [13556 55560 49266 5079]
Toggling shuffle=True results in
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
  warnings.warn(*args, **kwargs)
  | Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7 K
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: You requested to overfit but enabled training dataloader shuffling. We are turning it off for you.
  warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  warnings.warn(*args, **kwargs)
training step 0, batch_idx 0, items: [19838 40946 18620 21942]
training step 1, batch_idx 0, items: [59940 37871 44153 6899]
training step 2, batch_idx 0, items: [53096 12012 22479 27454]
training step 3, batch_idx 0, items: [12640 55771 22517 27844]
training step 4, batch_idx 0, items: [35820 56735 43191 58511]
training step 5, batch_idx 0, items: [35818 38129 4901 46901]
training step 6, batch_idx 0, items: [21038 14631 15166 15581]
training step 7, batch_idx 0, items: [ 7095 15539 8672 39255]
training step 8, batch_idx 0, items: [ 6397 24324 27822 53308]
training step 9, batch_idx 0, items: [ 7261 45991 58502 38393]
training step 10, batch_idx 0, items: [50646 43129 4348 32436]
training step 11, batch_idx 0, items: [11271 13858 11991 43261]
training step 12, batch_idx 0, items: [29346 42714 52281 36790]
training step 13, batch_idx 0, items: [21324 32598 43017 8024]
training step 14, batch_idx 0, items: [30809 50140 5554 36657]
training step 15, batch_idx 0, items: [ 1462 4226 44369 40183]
training step 16, batch_idx 0, items: [53579 10375 22340 4105]
training step 17, batch_idx 0, items: [47785 10585 12661 35176]
training step 18, batch_idx 0, items: [16489 26748 25997 8344]
training step 19, batch_idx 0, items: [38492 45758 56593 37933]
training step 20, batch_idx 0, items: [ 527 4662 29285 26215]
training step 21, batch_idx 0, items: [ 1838 42586 9805 13441]
training step 22, batch_idx 0, items: [12649 11892 4140 56752]
training step 23, batch_idx 0, items: [48902 57464 57910 54211]
training step 24, batch_idx 0, items: [38774 16780 10018 49934]
Note that the warning
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: You requested to overfit but enabled training dataloader shuffling. We are turning it off for you.
is incorrect and very misleading to the user.
Similar issue here.
With the DDP backend the trainer keeps drawing random samples even if shuffle is manually set to False (it works fine with the DP backend).
Other configs/params:
gpus=2
batch_size=1
overfit_batches=4
accumulate_grad_batches=1
num_workers=2 (larger values resulted in the trainer drawing overfit_batches * num_workers * batch_size random samples every epoch, which is hard to follow)
Minimal code for reproduction (I also tried with different numbers of batches, batch sizes, numbers of workers, etc.):
import torch
from torch.nn import Conv2d
from torch.optim import SGD
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning.metrics.regression import MSE
import pytorch_lightning as pl
from pytorch_lightning import Trainer


class MyDataset(Dataset):
    def __init__(self, size=100):
        super(MyDataset, self).__init__()
        # The sample index is encoded in the pixel values so it can be recovered in training_step.
        self.data = torch.stack([idx * torch.ones(3, 100, 100) for idx in range(size)])
        self.idx_list = []

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return self.data.shape[0]


class MyModel(pl.LightningModule):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv_1 = Conv2d(in_channels=3, out_channels=3, kernel_size=1, stride=1)
        self.loss = MSE()
        self.idx_list = []

    def forward(self, batch):
        return self.conv_1(batch)

    def training_step(self, batch, batch_idx):
        idx = batch[0, 0, 0, 0].detach()
        pred = self.forward(batch)
        loss = self.loss(pred, batch)
        return {'loss': loss, 'idx': idx}

    def training_epoch_end(self, outputs):
        idx_list = torch.tensor([x['idx'] for x in outputs])
        print('Epoch: {}, device: {} samples: {}'.format(self.current_epoch, self.device, idx_list))
        return torch.stack([x['loss'] for x in outputs]).mean()

    def setup(self, stage):
        self.dataset = MyDataset()

    def train_dataloader(self):
        loader = DataLoader(self.dataset, batch_size=1, num_workers=20, pin_memory=True, shuffle=False)
        return loader

    def configure_optimizers(self):
        return SGD(self.parameters(), lr=0.001)


def main():
    pl_model = MyModel()
    # trainer = Trainer(distributed_backend='ddp', num_nodes=1, gpus=2, overfit_batches=4)
    trainer = Trainer(distributed_backend='ddp', gpus=2, overfit_batches=5, max_epochs=4, check_val_every_n_epoch=100)
    trainer.fit(pl_model)


if __name__ == '__main__':
    main()
Output (ddp backend):
Epoch: 0, device: cuda:0 samples: tensor([44., 93., 71., 37., 53.])
Epoch: 0, device: cuda:1 samples: tensor([19., 90., 69., 95., 91.])
Epoch: 1, device: cuda:0 samples: tensor([45., 90., 35., 17., 79.])
Epoch: 1, device: cuda:1 samples: tensor([15., 32., 63., 72., 96.])
Epoch: 2, device: cuda:0 samples: tensor([48., 1., 90., 10., 7.])
Epoch: 2, device: cuda:1 samples: tensor([97., 81., 49., 8., 20.])
Epoch: 3, device: cuda:0 samples: tensor([86., 89., 3., 22., 25.])
Epoch: 3, device: cuda:1 samples: tensor([42., 92., 20., 48., 93.])
Output (dp backend):
Epoch: 0, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 1, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 2, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 3, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
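Until this is fixed, one manual workaround (a sketch, not Lightning's own behaviour) is to skip overfit_batches entirely and pin the samples yourself with torch.utils.data.Subset, while telling Lightning not to swap in its own sampler. This assumes the replace_sampler_ddp Trainer flag exists in this Lightning version and reuses MyModel from the snippet above:

from torch.utils.data import DataLoader, Subset
from pytorch_lightning import Trainer

class MyOverfitModel(MyModel):  # MyModel as defined in the snippet above
    def train_dataloader(self):
        # a fixed, ordered subset of 4 samples; without a DistributedSampler
        # every DDP process iterates all of them, so the batches are identical
        fixed = Subset(self.dataset, list(range(4)))
        return DataLoader(fixed, batch_size=1, shuffle=False, num_workers=2)

trainer = Trainer(distributed_backend='ddp', gpus=2, max_epochs=4,
                  replace_sampler_ddp=False)  # keep the deterministic loader as-is
trainer.fit(MyOverfitModel())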
I just looked at it. In summary:
MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091
This message is correct: it is telling you that the percentage you have chosen corresponds to less than one batch.
Solution: you need to increase the value. But what you probably want is overfit_batches=1, which works with exactly one batch, without error.
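Applied to the original Trainer call, that suggestion is simply the following (num_gpus, config, and logger are the poster's own variables):

trainer = Trainer(gpus=num_gpus, max_epochs=config.epochs, overfit_batches=1, logger=logger)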