Pytorch-lightning: overfit_batches doesn't work

Created on 22 Jun 2020 · 15 comments · Source: PyTorchLightning/pytorch-lightning

When I try to use overfit_batches:
https://pytorch-lightning.readthedocs.io/en/latest/debugging.html#make-model-overfit-on-subset-of-data

 trainer = Trainer(gpus=num_gpus, max_epochs=config.epochs, overfit_batches=0.01, logger=logger)

my code fails with:

   trainer.fit(module)
  File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
    self.single_gpu_train(model)
  File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
    self.run_pretrain_routine(model)
  File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1065, in run_pretrain_routine
    self.reset_val_dataloader(ref_model)
  File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 331, in reset_val_dataloader
    self._reset_eval_dataloader(model, 'val')
  File "/home/andriy/miniconda3/envs/patchy_discs_model/lib/python3.7/site-packages/pytorch_lightning/trainer/data_loading.py", line 314, in _reset_eval_dataloader
    f'you requested to check {limit_eval_batches} of the {mode} dataloader but'
pytorch_lightning.utilities.exceptions.MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091

P.S.: I also tried setting limit_val_batches=0.09090909090909091. Same error.

Labels: priority P0 · bug / fix · help wanted

Most helpful comment

I just looked at it. In summary:

  1. OP had this problem:
    MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091
    This message is correct: it is telling you that the percentage you have chosen corresponds to less than one batch.
    Solution:
    You need to increase the value. But what you probably want is overfit_batches=1, which runs with exactly one batch and no error (see the sketch after this list).
  2. The example code by @willprice works on master #3501.
  3. The example by @itsikad shows an issue with DDP. As far as I can tell, this is the only remaining problem in this thread. I can take a look.
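A minimal sketch of that suggested fix, assuming the OP's other Trainer arguments (num_gpus, config.epochs, logger) stay as in the original snippet:

# an integer overfit_batches requests an exact batch count, so there is
# no fractional rounding to trigger the MisconfigurationException
trainer = Trainer(gpus=num_gpus, max_epochs=config.epochs, overfit_batches=1, logger=logger)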

All 15 comments

Did you check whether the length of your dataloader (how many iterations it has) is different from 0?

Not sure what you mean, but I am able to train the regular way... without the overfit_batches setting.

Does the validation step of your model run without any problem?

Oh yeah.

> Not sure what you mean, but I am able to train the regular way... without the overfit_batches setting.

Can you check the value of len(valid_dataloader)?
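For example, a hypothetical check along these lines (assuming the LightningModule defines the val_dataloader hook):

# len() of a DataLoader over a map-style dataset is its number of batches;
# 0 here would explain the "0.01*0 = 0" in the error message
val_loader = module.val_dataloader()
print(len(val_loader))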

I've also tried using overfit_batches on the MNIST dataset and it didn't work. The training continues for 1000 epochs and then stops (hitting max_epochs). I observed the loss fluctuating around 0.1 to 0.2 the whole time, whereas in actual training my model reached train_loss=0.02 in just 4 epochs.

> When I try to use overfit_batches [...] my code fails with:
> MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091

Try using a slightly bigger fraction (e.g. overfit_batches=0.1) or an explicit number of batches (e.g. overfit_batches=10); both forms are sketched below.
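For reference, the Trainer docs describe both forms; a quick sketch:

from pytorch_lightning import Trainer

# a float in (0.0, 1.0] is read as a fraction of the training batches
trainer = Trainer(overfit_batches=0.1)
# an int is read as an absolute number of batches
trainer = Trainer(overfit_batches=10)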

> I've also tried using overfit_batches on the MNIST dataset and it didn't work. [...]

The training probably gets stuck in a local optimum.

> I've also tried using overfit_batches on the MNIST dataset and it didn't work. [...]

overfit_batches just reduces the number of batches so that the model overfits on a small subset, as a check of whether the model can fit your data at all. It will still run for the full max_epochs whatever value you give overfit_batches.

@Kshitij09 you also need to adjust your learning rate... otherwise it might get stuck in a local min (likely lower your lr)

> overfit_batches just reduces the number of batches so that the model overfits on a small subset [...] It will still run for the full max_epochs whatever value you give overfit_batches.

@rohitgr7 so do I need to couple it with early_stop_callback?

> @Kshitij09 you also need to adjust your learning rate... otherwise it might get stuck in a local min (likely lower your lr)

@williamFalcon yes, I've also incorporated ReduceLROnPlateau with this, which dropped the lr down to 1e-8/1e-9, but it didn't stop training.
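If the goal is to stop the run once the overfit loss plateaus, a sketch using the EarlyStopping callback could look like this (hedged: early_stop_callback was the Trainer argument in the 0.8-era API, and 'train_loss' is an assumption that must match a metric the module actually logs):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

# stop when the monitored metric has not improved for `patience` epochs;
# 'train_loss' is an assumption -- use whatever key your module logs
early_stopping = EarlyStopping(monitor='train_loss', patience=10, mode='min')
trainer = Trainer(overfit_batches=10, max_epochs=1000, early_stop_callback=early_stopping)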

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

import os

import torch
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.i = 0  # counts training steps for the debug print in training_step

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, (y, idxs) = batch
        print(f"training step {self.i}, batch_idx {batch_idx}, items: {idxs.cpu().numpy()}")
        self.i += 1
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.TrainResult(loss)
        result.log('train_loss', loss, on_epoch=True)
        return result

    def validation_step(self, batch, batch_idx):
        x, (y, idxs) = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        result = pl.EvalResult(checkpoint_on=loss)
        result.log('val_loss', loss)
        return result

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


class MNISTDataset(MNIST):
    # return the dataset index alongside each (x, y) pair so training_step
    # can print exactly which items each batch contains
    def __getitem__(self, item):
        x, y = super().__getitem__(item)
        return (x, (y, item))


# train!
dataset = MNISTDataset(os.getcwd(), download=True, transform=transforms.ToTensor())
train, val = random_split(dataset, [55000, 5000])

model = LitClassifier()
trainer = pl.Trainer(overfit_batches=1, gpus=1, progress_bar_refresh_rate=0)
trainer.fit(model, DataLoader(train, shuffle=False, batch_size=4, num_workers=0),
            DataLoader(val, batch_size=4, num_workers=0))

Produces

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
warnings.warn(*args, **kwargs)

| Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7 K
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
training step 0, batch_idx 0, items: [13556 55560 49266  5079]
training step 1, batch_idx 0, items: [13556 55560 49266  5079]
training step 2, batch_idx 0, items: [13556 55560 49266  5079]
training step 3, batch_idx 0, items: [13556 55560 49266  5079]
training step 4, batch_idx 0, items: [13556 55560 49266  5079]
training step 5, batch_idx 0, items: [13556 55560 49266  5079]
training step 6, batch_idx 0, items: [13556 55560 49266  5079]
training step 7, batch_idx 0, items: [13556 55560 49266  5079]
training step 8, batch_idx 0, items: [13556 55560 49266  5079]
training step 9, batch_idx 0, items: [13556 55560 49266  5079]
training step 10, batch_idx 0, items: [13556 55560 49266  5079]
training step 11, batch_idx 0, items: [13556 55560 49266  5079]
training step 12, batch_idx 0, items: [13556 55560 49266  5079]
training step 13, batch_idx 0, items: [13556 55560 49266  5079]
training step 14, batch_idx 0, items: [13556 55560 49266  5079]
training step 15, batch_idx 0, items: [13556 55560 49266  5079]
training step 16, batch_idx 0, items: [13556 55560 49266  5079]
training step 17, batch_idx 0, items: [13556 55560 49266  5079]
training step 18, batch_idx 0, items: [13556 55560 49266  5079]
training step 19, batch_idx 0, items: [13556 55560 49266  5079]
training step 20, batch_idx 0, items: [13556 55560 49266  5079]
training step 21, batch_idx 0, items: [13556 55560 49266  5079]
training step 22, batch_idx 0, items: [13556 55560 49266  5079]
training step 23, batch_idx 0, items: [13556 55560 49266  5079]
training step 24, batch_idx 0, items: [13556 55560 49266  5079]

Toggling shuffle=True results in

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
warnings.warn(*args, **kwargs)

| Name | Type   | Params
--------------------------------
0 | l1   | Linear | 7 K
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: You requested to overfit but enabled training dataloader shuffling. We are turning it off for you.
warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 104 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
warnings.warn(*args, **kwargs)
training step 0, batch_idx 0, items: [19838 40946 18620 21942]
training step 1, batch_idx 0, items: [59940 37871 44153  6899]
training step 2, batch_idx 0, items: [53096 12012 22479 27454]
training step 3, batch_idx 0, items: [12640 55771 22517 27844]
training step 4, batch_idx 0, items: [35820 56735 43191 58511]
training step 5, batch_idx 0, items: [35818 38129  4901 46901]
training step 6, batch_idx 0, items: [21038 14631 15166 15581]
training step 7, batch_idx 0, items: [ 7095 15539  8672 39255]
training step 8, batch_idx 0, items: [ 6397 24324 27822 53308]
training step 9, batch_idx 0, items: [ 7261 45991 58502 38393]
training step 10, batch_idx 0, items: [50646 43129  4348 32436]
training step 11, batch_idx 0, items: [11271 13858 11991 43261]
training step 12, batch_idx 0, items: [29346 42714 52281 36790]
training step 13, batch_idx 0, items: [21324 32598 43017  8024]
training step 14, batch_idx 0, items: [30809 50140  5554 36657]
training step 15, batch_idx 0, items: [ 1462  4226 44369 40183]
training step 16, batch_idx 0, items: [53579 10375 22340  4105]
training step 17, batch_idx 0, items: [47785 10585 12661 35176]
training step 18, batch_idx 0, items: [16489 26748 25997  8344]
training step 19, batch_idx 0, items: [38492 45758 56593 37933]
training step 20, batch_idx 0, items: [  527  4662 29285 26215]
training step 21, batch_idx 0, items: [ 1838 42586  9805 13441]
training step 22, batch_idx 0, items: [12649 11892  4140 56752]
training step 23, batch_idx 0, items: [48902 57464 57910 54211]
training step 24, batch_idx 0, items: [38774 16780 10018 49934]

Note that the warning

/home/willprice/.conda/envs/pytorch/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:37: UserWarning: You requested to overfit but enabled training dataloader shuffling. We are turning it off for you.

is incorrect and very misleading to the user.

Similar issue here.

With the DDP backend the trainer keeps drawing random samples even when shuffle is manually set to False (it works fine with the DP backend).

Other configs/params:
gpus=2
batch_size=1
overfit_batches=4
accumulate_grad_batches=1
num_workers=2 (larger values resulted in the trainer drawing overfit_batches * num_workers * batch_size random samples every epoch, which is hard to follow)

Minimal code for reproduction (I also tried different numbers of batches, batch sizes, numbers of workers, etc.):

import torch
from torch.nn import Conv2d
from torch.optim import SGD
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning.metrics.regression import MSE
import pytorch_lightning as pl
from pytorch_lightning import Trainer


class MyDataset(Dataset):
    def __init__(self, size=100):
        super(MyDataset, self).__init__()
        # each sample is a constant image whose pixel value encodes its index,
        # so the index can be recovered from the batch contents in training_step
        self.data = torch.stack([idx * torch.ones(3, 100, 100) for idx in range(size)])

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return self.data.shape[0]

class MyModel(pl.LightningModule):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv_1 = Conv2d(in_channels=3, out_channels=3, kernel_size=1, stride=1)
        self.loss = MSE()

    def forward(self, batch):
        return self.conv_1(batch)

    def training_step(self, batch, batch_idx):
        idx = batch[0, 0, 0, 0].detach()  # recover the sample index encoded in the data
        pred = self.forward(batch)
        loss = self.loss(pred, batch)
        return {'loss': loss, 'idx': idx}

    def training_epoch_end(self, outputs):
        idx_list = torch.tensor([x['idx'] for x in outputs])
        print('Epoch: {}, device: {} samples: {}'.format(self.current_epoch, self.device, idx_list))
        return torch.stack([x['loss'] for x in outputs]).mean()

    def setup(self, stage):
        self.dataset = MyDataset()

    def train_dataloader(self):
        loader = DataLoader(self.dataset, batch_size=1, num_workers=20, pin_memory=True, shuffle=False)
        return loader

    def configure_optimizers(self):
        return SGD(self.parameters(), lr=0.001)


def main():

    pl_model = MyModel()
    # trainer = Trainer(distributed_backend='ddp', num_nodes=1, gpus=2, overfit_batches=4)
    trainer = Trainer(distributed_backend='ddp', gpus=2, overfit_batches=5, max_epochs=4, check_val_every_n_epoch=100)
    trainer.fit(pl_model)


if __name__ == '__main__':
    main()

Output (ddp backend):

Epoch: 0, device: cuda:0 samples: tensor([44., 93., 71., 37., 53.])
Epoch: 0, device: cuda:1 samples: tensor([19., 90., 69., 95., 91.])

Epoch: 1, device: cuda:0 samples: tensor([45., 90., 35., 17., 79.])
Epoch: 1, device: cuda:1 samples: tensor([15., 32., 63., 72., 96.])

Epoch: 2, device: cuda:0 samples: tensor([48., 1., 90., 10., 7.])
Epoch: 2, device: cuda:1 samples: tensor([97., 81., 49., 8., 20.])

Epoch: 3, device: cuda:0 samples: tensor([86., 89., 3., 22., 25.])
Epoch: 3, device: cuda:1 samples: tensor([42., 92., 20., 48., 93.])

Output (dp backend):

Epoch: 0, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 1, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 2, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
Epoch: 3, device: cuda:0 samples: tensor([0., 1., 2., 3., 4.])
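The shuffled DDP output is consistent with torch's DistributedSampler, which Lightning injects under DDP and which shuffles by default. A hedged workaround sketch, not a fix for the underlying bug (it assumes the replace_sampler_ddp Trainer flag and the MyModel/MyDataset setup above):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# inside MyModel:
def train_dataloader(self):
    # build the distributed sampler ourselves with shuffle=False so each
    # rank sees the same (strided) subset of samples every epoch
    sampler = DistributedSampler(self.dataset, shuffle=False)
    return DataLoader(self.dataset, batch_size=1, sampler=sampler)

# and tell Lightning not to swap the sampler back in:
# trainer = Trainer(distributed_backend='ddp', gpus=2, overfit_batches=5, replace_sampler_ddp=False)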

I just looked at it. In summary:

  1. OP had this problem:
    MisconfigurationException: you requested to check 0.01 of the val dataloader but 0.01*0 = 0. Please increase the limit_val_batches. Try at least limit_val_batches=0.09090909090909091
    This message is correct: it is telling you that the percentage you have chosen corresponds to less than one batch.
    Solution:
    You need to increase the value. But what you probably want is overfit_batches=1, which runs with exactly one batch and no error.
  2. The example code by @willprice works on master #3501.
  3. The example by @itsikad shows an issue with DDP. As far as I can tell, this is the only remaining problem in this thread. I can take a look.
