Pytorch-lightning: Correctly using `ReduceLROnPlateau`

Created on 9 Jan 2020 · 9 comments · Source: PyTorchLightning/pytorch-lightning

Hello all, I'm trying to use the ReduceLROnPlateau learning rate scheduler, but I'm not sure I'm implementing it correctly; it doesn't seem to be working as expected.

I am essentially using the same code as the Colab MNIST tutorial (I ran this in Colab):

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

class MNISTModel(pl.LightningModule):

    def __init__(self):
        super(MNISTModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        print(avg_loss)
        return {'val_loss': avg_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        optimizer = torch.optim.Adam(self.parameters(), lr=0.02)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                               mode='min',
                                                               factor=0.2,
                                                               patience=2,
                                                               min_lr=1e-6,
                                                               verbose=True)
        return [optimizer], [scheduler]


    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()), batch_size=512)

The only differences besides the configure_optimizers() method are the batch size (512 vs. 32 originally) and the added printing (though I don't see how either of these would affect the scheduler's behavior).

Question: does the scheduler here automatically receive val_loss computed in the validation_end() step? I've tried running the above code using both avg_val_loss and val_loss as keys in the dictionary returned by validation_end(), and it does not seem to make a difference.

Although the average validation loss appears to decrease monotonically, the LR scheduler keeps reducing the learning rate:

tensor(2.3137, device='cuda:0')
tensor(0.6615, device='cuda:0')

/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer_io.py:210: UserWarning: Did not find hyperparameters at model.hparams. Saving checkpoint without hyperparameters
  "Did not find hyperparameters at model.hparams. Saving checkpoint without"

tensor(0.6424, device='cuda:0')
tensor(0.6337, device='cuda:0')
tensor(0.6291, device='cuda:0')
Epoch     3: reducing learning rate of group 0 to 4.0000e-03.
tensor(0.6162, device='cuda:0')
tensor(0.6147, device='cuda:0')
tensor(0.6137, device='cuda:0')
Epoch     6: reducing learning rate of group 0 to 8.0000e-04.
tensor(0.6121, device='cuda:0')
tensor(0.6118, device='cuda:0')
tensor(0.6115, device='cuda:0')
Epoch     9: reducing learning rate of group 0 to 1.6000e-04.
tensor(0.6114, device='cuda:0')
tensor(0.6113, device='cuda:0')
tensor(0.6113, device='cuda:0')
Epoch    12: reducing learning rate of group 0 to 3.2000e-05.

Could anyone kindly advise as to how to correctly implement the scheduler? Thank you.

Edit: I forgot to attach the code for the Trainer portion, but it is also essentially the same as in the example.

mnist_model = MNISTModel()

# most basic trainer, uses good defaults (1 gpu)
trainer = pl.Trainer(gpus=1, show_progress_bar=False)    
trainer.fit(mnist_model)   


All 9 comments

I'm not sure it does, but this is what I've done to solve the issue:

In MNISTModel, implement optimizer_step (along with the supporting hooks) like so:

class MNISTModel(pl.LightningModule):
    ...  # unrelated parts of the commenter's module (imports, loss, config) omitted
    def validation_step(self, batch, batch_nb):
        out = self.forward(batch)
        winners = out.argmax(dim=-1)
        correct = (winners == batch.label)
        accuracy = correct.sum().float() / float(correct.size(0))
        return {'val_loss': self.loss(out, batch.label),
                'val_accuracy': accuracy}

    def validation_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_accuracy'] for x in outputs]).mean()
        print(f'Avg val loss: {avg_loss}, Avg val accuracy: {avg_acc}')
        res = {'avg_val_loss': avg_loss,
               'avg_val_accuracy': avg_acc}
        wandb.log(res)

        self.current_val_loss = avg_loss  # save current val loss state for ReduceLROnPlateau scheduler

        return res

    def configure_optimizers(self):
        self.opt = RAdam(self.siamese.parameters(),
                         lr=self.config.lr,
                         betas=self.config.betas,
                         eps=self.config.eps,
                         weight_decay=self.config.weight_decay,
                         degenerated_to_sgd=True)

        self.reduce_lr_on_plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.opt,
            mode='min',
            factor=0.1,
            patience=5,
            verbose=True,
            cooldown=5,
            min_lr=1e-8,
        )

        return [self.opt], [self.reduce_lr_on_plateau]

    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, second_order_closure=None):
        self.opt.step()
        self.opt.zero_grad()
        if self.trainer.global_step % self.config.val_check_interval == 0:
            self.reduce_lr_on_plateau.step(self.current_val_loss)

So what this does is: optimizer_step() gets called every training iteration. Thus, if you want to call self.reduce_lr_on_plateau once per epoch, you set val_check_interval to the number of steps in an epoch (i.e., len(train_dataloader)).
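For concreteness, here is a minimal runnable sketch of that timing logic (the numbers are illustrative, not from the thread): optimizer_step() runs once per batch, so firing the scheduler "once per epoch" means firing it every len(train_dataloader) global steps.

import math

steps_per_epoch = math.ceil(60000 / 512)  # e.g. 118 for MNIST at batch_size=512
for global_step in range(1, 3 * steps_per_epoch + 1):
    if global_step % steps_per_epoch == 0:  # true exactly once per epoch
        print(f'step {global_step}: would call reduce_lr_on_plateau.step(current_val_loss)')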

Let me know if I wasn't as clear as I hoped.

Hello, thank you for your comment and your help.

I tried running the code with the modifications you suggested, but the behavior seems to be the same. Although the validation loss keeps decreasing, the scheduler keeps reducing the learning rate. (The interval between LR reductions agrees with the patience, but the scheduler seems to treat the loss as if it were getting worse.)

$ python mnist.py
...
tensor(2.2982, device='cuda:0')
tensor(1.0003, device='cuda:0')
/home/marc/.pyenv/versions/anaconda3-5.1.0/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_io.py:210: UserWarning: Did not find hyperparameters at model.hparams. Saving checkpoint without hyperparameters
  "Did not find hyperparameters at model.hparams. Saving checkpoint without"
tensor(0.9753, device='cuda:0')
Epoch     4: reducing learning rate of group 0 to 4.0000e-03.
tensor(0.9639, device='cuda:0')
tensor(0.9605, device='cuda:0')
Epoch     7: reducing learning rate of group 0 to 8.0000e-04.
tensor(0.9582, device='cuda:0')
...

Here is the code I am using:

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

class MNISTModel(pl.LightningModule):

    def __init__(self):
        super(MNISTModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.val_check_interval = len(self.train_dataloader())
        self.current_val_loss = torch.tensor(float('inf'), device=(
            'cuda' if self.on_gpu else 'cpu'))

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        self.current_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': self.current_val_loss}
        print(self.current_val_loss.data)
        return {'val_loss': self.current_val_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        self.opt = torch.optim.Adam(self.parameters(), lr=0.02)
        self.reduce_lr_on_plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.opt,
            mode='max',
            factor=0.2,
            patience=2,
            min_lr=1e-6,
            verbose=True
        )

        return [self.opt], [self.reduce_lr_on_plateau]


    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()), batch_size=512)

    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i,
                       second_order_closure=None):
        self.opt.step()
        self.opt.zero_grad()
        if self.trainer.global_step % self.val_check_interval == 0:
            self.reduce_lr_on_plateau.step(self.current_val_loss)

mnist_model = MNISTModel()

# most basic trainer, uses good defaults (1 gpu)
trainer = pl.Trainer(gpus=1, show_progress_bar=False)    
trainer.fit(mnist_model)

Hi,

This is because you are on mode='max', I believe. This means that if the metric the scheduler is conditioned on (validation loss in your case) _decreases_, then the scheduler will decrease the LR.

To fix the issue, set mode='min' in the scheduler parameters.
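For what it's worth, here is a minimal, self-contained sketch (plain PyTorch, no Lightning; the metric values are made up) showing how mode changes the scheduler's direction. With mode='max', a decreasing metric counts as "no improvement", so the LR is cut even while the loss falls:

import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.02)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='max', factor=0.2, patience=2)

for epoch, val_loss in enumerate([2.31, 0.66, 0.64, 0.63, 0.62, 0.61]):
    sched.step(val_loss)  # mode='max' expects the metric to increase
    print(epoch, opt.param_groups[0]['lr'])
# prints lr=0.02 for epochs 0-2, then 0.004 from epoch 3 onward;
# with mode='min' the same sequence counts as steady improvement and the LR never drops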

Hi,

Thank you. You're absolutely right. Apologies for the error.

However, I changed the value to 'min' and the issue still persists:

tensor(2.2994, device='cuda:0')
tensor(0.6615, device='cuda:0')
/home/marc/.pyenv/versions/anaconda3-5.1.0/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_io.py:210: UserWarning: Did not find hyperparameters at model.hparams. Saving checkpoint without hyperparameters
  "Did not find hyperparameters at model.hparams. Saving checkpoint without"
tensor(0.6420, device='cuda:0')
Epoch     4: reducing learning rate of group 0 to 4.0000e-03.
tensor(0.6278, device='cuda:0')
tensor(0.6254, device='cuda:0')
Epoch     7: reducing learning rate of group 0 to 8.0000e-04.
tensor(0.6237, device='cuda:0')
Epoch    10: reducing learning rate of group 0 to 1.6000e-04.
tensor(0.6235, device='cuda:0')
tensor(0.6233, device='cuda:0')
Epoch    13: reducing learning rate of group 0 to 3.2000e-05.

Code:

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms
import pytorch_lightning as pl

class MNISTModel(pl.LightningModule):

    def __init__(self):
        super(MNISTModel, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.val_check_interval = len(self.train_dataloader())
        self.current_val_loss = torch.tensor(float('inf'), device=(
            'cuda' if self.on_gpu else 'cpu'))

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        self.current_val_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': self.current_val_loss}
        print(self.current_val_loss.data)
        return {'val_loss': self.current_val_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_nb):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': logs, 'progress_bar': logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        self.opt = torch.optim.Adam(self.parameters(), lr=0.02)
        self.reduce_lr_on_plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
            self.opt,
            mode='min',
            factor=0.2,
            patience=2,
            min_lr=1e-6,
            verbose=True
        )

        return [self.opt], [self.reduce_lr_on_plateau]


    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=True, download=True, transform=transforms.ToTensor()), batch_size=512)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        return DataLoader(MNIST(os.getcwd(), train=False, download=True, transform=transforms.ToTensor()), batch_size=512)

    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i,
                       second_order_closure=None):
        self.opt.step()
        self.opt.zero_grad()
        if self.trainer.global_step % self.val_check_interval == 0:
            self.reduce_lr_on_plateau.step(self.current_val_loss)

mnist_model = MNISTModel()

# most basic trainer, uses good defaults (1 gpu)
trainer = pl.Trainer(gpus=1, show_progress_bar=False)    
trainer.fit(mnist_model)

Weird ...

My only suggestion would be to look through the source code to see what's going on under the hood.

@balsamfelder any discoveries? if you still have issues we can reopen. perhaps a tutorial is in order here?

To get the intended behavior, do not return the scheduler in configure_optimizers, e.g.:

def configure_optimizers(self):
    optimizer = favorite_optimizer
    self.scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    return optimizer

def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, second_order_closure=None):
    if batch_nb == 0:  # to call the scheduler after each validation
        self.scheduler.step(self.favorite_metric)
        print(f'metric: {self.favorite_metric}, best: {self.scheduler.best}, '
              f'num_bad_epochs: {self.scheduler.num_bad_epochs}')  # for debugging
    optimizer.step()
    optimizer.zero_grad()

Otherwise, scheduler.step gets called with val_loss as the metric (or even twice, for different metrics, in your example); see https://github.com/PyTorchLightning/pytorch-lightning/blob/b35c472bb17d170102fd0b987655462b7e3304d3/pytorch_lightning/trainer/training_loop.py#L339-L346
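To make the consequence concrete, here is a standalone sketch (plain PyTorch, illustrative metric values) of that double-stepping: if the trainer loop and a manual optimizer_step() each call scheduler.step(val_loss) once per epoch, every stagnant epoch is counted twice, so patience runs out twice as fast:

import torch

def run(calls_per_epoch):
    param = torch.nn.Parameter(torch.zeros(1))
    opt = torch.optim.SGD([param], lr=0.02)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, mode='min', factor=0.2, patience=2)
    for val_loss in [0.66, 0.64, 0.64, 0.64]:
        for _ in range(calls_per_epoch):
            sched.step(val_loss)  # duplicate calls see the same metric again
    return opt.param_groups[0]['lr']

print(run(calls_per_epoch=1))  # 0.02  -- still within patience, no reduction
print(run(calls_per_epoch=2))  # 0.004 -- same metrics, but patience already exhausted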

Yep - as a quick fix we could add an option to pl.Trainer for the metric to condition on, i.e. plateau_metric='val_loss' or plateau_metric='auc', etc.

Right now I need to trick PL by doing this at the end of validation_end: return {"val_loss": auc}
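A hypothetical sketch of that trick (names are illustrative; note the scheduler must then use mode='max', since a higher AUC is better):

def validation_end(self, outputs):
    # report the metric you actually want to condition on under the 'val_loss'
    # key, since that is what the training loop passes to scheduler.step()
    auc = torch.stack([x['auc'] for x in outputs]).mean()  # assumes validation_step returns 'auc'
    return {'val_loss': auc, 'log': {'val_auc': auc}}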

The threshold_mode parameter of ReduceLROnPlateau might explain the phenomenon. With threshold_mode='rel' (the default), a new value only counts as an improvement if it beats the best value by a relative threshold; if the loss is not decreasing "significantly", the learning rate will still be reduced.
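A quick runnable illustration of that effect (illustrative values): with the default threshold_mode='rel' and threshold=1e-4, a new loss only counts as better if it is below best * (1 - 1e-4), so the tiny monotone decreases below all register as bad epochs and the LR gets cut:

import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.02)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(
    opt, mode='min', factor=0.2, patience=2, threshold=1e-4, threshold_mode='rel')

for val_loss in [0.61140, 0.61138, 0.61136, 0.61134]:  # decreasing, but not "significantly"
    sched.step(val_loss)
    print(val_loss, sched.num_bad_epochs, opt.param_groups[0]['lr'])
# the last step reduces the LR to 0.004 even though the loss never increased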
