I find that the logging design has changed a lot between version 0.8.5 and the master branch (0c26468).
I got the following error message when I followed the docs (logging-from-a-lightningmodule) and modified my logging code.
error message:
MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: val_early_stop_on,val_checkpoint_on,epoch,checkpoint_on. Condition can be set using `monitor` key in lr scheduler dict
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=0.01)
    scheduler = ReduceLROnPlateau(optimizer, patience=10)
    return [optimizer], [scheduler]

def validation_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y, y_hat)
    result = pl.EvalResult()
    result.log('val_step_loss', loss)
    return result

def validation_epoch_end(self, outputs):
    avg_loss = outputs.val_step_loss.mean()
    result = pl.EvalResult()
    result.log('val_loss', avg_loss)
    return result
I wonder whether I should define validation_epoch_end like the above. Are there any examples of how to use ReduceLROnPlateau the right way?
I think it's a bug, since callback_metrics contains only val_early_stop_on and val_checkpoint_on.
https://github.com/PyTorchLightning/pytorch-lightning/blob/5bce06c05023b9798c42533bc1e7e5868930dcdb/pytorch_lightning/core/step_result.py#L727-L733
https://github.com/PyTorchLightning/pytorch-lightning/blob/5bce06c05023b9798c42533bc1e7e5868930dcdb/pytorch_lightning/trainer/training_loop.py#L1254-L1263
yeah, that's a bug with the new version. adding a fix now.
What should we condition it on? I guess the obvious choice is the checkpoint key?
result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)
This does assume the scheduler currently adjusts based on the val loop and not the train loop?
result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)
In that case, if someone assigns checkpoint_on to be a metric with EvalResult(checkpoint_on=some_metric_tensor) and sets ReduceLROnPlateau to monitor val_loss, then there might be a conflict here.
The point is that when using ReduceLROnPlateau there is no longer a "val_loss"... it will monitor whatever the value of checkpoint_on is.
val_loss is just an example. My point is: if checkpoint_on=metric1 and monitor=metric2 for ReduceLROnPlateau, then there's a conflict. We should not force ReduceLROnPlateau to monitor metric1.
haha, I think you're still missing the point.
The keyword 'monitor' has no effect when using EvalResult... instead, ReduceLROnPlateau will look at whatever is in checkpoint_on.
You could set monitor='giraffe' for ReduceLROnPlateau and it wouldn't matter; Lightning will use whatever is in checkpoint_on=X.
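Roughly, the behavior being described looks like this (only a sketch; the module skeleton, learning rate, and the 'giraffe' string are illustrative, not from the thread):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class SketchModule(pl.LightningModule):
    # hypothetical module, just to show where checkpoint_on and 'monitor' sit
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # whatever is passed as checkpoint_on is what ReduceLROnPlateau ends up tracking
        return pl.EvalResult(checkpoint_on=loss)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
            'reduce_on_plateau': True,
            'monitor': 'giraffe',  # ignored with EvalResult; checkpoint_on wins
        }
        return [optimizer], [scheduler]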
Yeah, maybe I am missing something here 😅 but I'm not sure how this will work if we take the checkpoint_on value for ReduceLROnPlateau:
model_checkpoint = ModelCheckpoint(monitor='val_loss')
...

def validation_epoch_end(self, outputs):
    val_loss = ...
    val_recall = ...
    # not familiar with EvalResult, but I guess checkpoint_on will be used for ModelCheckpoint
    res = pl.EvalResult(checkpoint_on=val_loss)
    res.log('val_recall', val_recall)
    return res

def configure_optimizers(self):
    optimizer = ...
    scheduler = {'scheduler': ReduceLROnPlateau(optimizer), 'interval': 'epoch', 'monitor': 'val_recall'}
    return [optimizer], [scheduler]
Again... monitor has NO effect anywhere with the Result objects. It doesn't matter which callback uses the word.
Had the same question so arrived here.
@williamFalcon, I think what @rohitgr7 means is that there might be cases where someone wishes to use ReduceLROnPlateau on metric1 and to save checkpoints on metric2.
i.e., I wish to use ReduceLROnPlateau on train_loss (to allow the network to (over)fit in case the lr is not low enough) and use checkpoint_on='val_acc' to save the best model during training.
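To make the use case concrete, here is a rough sketch of the configuration I'd like to be able to express (names, numbers, and the dict-style hooks are only illustrative; whether this actually decouples the two monitors under the current Result API is exactly what's in question):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class DecoupledMonitors(pl.LightningModule):
    # hypothetical example: LR reduction tracks train_loss, checkpointing tracks val_acc
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(dim=1) == y).float().mean()
        return {'val_acc': acc}

    def validation_epoch_end(self, outputs):
        avg_acc = torch.stack([o['val_acc'] for o in outputs]).mean()
        return {'val_acc': avg_acc, 'log': {'val_acc': avg_acc}}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
            'monitor': 'train_loss',  # reduce LR when training loss plateaus
        }
        return [optimizer], [scheduler]

# checkpointing keyed on a different metric than the scheduler
checkpoint_callback = ModelCheckpoint(monitor='val_acc', mode='max')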
OK, then we should allow the logged values to be fed as options to the callbacks?
Not sure myself (I don't know the implementation details). I'd love it if you could tag someone who could answer that better.
Ido
haha ok. yeah, i think that’s the sensible option since this allows any metric to be monitored by any callback in the future
Yeah, that will solve issues with ModelCheckpoint too. Also, why even add checkpoint_on/early_stop_on as parameters there? Can't we just use log itself with checkpoint=True/False, just like on_epoch/on_step, if we feed the logs to the callbacks? Just a suggestion.
Because we need a single, unique value to checkpoint/early stop on; with log we can't enforce that. Also, if you want to change what to checkpoint on during training, this approach allows that.
i.e. in some tasks, I may use the loss for a while, but switch to a metric after a certain number of epochs.
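For example, something along these lines (a sketch only; the epoch threshold, metrics, and the sign flip are made up for illustration, and this sits inside the LightningModule):

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.cross_entropy(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    # checkpoint (and therefore plateau-track) the loss for the first 10 epochs,
    # then switch to accuracy (negated so "lower is better" still holds)
    target = loss if self.current_epoch < 10 else -acc
    result = pl.EvalResult(checkpoint_on=target)
    result.log('val_loss', loss)
    result.log('val_acc', acc)
    return result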
Then maybe take the first value in case multiple checkpoint values are passed with log, and raise a warning there?
result.log('metric1', metric1_val, checkpoint=True)
result.log('metric2', metric2_val, checkpoint=True)
i.e. take metric1_val for the checkpoint.
let's go with the current approach in the API and iterate on it if it causes issues.
I am still confused about how to use ReduceLROnPlateau; are there any simple examples that explain it?
@williamFalcon I checked #3004 and found that the monitor only seems to pick up metrics logged in training_step, not validation_step. Here is the code running in Colab (modified from the MNIST Hello World example).
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy


class LitMNIST(pl.LightningModule):
    def __init__(self, data_dir='./', hidden_size=64, learning_rate=2e-4):
        super().__init__()

        # Set our init args as class attributes
        self.data_dir = data_dir
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

        # Hardcode some dataset specific attributes
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])

        # Define PyTorch model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes)
        )

    def forward(self, x):
        x = self.model(x)
        return F.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        result = pl.TrainResult(loss)
        result.log('train_loss', loss)
        return result

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)
        result = pl.EvalResult(checkpoint_on=loss)
        # Calling result.log will surface up scalars for you in TensorBoard
        result.log('val_loss', loss, prog_bar=True)
        result.log('val_acc', acc, prog_bar=True)
        return result

    def test_step(self, batch, batch_idx):
        # Here we just reuse the validation_step for testing
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        scheduler = {'scheduler': lr_scheduler, 'interval': 'step', 'monitor': 'val_loss'}
        return [optimizer], [scheduler]

    ####################
    # DATA RELATED HOOKS
    ####################

    def prepare_data(self):
        # download
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == 'fit' or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == 'test' or stage is None:
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=32)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=32)
model = LitMNIST()
trainer = pl.Trainer(gpus=1, max_epochs=3, fast_dev_run=False, progress_bar_refresh_rate=20)
trainer.fit(model)
@invisprints in case you haven't figured it out, the note sections in this doc would be helpful for you: https://pytorch-lightning.readthedocs.io/en/stable/lightning-module.html#configure-optimizers
So your configure_optimizers() should be something like:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    # reduce every epoch (default)
    scheduler = {
        'scheduler': lr_scheduler,
        'reduce_on_plateau': True,
        # val_checkpoint_on is val_loss passed in as checkpoint_on
        'monitor': 'val_checkpoint_on'
    }
    return [optimizer], [scheduler]
@yukw777 isn't *_checkpoint_on specific to checkpoints? I mean, one must set the ReduceLROnPlateau monitor equal to whatever the checkpoint monitor is? They can't be different?
@yukw777 @rohitgr7 yep, I mean: if I want to monitor something different from the checkpoint_on monitor, what should I do? Because in the test code, it seems we can monitor anything we want, and it works in the training step but not in the val step.
Decided to rewrite my code for the new API, but still didn't get how to use ReduceLROnPlateau :(
I got it:
def training_step(self, batch, batch_idx):
    loss = self.calc_loss(batch)
    res = pl.TrainResult(loss)
    res.log("train_loss", loss, prog_bar=True)
    return res

def validation_step(self, batch, batch_idx):
    loss = self.calc_loss(batch)
    res = pl.EvalResult(checkpoint_on=loss)
    res.log("val_loss", loss, prog_bar=True)
    return res

def configure_optimizers(self):
    optimizer = torch.optim.Adam(params=self.parameters(),
                                 lr=self.hparams.lr,
                                 weight_decay=self.hparams.l2_norm)
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=10, factor=0.9, verbose=True)
    scheduler = {
        'scheduler': lr_scheduler,
        'reduce_on_plateau': True,
        # val_checkpoint_on is val_loss passed in as checkpoint_on
        'monitor': 'val_checkpoint_on'
    }
    return [optimizer], [scheduler]
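For anyone else landing here: a quick way to check which keys a scheduler or callback can actually monitor is to inspect trainer.callback_metrics after a short run (a sketch only, assuming the LitMNIST module posted earlier in this thread):

model = LitMNIST()
trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)
print(trainer.callback_metrics.keys())
# with EvalResult(checkpoint_on=loss), expect keys like the ones from the error above:
# 'val_early_stop_on', 'val_checkpoint_on', 'epoch', 'checkpoint_on'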
@invisprints did your question get answered?
No, what I want is to be able to monitor anything we want, not just val_checkpoint_on.
marking as duplicate.