Pytorch-lightning: How to use the ReduceLROnPlateau method in the master branch version?

Created on 14 Aug 2020 · 26 Comments · Source: PyTorchLightning/pytorch-lightning

What is your question?

I find that the logging design has changed a lot between version 0.8.5 and the master branch (0c26468).
I got the error message below when I followed the docs (logging-from-a-lightningmodule) to modify my logging code.

error message:

MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: val_early_stop_on,val_checkpoint_on,epoch,checkpoint_on. Condition can be set using `monitor` key in lr scheduler dict

Code

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr = 0.01)
    scheduler = ReduceLROnPlateau(optimizer, patience=10)
    return [optimizer], [scheduler]

def validation_step(self, batch, batch_nb):
    x, y = batch

    y_hat = self(x)    
    loss = F.l1_loss(y, y_hat)
    result = pl.EvalResult()
    result.log('val_step_loss', loss)
    return result

def validation_epoch_end(self, outputs):
    avg_loss = outputs.val_step_loss.mean()
    result = pl.EvalResult()
    result.log('val_loss', avg_loss)
    return result

I wonder whether I should define validation_epoch_end as above, and whether there is any example of how to use ReduceLROnPlateau the right way?

What's your environment?

  • OS: Ubuntu 18.04
  • Packaging: pip
  • Version: 0.9.0rc12 (master branch 0c26468)
Labels: Metrics, bug / fix, duplicate

Most helpful comment

I got it


    def training_step(self, batch, batch_idx):
        loss = self.calc_loss(batch)

        res = pl.TrainResult(loss)
        res.log("train_loss", loss, prog_bar=True)

        return res

    def validation_step(self, batch, batch_idx):
        loss = self.calc_loss(batch)

        res = pl.EvalResult(checkpoint_on=loss)
        res.log("val_loss", loss, prog_bar=True)
        return res

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(params=self.parameters(),
                                     lr=self.hparams.lr,
                                     weight_decay=self.hparams.l2_norm)

        lr_scheduler = ReduceLROnPlateau(optimizer, patience=10, factor=0.9, verbose=True)

        scheduler = {
            'scheduler': lr_scheduler,
            'reduce_on_plateau': True,
            # val_checkpoint_on is val_loss passed in as checkpoint_on
            'monitor': 'val_checkpoint_on'
        }
        return [optimizer], [scheduler]

All 26 comments

yeah, that's a bug with the new version. adding a fix now.

What should we condition it on? i guess the clear thing maybe is on the checkpoint key?

result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)

this does assume the scheduler currently adjusts using val loop and not train loop?

result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)

in that case, if someone assigns checkpoint_on to be a metric, EvalResult(checkpoint_on=some_metric_tensor), and sets ReduceLROnPlateau to monitor val_loss, then there might be a conflict here.

the point is that when using ReduceLROnPlateau there is no longer "val_loss"... it will monitor whatever the value of checkpoint_on is

val_loss is just an example. My point here is that if checkpoint_on=metric1 and monitor=metric2 for ReduceLROnPlateau, then there's a conflict. We should not force ReduceLROnPlateau to monitor metric1.

haha. i think you're still missing the point.

the keyword 'monitor' does not have an effect when using evalresults... instead, the ReduceLROnPlateau will look at whatever is on the checkpoint_on

You could set monitor='jiraffe' for ReduceLROnPlateau and it won't matter.
Lightning will use whatever is in checkpoint_on=X
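A minimal sketch of what this means in practice on the 0.9 Result API (the loss, optimizer settings, and scheduler-dict keys here are illustrative and simply mirror the fuller example later in the thread):

def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = F.l1_loss(self(x), y)
    # ReduceLROnPlateau will step on whatever value is passed as checkpoint_on,
    # not on the 'monitor' key of the scheduler dict
    result = pl.EvalResult(checkpoint_on=loss)
    result.log('val_loss', loss)
    return result

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=0.01)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
    # 'reduce_on_plateau': True tells Lightning to pass the monitored value to scheduler.step()
    return [optimizer], [{'scheduler': lr_scheduler, 'reduce_on_plateau': True}]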

yeah, maybe I am missing something here 😅 but not sure how this will work if we take checkpoint_on value for ReduceLROnPlateau:

model_checkpoint = ModelCheckpoint(monitor='val_loss')

...

def validation_epoch_end(self, outputs):
    val_loss = ...
    val_recall = ...
    res = pl.EvalResult(checkpoint_on=val_loss)  # not familiar with EvalResult, but I guess checkpoint_on will be used for ModelCheckpoint
    res.log('val_recall', val_recall)
    return res

def configure_optimizers(self):
    optimizer = ...
    scheduler = {'scheduler': ReduceLROnPlateau(optimizer), 'interval': 'epoch', 'monitor': 'val_recall'}
    return [optimizer], [scheduler]

again... monitor has NO effect anywhere with the results object. It doesn't matter which callback uses the word.

Had the same question so arrived here.

@williamFalcon, I think what @rohitgr7 means is that there might be cases where someone wishes to use ReduceLROnPlateau on metric1 and to save checkpoints on metric2.

e.g., I wish to use ReduceLROnPlateau on train_loss (to allow the network to (over)fit in case the lr is not low enough) and use checkpoint_on='val_acc' to save the best model during training.

ok, then we should allow feeding the values logged as options to the callbacks?

Not sure myself (I'm not sure of the implementation details).
Would love if you could maybe tag someone who could answer that better.

Ido

haha ok. yeah, i think that’s the sensible option since this allows any metric to be monitored by any callback in the future

yeah, that will solve issues with ModelCheckpoint too. Also, why even add checkpoint_on/early_stop_on as a parameter there? Can't we just use log itself with checkpoint=True/False, just like on_epoch/on_step, if we feed the logs to the callbacks? Just a suggestion.

because we need a single, unique value to checkpoint/early stop on. with log we can’t enforce that. and also if you want to change what to ckpt on during training, this approach allows that.

ie: in some tasks, i may use loss for a while, but switch to a metric after a certain number of epochs
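As an illustration (not from the thread): a hedged sketch of switching the checkpointed value mid-training with the Result API, where the 10-epoch threshold, the accuracy metric, and the sign flip are assumptions:

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.nll_loss(logits, y)
    acc = accuracy(torch.argmax(logits, dim=1), y)
    # checkpoint on loss for the first 10 epochs, then switch to accuracy;
    # accuracy is negated on the assumption that the checkpoint value is minimized
    ckpt_value = loss if self.current_epoch < 10 else -acc
    result = pl.EvalResult(checkpoint_on=ckpt_value)
    result.log('val_loss', loss)
    result.log('val_acc', acc)
    return result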

then maybe take the first value in case multiple checkpoint values are passed with log, and raise a warning there?

result.log('metric1', metric1_val, checkpoint=True)
result.log('metric2', metric2_val, checkpoint=True)

take metric1_val for checkpoint.

let's go with the current approach in the API and iterate on it if it causes issues.

I am still confused about how to use ReduceLROnPlateau; are there any simple examples that explain it?

@williamFalcon I checked #3004 and found that the scheduler's monitor seems to only find metrics logged in training_step, but not in validation_step. Here is the code running in Colab (modified from the MNIST Hello World example).

import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy


class LitMNIST(pl.LightningModule):

    def __init__(self, data_dir='./', hidden_size=64, learning_rate=2e-4):

        super().__init__()

        # Set our init args as class attributes
        self.data_dir = data_dir
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

        # Hardcode some dataset specific attributes
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])

        # Define PyTorch model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes)
        )

    def forward(self, x):
        x = self.model(x)
        return F.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        result = pl.TrainResult(loss)
        result.log('train_loss', loss)
        return result

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)
        result = pl.EvalResult(checkpoint_on=loss)

        # Calling result.log will surface up scalars for you in TensorBoard
        result.log('val_loss', loss, prog_bar=True)
        result.log('val_acc', acc, prog_bar=True)
        return result

    def test_step(self, batch, batch_idx):
        # Here we just reuse the validation_step for testing
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        scheduler = {'scheduler': lr_scheduler, 'interval': 'step', 'monitor': 'val_loss'}
        return [optimizer], [scheduler]

    ####################
    # DATA RELATED HOOKS
    ####################

    def prepare_data(self):
        # download
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):

        # Assign train/val datasets for use in dataloaders
        if stage == 'fit' or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == 'test' or stage is None:
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=32)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=32)

model = LitMNIST()
trainer = pl.Trainer(gpus=1, max_epochs=3, fast_dev_run=False, progress_bar_refresh_rate=20)
trainer.fit(model)

@invisprints in case you haven't figured it out, the note sections in this doc would be helpful for you: https://pytorch-lightning.readthedocs.io/en/stable/lightning-module.html#configure-optimizers

So your configure_optimizers() should be something like:

def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    # reduce every epoch (default)
    scheduler = {
        'scheduler': lr_scheduler,
        'reduce_on_plateau': True,
        # val_checkpoint_on is val_loss passed in as checkpoint_on
        'monitor': 'val_checkpoint_on'
    }
    return [optimizer], [scheduler]

@yukw777 isn't *_checkpoint_on specific to checkpoints? I mean one must set the ReduceLROnPlateau monitor equal to what the checkpoint monitor is? They can't be different?

@yukw777 @rohitgr7 yep, I mean if I want to monitor something different from the checkpoint_on metric, what should I do? Because in the test code, it seems we can monitor anything we want, and it works in the training step but not in the val step.

I decided to rewrite my code for the new API, but still didn't figure out how to use ReduceLROnPlateau :(

I got it


    def training_step(self, batch, batch_idx):
        loss = self.calc_loss(batch)

        res = pl.TrainResult(loss)
        res.log("train_loss", loss, prog_bar=True)

        return res

    def validation_step(self, batch, batch_idx):
        loss = self.calc_loss(batch)

        res = pl.EvalResult(checkpoint_on=loss)
        res.log("val_loss", loss, prog_bar=True)
        return res

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(params=self.parameters(),
                                     lr=self.hparams.lr,
                                     weight_decay=self.hparams.l2_norm)

        lr_scheduler = ReduceLROnPlateau(optimizer, patience=10, factor=0.9, verbose=True)

        scheduler = {
            'scheduler': lr_scheduler,
            'reduce_on_plateau': True,
            # val_checkpoint_on is val_loss passed in as checkpoint_on
            'monitor': 'val_checkpoint_on'
        }
        return [optimizer], [scheduler]

@invisprints did your question get answered?

No, what I want is to be able to monitor anything we want, not just val_checkpoint_on.

marking as duplicate.
