I find that the logging design has changed a lot between version 0.8.5 and the master branch (0c26468).
I got the following error message when I followed the docs (logging-from-a-lightningmodule) and modified my logging code.
error message:
MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: val_early_stop_on,val_checkpoint_on,epoch,checkpoint_on. Condition can be set using `monitor` key in lr scheduler dict
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=0.01)
    scheduler = ReduceLROnPlateau(optimizer, patience=10)
    return [optimizer], [scheduler]

def validation_step(self, batch, batch_nb):
    x, y = batch
    y_hat = self(x)
    loss = F.l1_loss(y, y_hat)
    result = pl.EvalResult()
    result.log('val_step_loss', loss)
    return result

def validation_epoch_end(self, outputs):
    avg_loss = outputs.val_step_loss.mean()
    result = pl.EvalResult()
    result.log('val_loss', avg_loss)
    return result
I wonder whether I should define validation_epoch_end like the above. Are there any examples of how to use ReduceLROnPlateau the right way?
I think it's a bug, since callback_metrics contains only val_early_stop_on and val_checkpoint_on.
https://github.com/PyTorchLightning/pytorch-lightning/blob/5bce06c05023b9798c42533bc1e7e5868930dcdb/pytorch_lightning/core/step_result.py#L727-L733
https://github.com/PyTorchLightning/pytorch-lightning/blob/5bce06c05023b9798c42533bc1e7e5868930dcdb/pytorch_lightning/trainer/training_loop.py#L1254-L1263
yeah, that's a bug with the new version. adding a fix now.
What should we condition it on? I guess the obvious choice is the checkpoint key?
result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)
This does assume the scheduler currently adjusts based on the val loop and not the train loop?
result = EvalResult(checkpoint_on=the_thing_to_lr_reduce_on)
In that case, if someone assigns checkpoint_on to be a metric with EvalResult(checkpoint_on=some_metric_tensor) and sets ReduceLROnPlateau to monitor val_loss, then there might be a conflict here.
The point is that when using ReduceLROnPlateau there is no longer a "val_loss"... it will monitor whatever the value of checkpoint_on is.
val_loss is just an example. My point is: if checkpoint_on=metric1 and monitor=metric2 for ReduceLROnPlateau, then there's a conflict. We should not force ReduceLROnPlateau to monitor metric1.
haha, I think you're still missing the point.
The keyword 'monitor' has no effect when using EvalResult... instead, ReduceLROnPlateau will look at whatever is in checkpoint_on.
You could set monitor='giraffe' for ReduceLROnPlateau and it wouldn't matter; Lightning will use whatever is in checkpoint_on=X.
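Roughly, the behavior being described looks like this (only a sketch; the module skeleton, learning rate, and the 'giraffe' string are illustrative, not from the thread):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class SketchModule(pl.LightningModule):
    # hypothetical module, just to show where checkpoint_on and 'monitor' sit
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        # whatever is passed as checkpoint_on is what ReduceLROnPlateau ends up tracking
        return pl.EvalResult(checkpoint_on=loss)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
            'reduce_on_plateau': True,
            'monitor': 'giraffe',  # ignored with EvalResult; checkpoint_on wins
        }
        return [optimizer], [scheduler]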
Yeah, maybe I am missing something here 😅 but I'm not sure how this will work if we take the checkpoint_on value for ReduceLROnPlateau:
model_checkpoint = ModelCheckpoint(monitor='val_loss')
...

def validation_epoch_end(self, outputs):
    val_loss = ...
    val_recall = ...
    # not familiar with EvalResult, but I guess checkpoint_on will be used for ModelCheckpoint
    res = pl.EvalResult(checkpoint_on=val_loss)
    res.log('val_recall', val_recall)
    return res

def configure_optimizers(self):
    optimizer = ...
    scheduler = {'scheduler': ReduceLROnPlateau(optimizer), 'interval': 'epoch', 'monitor': 'val_recall'}
    return [optimizer], [scheduler]
Again... monitor has NO effect anywhere with the Result objects. It doesn't matter which callback uses the word.
Had the same question so arrived here.
@williamFalcon, I think what @rohitgr7 means is that there might be cases where someone wishes to use ReduceLROnPlateau on metric1 and to save checkpoints on metric2.
i.e., I wish to use ReduceLROnPlateau on train_loss (to allow the network to (over)fit in case the lr is not low enough) and use checkpoint_on='val_acc' to save the best model during training.
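To make the use case concrete, here is a rough sketch of the configuration I'd like to be able to express (names, numbers, and the dict-style hooks are only illustrative; whether this actually decouples the two monitors under the current Result API is exactly what's in question):

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

class DecoupledMonitors(pl.LightningModule):
    # hypothetical example: LR reduction tracks train_loss, checkpointing tracks val_acc
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return self.layer(x.view(x.size(0), -1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return {'loss': loss, 'log': {'train_loss': loss}}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(dim=1) == y).float().mean()
        return {'val_acc': acc}

    def validation_epoch_end(self, outputs):
        avg_acc = torch.stack([o['val_acc'] for o in outputs]).mean()
        return {'val_acc': avg_acc, 'log': {'val_acc': avg_acc}}

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = {
            'scheduler': torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer),
            'monitor': 'train_loss',  # reduce LR when training loss plateaus
        }
        return [optimizer], [scheduler]

# checkpointing keyed on a different metric than the scheduler
checkpoint_callback = ModelCheckpoint(monitor='val_acc', mode='max')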
OK, then we should allow the logged values to be fed as options to the callbacks?
Not sure myself (I don't know the implementation details). I'd love it if you could tag someone who could answer that better.
Ido
haha ok. yeah, i think that’s the sensible option since this allows any metric to be monitored by any callback in the future
Yeah, that will solve issues with ModelCheckpoint too. Also, why even add checkpoint_on/early_stop_on as parameters there? Can't we just use log itself with checkpoint=True/False, just like on_epoch/on_step, if we feed the logs to the callbacks? Just a suggestion.
Because we need a single, unique value to checkpoint/early stop on; with log we can't enforce that. Also, if you want to change what to checkpoint on during training, this approach allows that.
i.e. in some tasks, I may use the loss for a while, but switch to a metric after a certain number of epochs.
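For example, something along these lines (a sketch only; the epoch threshold, metrics, and the sign flip are made up for illustration, and this sits inside the LightningModule):

def validation_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = F.cross_entropy(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    # checkpoint (and therefore plateau-track) the loss for the first 10 epochs,
    # then switch to accuracy (negated so "lower is better" still holds)
    target = loss if self.current_epoch < 10 else -acc
    result = pl.EvalResult(checkpoint_on=target)
    result.log('val_loss', loss)
    result.log('val_acc', acc)
    return result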
Then maybe take the first value in case multiple checkpoint values are passed with log, and raise a warning there?
result.log('metric1', metric1_val, checkpoint=True)
result.log('metric2', metric2_val, checkpoint=True)
i.e. take metric1_val for the checkpoint.
let's go with the current approach in the API and iterate on it if it causes issues.
I am still confused about how to use ReduceLROnPlateau; are there any simple examples that explain it?
@williamFalcon I checked #3004 and found that the monitor only seems to pick up metrics logged in training_step, not validation_step. Here is the code running in Colab (modified from the MNIST Hello World example).
import torch
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
import pytorch_lightning as pl
from pytorch_lightning.metrics.functional import accuracy


class LitMNIST(pl.LightningModule):
    def __init__(self, data_dir='./', hidden_size=64, learning_rate=2e-4):
        super().__init__()

        # Set our init args as class attributes
        self.data_dir = data_dir
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

        # Hardcode some dataset specific attributes
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])

        # Define PyTorch model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes)
        )

    def forward(self, x):
        x = self.model(x)
        return F.log_softmax(x, dim=1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        result = pl.TrainResult(loss)
        result.log('train_loss', loss)
        return result

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y)
        result = pl.EvalResult(checkpoint_on=loss)
        # Calling result.log will surface up scalars for you in TensorBoard
        result.log('val_loss', loss, prog_bar=True)
        result.log('val_acc', acc, prog_bar=True)
        return result

    def test_step(self, batch, batch_idx):
        # Here we just reuse the validation_step for testing
        return self.validation_step(batch, batch_idx)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
        scheduler = {'scheduler': lr_scheduler, 'interval': 'step', 'monitor': 'val_loss'}
        return [optimizer], [scheduler]

    ####################
    # DATA RELATED HOOKS
    ####################

    def prepare_data(self):
        # download
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == 'fit' or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == 'test' or stage is None:
            self.mnist_test = MNIST(self.data_dir, train=False, transform=self.transform)

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=32)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=32)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=32)
model = LitMNIST()
trainer = pl.Trainer(gpus=1, max_epochs=3, fast_dev_run=False, progress_bar_refresh_rate=20)
trainer.fit(model)
@invisprints in case you haven't figured it out, the note sections in this doc would be helpful for you: https://pytorch-lightning.readthedocs.io/en/stable/lightning-module.html#configure-optimizers
So your configure_optimizers() should be something like:
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
    # reduce every epoch (default)
    scheduler = {
        'scheduler': lr_scheduler,
        'reduce_on_plateau': True,
        # val_checkpoint_on is val_loss passed in as checkpoint_on
        'monitor': 'val_checkpoint_on'
    }
    return [optimizer], [scheduler]
@yukw777 isn't *_checkpoint_on specific to checkpoints? I mean, one must set the ReduceLROnPlateau monitor equal to whatever the checkpoint monitor is? They can't be different?
@yukw777 @rohitgr7 yep, I mean: if I want to monitor something different from the checkpoint_on monitor, what should I do? Because in the test code, it seems we can monitor anything we want, and it works in the training step but not in the val step.
Decided to rewrite my code for the new API, but still didn't get how to use ReduceLROnPlateau :(
I got it:
def training_step(self, batch, batch_idx):
    loss = self.calc_loss(batch)
    res = pl.TrainResult(loss)
    res.log("train_loss", loss, prog_bar=True)
    return res

def validation_step(self, batch, batch_idx):
    loss = self.calc_loss(batch)
    res = pl.EvalResult(checkpoint_on=loss)
    res.log("val_loss", loss, prog_bar=True)
    return res

def configure_optimizers(self):
    optimizer = torch.optim.Adam(params=self.parameters(),
                                 lr=self.hparams.lr,
                                 weight_decay=self.hparams.l2_norm)
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=10, factor=0.9, verbose=True)
    scheduler = {
        'scheduler': lr_scheduler,
        'reduce_on_plateau': True,
        # val_checkpoint_on is val_loss passed in as checkpoint_on
        'monitor': 'val_checkpoint_on'
    }
    return [optimizer], [scheduler]
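For anyone else landing here: a quick way to check which keys a scheduler or callback can actually monitor is to inspect trainer.callback_metrics after a short run (a sketch only, assuming the LitMNIST module posted earlier in this thread):

model = LitMNIST()
trainer = pl.Trainer(max_epochs=1)
trainer.fit(model)
print(trainer.callback_metrics.keys())
# with EvalResult(checkpoint_on=loss), expect keys like the ones from the error above:
# 'val_early_stop_on', 'val_checkpoint_on', 'epoch', 'checkpoint_on'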
@invisprints did your question get answered?
No, what I want is to be able to monitor anything we want, not just val_checkpoint_on.
marking as duplicate.