What is the most appropriate way to add learning rate warmup?
I am thinking about using the on_batch_end(self) hook, but I am not sure where to put this function. Thank you.
You can use a learning rate scheduler and return it in configure_optimizers.
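For example, a minimal sketch (using a standard torch.optim scheduler such as StepLR; the values are placeholders):

    from torch.optim.lr_scheduler import StepLR

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # decay the LR by 10x every 10 epochs; schedulers returned this way
        # are stepped once per epoch by default
        scheduler = StepLR(optimizer, step_size=10, gamma=0.1)
        return [optimizer], [scheduler]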
Well, learning rate warmup changes the learning rate every batch, while most learning rate schedulers only change it after each epoch. Can you explain how to use configure_optimizers to do LR warmup?
Same question here. In the Transformer, the LR is adjusted per training step, not per epoch. Is there a solution?
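(For reference, the per-step schedule from the original Transformer paper, "Attention Is All You Need", looks roughly like this; d_model and warmup_steps are the paper's symbols, not anything defined in this thread:)

    # Noam schedule: linear warmup for warmup_steps, then inverse square-root decay
    def transformer_lr(step, d_model=512, warmup_steps=4000):
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)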
You can also override optimizer_step and do it there. Here's an example where the first 500 batches are used for warm-up.
    def optimizer_step(self, epoch_nb, batch_nb, optimizer, optimizer_i, opt_closure):
        # linearly ramp the LR from 0 up to the base LR over the first 500 steps
        if self.trainer.global_step < 500:
            lr_scale = min(1., float(self.trainer.global_step + 1) / 500.)
            for pg in optimizer.param_groups:
                pg['lr'] = lr_scale * self.hparams.learning_rate
        # update parameters and reset gradients as usual
        optimizer.step()
        optimizer.zero_grad()
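Note that the exact optimizer_step signature depends on the Lightning version; in more recent releases the hook also receives a closure that has to be passed to optimizer.step(), so check the documentation for the version you are running.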
So, I ended up with something like this:
    from torch.optim.lr_scheduler import LambdaLR

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr)

        def lr_foo(epoch):
            if epoch < self.hparams.warm_up_step:
                # warm up lr
                lr_scale = 0.1 ** (self.hparams.warm_up_step - epoch)
            else:
                lr_scale = 0.95 ** epoch
            return lr_scale

        scheduler = LambdaLR(optimizer, lr_lambda=lr_foo)
        return [optimizer], [scheduler]
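One thing to keep in mind: schedulers returned this way are stepped once per epoch by default, so lr_foo receives an epoch index. If you want to sanity-check the resulting schedule, recent Lightning versions ship a LearningRateMonitor callback (called LearningRateLogger in older releases) that can log the LR at every step, e.g.:

    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import LearningRateMonitor

    # log the current LR each step so the warmup curve shows up in your logger
    trainer = Trainer(callbacks=[LearningRateMonitor(logging_interval='step')])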

PS to the PyTorch Lightning creators and contributors: thank you for contributing, I was searching for such an approach (defining loss/optim/etc. in the model class) for years!
I just stumbled upon this issue, as I was also looking for a way to make my LR scheduler update on each step instead of each epoch. After doing some additional research I found that there is a better way of doing this than overriding optimizer_step. I am guessing this feature wasn't available yet when this issue initially came up, but as of version 1.0.3 (I don't know the exact version it was added in) you can just do this:
    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.learning_rate)
        # InverseSquareRootLR is a custom scheduler, not part of torch.optim
        scheduler = InverseSquareRootLR(optimizer, self.lr_warmup_steps)
        return (
            [optimizer],
            [
                {
                    'scheduler': scheduler,
                    'interval': 'step',  # step the scheduler after every batch instead of every epoch
                    'frequency': 1,
                    'reduce_on_plateau': False,
                    'monitor': 'val_loss',
                }
            ],
        )
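InverseSquareRootLR above appears to be the poster's own class rather than a built-in torch.optim scheduler. If you don't have one handy, one common formulation of an inverse square-root warmup can be sketched with LambdaLR (the function and warmup value below are my own placeholders, not from this thread):

    from torch.optim.lr_scheduler import LambdaLR

    def inverse_sqrt(step, warmup_steps=4000):
        # linear warmup for warmup_steps, then decay proportional to 1/sqrt(step)
        step = max(step, 1)
        if step < warmup_steps:
            return step / warmup_steps
        return (warmup_steps / step) ** 0.5

    scheduler = LambdaLR(optimizer, lr_lambda=inverse_sqrt)

Returned with 'interval': 'step' as in the snippet above, this updates the LR after every batch.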