BERT: What exactly is the learning rate warmup described in the paper?

Created on 10 Feb 2019 · 7 Comments · Source: google-research/bert

[Image: screenshot from 2019-02-10 19-31-45]

JoaoLages

All 7 comments

It means that if you set the learning rate to, say, 2e-5, then during training the learning rate is increased linearly from approximately 0 to 2e-5 over the first 10,000 steps.

hsm207 on 14 Feb 2019

👍27

Thanks, but why don't they specifically mention "linear warmup" when they do mention "linear decay" afterwards?

JoaoLages on 16 Feb 2019

Actually, they do linear warmup first (as I described in my previous reply) and then linearly decrease the learning rate (i.e. linear decay) from 2e-5 to 0 over the remaining steps (still using my previous reply as an example). Check out this part of the code for implementation details.
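
As a rough illustration of the schedule described above, here is a minimal plain-Python sketch. It is not the repository's code: the function name and the `total_steps=100_000` default are made up for the example, and the real implementation may differ in details such as exactly where the decay is anchored.

```python
def linear_warmup_then_decay(step, peak_lr=2e-5, warmup_steps=10_000,
                             total_steps=100_000):
    """Learning rate at a given step (illustrative sketch, hypothetical names).

    Assumes warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, 1.0 - progress)


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, linear_warmup_then_decay(s))
```

The learning rate peaks at the end of warmup (step 10,000 here) and is clamped at 0 for any step past `total_steps`.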

hsm207 on 17 Feb 2019

👍9

OK @hsm207, thanks a lot! Going to close the issue.

JoaoLages on 18 Feb 2019

👍1

Hello @hsm207
Why is warmup required for BERT (MLM, trained from scratch)?
Also, generally speaking, I think the idea of cosine annealing somewhat contradicts that.

mahdirezaey on 29 Mar 2020

hi @mahdirezaey

This way of adjusting the learning rate is called learning rate scheduling. It has been found to improve the performance of deep learning models. The intuition is that when you start training, you are probably far from an optimum, so you can afford to take big steps, i.e. use a large learning rate. After a while, you probably have an idea of where a reasonable optimum is, so you should take smaller steps to converge to it.

If you are interested in exploring the history behind this idea, I recommend starting with this paper:

https://arxiv.org/abs/1803.09820

and then work backwards by reading the references there.
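
On the cosine annealing point: a decay schedule such as cosine annealing encodes the same "big steps first, smaller steps later" intuition, and a short warmup can simply be placed in front of it, so the two do not have to conflict. A minimal sketch, assuming a hypothetical function name and a made-up `total_steps=100_000` (this is not taken from the BERT repository):

```python
import math


def warmup_then_cosine(step, peak_lr=2e-5, warmup_steps=10_000,
                       total_steps=100_000):
    """Linear warmup followed by cosine annealing (illustrative sketch).

    Assumes warmup_steps < total_steps.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Annealing: follow half a cosine wave from peak_lr down to 0.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 10_000, 55_000, 100_000):
        print(s, warmup_then_cosine(s))
```

Only the shape of the decay differs from the linear case; the warmup phase is the same.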

hsm207 on 29 Mar 2020

Another advantage of a high learning rate near the beginning (after warmup, which is a separate issue) is that it has a regularisation effect: training tends to end up in a relatively "flat" part of parameter space (i.e. one where the Hessian of the loss is relatively small). The idea of "super-convergence" tries to exploit this (paper, blog).