It means that if you specify your learning rate to be, say, 2e-5, then during training the learning rate will be linearly increased from approximately 0 to 2e-5 within the first 10,000 steps.
Thanks, but why don't they specifically mention 'linear warmup' when they do mention 'linear decay' afterwards?
Actually, they do linear warmup first (as I described in my previous reply) and then linearly decrease the learning rate (i.e. linear decay) from 2e-5 to 0 over the remaining number of steps (still using my previous reply as an example). Check out this part of the code for implementation details.
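For illustration, the schedule described in this reply could be sketched as below. This is a minimal sketch of the behaviour as described in this thread, not the actual BERT code; the function name and the total of 100,000 training steps are assumptions made up for the example, while the 2e-5 peak and 10,000 warmup steps come from the replies above.

```python
def linear_warmup_decay(step, peak_lr=2e-5, warmup_steps=10_000, total_steps=100_000):
    """Learning rate at a given step (hypothetical helper, not BERT's API).

    Assumes total_steps > warmup_steps. total_steps=100_000 is an
    illustrative value, not taken from the thread.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay: fall linearly from peak_lr to 0 over the remaining steps.
    remaining = total_steps - warmup_steps
    progress = min(step - warmup_steps, remaining) / remaining
    return peak_lr * (1.0 - progress)


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, linear_warmup_decay(s))
```

The learning rate peaks exactly at the end of warmup and reaches 0 at the final step; the real implementation may differ in details, so treat the linked code as the reference.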
Ok @hsm207, thanks a lot! going to close the issue.
hello @hsm207
Why is warmup required for BERT (MLM, from scratch)?
And generally speaking, the idea of cosine annealing seems to contradict that, I think.
hi @mahdirezaey
This way of adjusting the learning rate is called learning rate scheduling. It's been found to help improve the performance of deep learning models. The intuition is that when you start training, you are probably far from the optimum, so you can afford to take big steps, i.e. use a large learning rate. After a while, you probably have an idea of where a reasonable optimum is, so you should take smaller steps to converge to it.
If you are interested in exploring the history behind this idea, I recommend starting with this paper:
https://arxiv.org/abs/1803.09820
and then work backwards by reading the references there.
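On the cosine annealing question: warmup and cosine annealing are commonly combined, with warmup covering the first steps and the cosine curve handling the decay afterwards. A rough sketch, assuming the same 2e-5 peak and 10,000 warmup steps as the earlier example in this thread (the function name and the 100,000 total steps are made up for illustration):

```python
import math


def warmup_cosine(step, peak_lr=2e-5, warmup_steps=10_000, total_steps=100_000):
    """Linear warmup followed by cosine annealing (hypothetical helper).

    Assumes total_steps > warmup_steps. total_steps=100_000 is an
    illustrative value, not taken from the thread.
    """
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine annealing: follow half a cosine wave from peak_lr down to 0.
    remaining = total_steps - warmup_steps
    progress = min(step - warmup_steps, remaining) / remaining
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 5_000, 10_000, 55_000, 100_000):
        print(s, warmup_cosine(s))
```

The only difference from a linear decay is the shape of the curve after the peak, so the two ideas do not conflict.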
Another advantage of a high learning rate near the beginning (after warmup, which is another issue) is that it has a regularisation effect, as training ends up in a relatively "flat" part of parameter space (i.e. the Hessian of the loss is relatively small). The idea of "super-convergence" tries to utilise this (paper, blog).