I have implemented BERT, taking the output of the [CLS] token and feeding it to a linear layer on top to do regression. I froze the embedding layers of BERT, though. I was using the standard Adam optimizer and did not run into any issues.
When and why should one use BertAdam? And, in a set-up like mine, would you use BertAdam for BERT and regular Adam for the rest of the model?
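For context, a minimal sketch of the set-up described above: the [CLS] output fed to a linear regression head, with the embedding layers frozen. The attribute names (`embeddings`, `config.hidden_size`, `last_hidden_state`) follow the transformers-style API and are assumptions here, not the exact original code.

```python
import torch
import torch.nn as nn

class BertRegressor(nn.Module):
    """Hypothetical regression head on top of a BERT-style encoder."""

    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        # Freeze only the embedding layers, as described above.
        for p in self.bert.embeddings.parameters():
            p.requires_grad = False
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.regressor(cls).squeeze(-1)  # one scalar per example
```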
Some explanation is given here: https://github.com/huggingface/pytorch-pretrained-BERT/blob/694e2117f33d752ae89542e70b84533c52cb9142/README.md#optimizers
BertAdam is a torch.optimizer adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with the PyTorch Adam optimizer are the following: BertAdam implements the weight decay fix, and BertAdam doesn't compensate for bias as in the regular Adam optimizer.
@stefan-it Thanks for the link. These improvements are not the same as the suggested AdamW improvements, I assume?
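For concreteness, the bias-compensation difference described in that README can be sketched as a toy single-parameter Adam step. This is an illustration, not the library code:

```python
import torch

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              bias_correction=True):
    """One Adam update on parameter p; m, v are the running moments, t >= 1."""
    m.mul_(b1).add_(grad, alpha=1 - b1)             # first moment estimate
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)   # second moment estimate
    if bias_correction:
        # Regular Adam: compensate for the zero initialization of m and v.
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
    else:
        # BertAdam-style: no compensation.
        m_hat, v_hat = m, v
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))
    return p
```

Without the correction, the earliest steps are effectively larger (a higher effective learning rate), since the second-moment correction shrinks faster than the first-moment one.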
Yes, they are the same. BertAdam implements AdamW and in addition doesn't compensate for the bias (I don't know why the Google team decided to do that, but that's what they did).
In most cases we have been using standard Adam with good performance (for example by using NVIDIA's apex FusedAdam as optimizer), so you probably shouldn't worry too much about the differences between the two. We've incorporated BertAdam mostly to be able to exactly reproduce the behavior of the TensorFlow implementation.
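The "weight decay fix" mentioned above is the decoupled weight decay of AdamW. A minimal sketch of the difference, under the same toy single-parameter set-up (assumed, not the library implementation): classic "Adam + L2" folds the decay into the gradient, where it then gets rescaled by Adam's adaptive denominator, whereas AdamW applies the decay to the weights directly.

```python
import torch

def adamw_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
               wd=0.01, decoupled=True):
    """One Adam update with weight decay, coupled (L2) or decoupled (AdamW)."""
    if not decoupled:
        grad = grad + wd * p            # Adam + L2: decay enters the moments
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        p = p - lr * wd * p             # AdamW: decay applied to weights directly
    return p
```

In the decoupled form the decay strength no longer depends on the gradient history, which is why AdamW-style decay behaves more like true weight decay under adaptive updates.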
@thomwolf Thanks, that explanation really helps. I have been using standard Adam with good results and BertAdam didn't improve that. So in my particular case it may not have been useful.
Closing this, as my question has been answered.
For reference for future visitors: recent research suggests that the omission of bias compensation in BertAdam is one of the sources of instability in fine-tuning: