I have implemented BERT, taking the output of the [CLS] token and feeding it to a linear layer on top to do regression. I froze the embedding layers of BERT, though. I was using the standard Adam optimizer and did not run into any issues.
When and why should one use BertAdam? And, in a set-up like mine, would you use BertAdam for BERT and regular Adam for the rest of the model?
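For context, a minimal sketch of the set-up described above: the [CLS] output fed to a linear regression head, with the embedding layers frozen. The attribute names (`embeddings`, `config.hidden_size`, `last_hidden_state`) follow the transformers-style API and are assumptions here, not the exact original code.

```python
import torch
import torch.nn as nn

class BertRegressor(nn.Module):
    """Hypothetical regression head on top of a BERT-style encoder."""

    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        # Freeze only the embedding layers, as described above.
        for p in self.bert.embeddings.parameters():
            p.requires_grad = False
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        cls = outputs.last_hidden_state[:, 0]  # [CLS] token representation
        return self.regressor(cls).squeeze(-1)  # one scalar per example
```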
Some explanation is given here: https://github.com/huggingface/pytorch-pretrained-BERT/blob/694e2117f33d752ae89542e70b84533c52cb9142/README.md#optimizers
BertAdam is a torch.optimizer adapted to be closer to the optimizer used in the TensorFlow implementation of Bert. The differences with the PyTorch Adam optimizer are the following: BertAdam implements the weight decay fix, and BertAdam doesn't compensate for bias as in the regular Adam optimizer.
@stefan-it Thanks for the link. These improvements are not the same as the suggested AdamW improvements, I assume?
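For concreteness, the bias-compensation difference described in that README can be sketched as a toy single-parameter Adam step. This is an illustration, not the library code:

```python
import torch

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              bias_correction=True):
    """One Adam update on parameter p; m, v are the running moments, t >= 1."""
    m.mul_(b1).add_(grad, alpha=1 - b1)             # first moment estimate
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)   # second moment estimate
    if bias_correction:
        # Regular Adam: compensate for the zero initialization of m and v.
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
    else:
        # BertAdam-style: no compensation.
        m_hat, v_hat = m, v
    p.sub_(lr * m_hat / (v_hat.sqrt() + eps))
    return p
```

Without the correction, the earliest steps are effectively larger (a higher effective learning rate), since the second-moment correction shrinks faster than the first-moment one.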
Yes, they are the same. BertAdam implements AdamW and in addition doesn't compensate for the bias (I don't know why the Google team decided to do that, but that's what they did).
In most cases we have been using standard Adam with good performance (for example by using NVIDIA's apex FusedAdam as optimizer), so you probably shouldn't worry too much about the differences between the two. We've incorporated BertAdam mostly to be able to exactly reproduce the behavior of the TensorFlow implementation.
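The "weight decay fix" mentioned above is the decoupled weight decay of AdamW. A minimal sketch of the difference, under the same toy single-parameter set-up (assumed, not the library implementation): classic "Adam + L2" folds the decay into the gradient, where it then gets rescaled by Adam's adaptive denominator, whereas AdamW applies the decay to the weights directly.

```python
import torch

def adamw_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
               wd=0.01, decoupled=True):
    """One Adam update with weight decay, coupled (L2) or decoupled (AdamW)."""
    if not decoupled:
        grad = grad + wd * p            # Adam + L2: decay enters the moments
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)
    if decoupled:
        p = p - lr * wd * p             # AdamW: decay applied to weights directly
    return p
```

In the decoupled form the decay strength no longer depends on the gradient history, which is why AdamW-style decay behaves more like true weight decay under adaptive updates.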
@thomwolf Thanks, that explanation really helps. I have been using standard Adam with good results and BertAdam didn't improve that. So in my particular case it may not have been useful.
Closing this, as my question has been answered.
For reference for future visitors: recent research suggests that the omission of bias compensation in BertAdam is one of the sources of instability in fine-tuning: