BERT: Are linear decay, L2 normalization, and learned positional embeddings essential to performance?

Created on 2 Nov 2018 · 3 Comments · Source: google-research/bert

Hello, I've been training my own version of BERT (i.e. not from this repo, but I think the main idea is implemented) on Chinese for over a week, but the performance is not very promising. (The problem could be the implementation, the dataset, the training time, or the language difference between English and Chinese.) As for the optimizer, I use Adam without linear decay and L2 normalization, and I use sinusoidal positional embeddings to reduce the number of variables. Could you tell me how important these are? Are they essential to the final performance? Are there any tricks for transferring to other languages? Thanks very much!
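For readers unfamiliar with the term, "linear decay" here refers to the learning-rate schedule. The sketch below is a minimal illustration, not this repo's exact code: it assumes a linear warmup followed by a linear decay to zero, and the default values (peak rate 1e-4, 10,000 warmup steps, 1M total steps) follow the pre-training setup described in the BERT paper. The function name is hypothetical.

```python
def linear_decay_lr(step, base_lr=1e-4, total_steps=1_000_000, warmup_steps=10_000):
    """Learning rate with linear warmup, then linear decay to zero (illustrative sketch)."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr at warmup_steps to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 505_000, 1_000_000):
        print(step, linear_decay_lr(step))
```

Training without such a schedule means the learning rate stays at its initial value for the whole run.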

LorrinWWW

Most helpful comment

You're using a batch size of 32,000 words/batch for 250k steps, so it's been trained for about 6% as much as BERT, which was run with 128,000 words/batch for 1M steps. If you look at Figure 4 in the paper (second to last page), the BERT results grow rapidly during the first 20% of training. Note that the figure is accuracy of a downstream task after fine-tuning from the pre-training checkpoint.
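The "about 6%" figure can be checked with a quick calculation. This sketch assumes the batch size of 256 and sequence length of 128 reported by the poster in this thread, and 256 sequences of 512 tokens for 1M steps for BERT as described in the paper. The function name is hypothetical.

```python
def tokens_seen(batch_size, seq_len, steps):
    """Total number of token positions processed during pre-training."""
    return batch_size * seq_len * steps


# Poster's run: 256 sequences x 128 tokens (about 32k words/batch) for 250k steps.
poster_run = tokens_seen(batch_size=256, seq_len=128, steps=250_000)

# BERT as described in the paper: 256 sequences x 512 tokens
# (about 128k words/batch) for 1M steps.
bert_run = tokens_seen(batch_size=256, seq_len=512, steps=1_000_000)

fraction = poster_run / bert_run
print(f"{fraction:.2%}")  # prints 6.25%
```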

That accuracy seems about right, but the better way to track progress is to fine-tune on a downstream task. So if you have some Chinese sentence classification task, try fine-tuning from checkpoints at 100k steps, 150k steps, 200k steps, 250k steps. If the accuracy of the downstream task keeps improving significantly then it's probably just a training time issue.
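One way to apply this check is to record the fine-tuned accuracy for each checkpoint and look at the gain between consecutive ones. The sketch below is only an illustration: the function names, the `min_gain` threshold of 0.005, and the accuracy values are all hypothetical placeholders, not measured results.

```python
def gains_between_checkpoints(acc_by_step):
    """Return (step, accuracy gain over the previous checkpoint) pairs."""
    steps = sorted(acc_by_step)
    return [(b, acc_by_step[b] - acc_by_step[a]) for a, b in zip(steps, steps[1:])]


def still_improving(acc_by_step, min_gain=0.005):
    """True if the most recent checkpoint gained at least min_gain (an assumed threshold)."""
    gains = gains_between_checkpoints(acc_by_step)
    return bool(gains) and gains[-1][1] >= min_gain


if __name__ == "__main__":
    # Placeholder accuracies for illustration only; replace with your own
    # fine-tuning results at each pre-training checkpoint.
    example = {100_000: 0.80, 150_000: 0.83, 200_000: 0.85, 250_000: 0.86}
    print(gains_between_checkpoints(example))
    print(still_improving(example))
```

If the latest gain is still above the threshold, more pre-training is likely to help.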

Due to the high demand, I'm planning on getting out a multilingual BERT-Base model very soon (maybe early next week?), which has a particular focus on simplified and traditional Chinese. Then you can run pre-training for additional steps starting from that model, and hopefully it will converge much faster.

jacobdevlin-google on 2 Nov 2018

All 3 comments

The number of parameters from learned position embeddings is pretty trivial, so I would just use learned embeddings. The other things are not essential. My guess would be that the issue is the training time. I'm not sure what hardware you've used, but for example if you trained on one GPU for a week then the results will probably not be very good. What is your sequence length, batch size, and number of steps you've trained for? Or it could be the implementation. In any case, switching to this codebase should fix any implementation issues (but not solve the training time issue). I've been training a multilingual BERT model on character-tokenized Chinese (along with a lot of other languages) and the results seem good, so I don't think it's anything to do with the language.
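To put "pretty trivial" in numbers, here is a rough count. It assumes the BERT-Base configuration from the paper (hidden size 768, maximum sequence length 512, about 110M parameters in total); the poster's own model may use different sizes, and the function name is hypothetical.

```python
def learned_position_params(max_positions, hidden_size):
    """Number of weights in a learned position-embedding table."""
    return max_positions * hidden_size


# Assumed BERT-Base values from the paper; adjust for your own model.
MAX_POSITIONS = 512
HIDDEN_SIZE = 768
TOTAL_PARAMS = 110_000_000  # approximate total reported for BERT-Base

position_params = learned_position_params(MAX_POSITIONS, HIDDEN_SIZE)
share = position_params / TOTAL_PARAMS

print(position_params)  # prints 393216
print(f"{share:.2%}")   # prints 0.36%
```

Sinusoidal embeddings would save only these weights, well under 1% of the model under these assumptions.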

jacobdevlin-google on 2 Nov 2018

Thanks for your reply. I'm training it on 8 GPUs with a batch size of 256 and a sequence length limited to 128, and the global step is 250k so far. I think the main issues may be the training time and the quality of the corpus.

May I ask you another question? When training the masked LM, what is the typical accuracy of predicting the masked tokens? I ran it on wiki, news, and comments data, and the accuracy varied between 60% and 70%, but I have no idea at what point we can say that it has converged well. Is it necessary to fine-tune it on a specific task to judge its convergence?
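For clarity, the accuracy I mean is computed only over the masked positions, roughly as in this sketch. The function and argument names are hypothetical and the token ids are made-up placeholders.

```python
def masked_lm_accuracy(predicted_ids, target_ids, is_masked):
    """Fraction of masked positions where the predicted token id matches the target."""
    correct = 0
    total = 0
    for pred, target, masked in zip(predicted_ids, target_ids, is_masked):
        if masked:
            # Only masked positions count toward the metric.
            total += 1
            correct += int(pred == target)
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Placeholder ids: two masked positions, one predicted correctly.
    preds = [5, 7, 9, 2]
    targets = [5, 8, 9, 3]
    mask = [True, True, False, False]
    print(masked_lm_accuracy(preds, targets, mask))  # prints 0.5
```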

LorrinWWW on 2 Nov 2018