The paper states that the model was pretrained for 1M steps. Roughly how high should masked_lm_accuracy be at the end of training? Was a development set used in the paper?
I don't know the exact accuracy, but the held-out natural log likelihood for BERT-Base was around -1.4 (so e^1.4 ≈ 4.0 is the perplexity I show in a table). It depends on the language and corpus, but you should definitely expect something better than -2.0 (better than == closer to zero).
But a much better way of measuring progress is to take intermediate checkpoints and use them to fine-tune a downstream task.
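For reference, the conversion between log likelihood and perplexity mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical (not part of the BERT codebase), and it assumes the value is a mean per-token log likelihood in nats.

```python
import math


def perplexity_from_log_likelihood(mean_log_likelihood: float) -> float:
    """Convert a mean per-token natural-log likelihood (<= 0) to perplexity.

    Hypothetical helper for illustration; assumes the input is in nats.
    """
    return math.exp(-mean_log_likelihood)


# The BERT-Base figure quoted above: log likelihood of about -1.4.
print(round(perplexity_from_log_likelihood(-1.4), 2))  # 4.06

# The -2.0 threshold quoted above corresponds to a higher (worse) perplexity.
print(round(perplexity_from_log_likelihood(-2.0), 2))  # 7.39
```

A log likelihood closer to zero gives a lower perplexity, which is why "better than -2.0" means a perplexity below roughly 7.4.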
Same question, thank you.
in the readme 👍
Eval results
global_step = 20
loss = 0.0979674
masked_lm_accuracy = 0.985479
masked_lm_loss = 0.0979328
next_sentence_accuracy = 1.0
next_sentence_loss = 3.45724e-05