BERT: When to stop training? What is a good validation loss to stop at? How to improve classification performance?

Created on 10 Nov 2018 · 22 Comments · Source: google-research/bert

Say one epoch (further pre-training on a large corpus, for a downstream classification task) takes 1,000,000 steps (100w), and I get a loss of 1.42 at step 470,000 (47w). Is that good enough to stop, or do I need to train for more steps?
Also, I tested the published Chinese BERT model on a Chinese sentiment corpus. The transformer seems to converge slowly, and its performance is not as good as RNN-based models. If I add an RNN layer on top of the transformer output, the result is a bit better, but still not as good as an RNN-only model.
To improve this, I also continued pre-training from the published model on a large sentiment-specific corpus (100,000 steps (10w), validation loss 1.56) before running the classification task.
So what can I do to improve the result? More pre-training steps? Adjust the learning rate? Or is it that, for Chinese corpora and text classification tasks, the transformer structure cannot perform as well as an RNN (GRU)?
I'm very interested in this because in the Kaggle Jigsaw toxic comment classification contest, the best models were all RNN-based, not transformer- or CNN-based.

chenghuige

Most helpful comment

I modified BERT for multiple-choice MRC (AI Challenger 2018 oqmrc), just like GPT, and got accuracy 0.91 on the dev set. The loss on dev (0.197) is higher than on the training set (0.06). But the training process is highly unstable, largely because of the small mini-batch (on a GTX 1080 Ti, the batch size is set to 6).

yyht on 12 Nov 2018

👍8 🎉2

All 22 comments

The best way to know when to stop pre-training is to take intermediate checkpoints and fine-tune them for a downstream task, and see when that stops helping (by more than some trivial amount). But keep in mind there is a lot of randomness so you'll want to take several random restarts for fine-tuning.

It's surprising to me that the published Chinese model is not SOTA on sentiment. Is this a public corpus that I can download and try? I wanted to try a few Chinese tasks other than just XNLI.

jacobdevlin-google on 10 Nov 2018
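
The checkpoint-selection advice above could be sketched roughly as follows. This is a minimal illustration, not code from the BERT repository: the function name `pick_stop_step`, the `min_gain` threshold, and all scores are hypothetical. It assumes you have already fine-tuned each intermediate checkpoint several times with different random seeds and recorded one downstream score per run.

```python
from statistics import mean


def pick_stop_step(scores_by_step, min_gain=0.005):
    """Return the last pre-training step whose mean fine-tuning score beat
    the best earlier checkpoint by more than min_gain, plus that mean score."""
    steps = sorted(scores_by_step)
    best_step = steps[0]
    best_score = mean(scores_by_step[best_step])
    for step in steps[1:]:
        score = mean(scores_by_step[step])  # average over random restarts
        if score - best_score > min_gain:
            best_step, best_score = step, score
    return best_step, best_score


# Hypothetical downstream F1 scores: three fine-tuning restarts per checkpoint.
example_scores = {
    100_000: [0.700, 0.706, 0.694],
    200_000: [0.712, 0.718, 0.709],
    300_000: [0.714, 0.711, 0.717],
    400_000: [0.713, 0.716, 0.712],
}
stop_step, stop_score = pick_stop_step(example_scores)
print(stop_step, round(stop_score, 3))
```

Averaging over restarts matters here because the spread between seeds in the made-up numbers is larger than the gain after step 200,000, so a single run per checkpoint could point at the wrong stopping step.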

@jacobdevlin-google You could try https://challenger.ai/competition/fsauor2018. Currently the BERT model only gives me some gains in an ensemble. Anyway, the great BERT code helped me a lot, thanks! There might be some errors in my usage, since I only used your modeling code and the published model. One thing is certain: using the published model gives a much better result than a randomly initialized transformer (a randomly initialized transformer is also much worse than randomly initialized RNN models). Another thing: although most docs have length <= 512, many are longer than 512. Since I used a GTX 1080 Ti (11 GB memory), I cut those to 512 and also used length buckets, so I lose some information and also some randomness compared with the RNN models, but I still think this should not hurt performance so much.

chenghuige on 10 Nov 2018

Ok thanks! I'll ask Ming-Wei (second author and also Chinese) to run it on this dataset or another Chinese dataset this week.

jacobdevlin-google on 10 Nov 2018

@jacobdevlin-google Great, looking forward to the results of BERT on this corpus.
You may also try this Chinese QA task: https://challenger.ai/competition/oqmrc2018

chenghuige on 10 Nov 2018

@jacobdevlin-google Another strange thing: when using BERT for sentiment classification, I get a validation loss lower than the training loss. Is it due to the dropout rate? I used the default parameters.

chenghuige on 11 Nov 2018

That's not typical for the downstream tasks; usually the validation loss is quite "bad" (even though the accuracy is high). You could try training more and see whether the validation accuracy goes up.

jacobdevlin-google on 11 Nov 2018

@jacobdevlin-google I kept training. Validation loss and training loss are both decreasing (validation F1 is increasing), but the validation loss is still lower than the training loss (e.g. validation loss 0.38, training loss 0.46), and both decrease more slowly than with RNN models (especially the training loss, which is very strange compared with RNN models, where it decreases much faster). I think the bad performance is related to the training loss not converging, or converging too slowly. Maybe I have some bugs, so I'd be glad to hear Ming-Wei's experiment results on this corpus.

chenghuige on 11 Nov 2018

@chenghuige Did you try a large learning rate?

xwzhong on 12 Nov 2018

@xwzhong Yes, in my experiments a large learning rate performs worse.

chenghuige on 12 Nov 2018

I also reimplemented the multiple-choice model for AI Challenger 2018 oqmrc based on BERT. I get 0.78 accuracy on the dev set, and the same on test.

lixinsu on 12 Nov 2018

@lixinsu Good result. So your single BERT model on MRC can rank in the top 10 on test set A? Cool.

chenghuige on 12 Nov 2018

@yyht Amazing... So I must have done something wrong ..

chenghuige on 12 Nov 2018

I find the result depends heavily on the hyperparameters used, especially the learning rate schedule (how many epochs for decay, how many steps for warm-up).

chenghuige on 12 Nov 2018

👀1 👍1
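
To make the warm-up and decay settings mentioned above concrete, here is a minimal pure-Python sketch of a schedule with linear warm-up followed by linear decay to zero. That this shape matches the schedule in the repository's optimizer code is an assumption on my part; the function name `learning_rate` and the step counts in the example are hypothetical.

```python
def learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warm-up from 0 towards init_lr, then linear decay to 0."""
    if num_warmup_steps > 0 and step < num_warmup_steps:
        # Warm-up phase: scale the base rate by the fraction of warm-up done.
        return init_lr * step / num_warmup_steps
    # Decay phase: rate shrinks linearly and is clamped at zero.
    remaining = max(0.0, 1.0 - step / num_train_steps)
    return init_lr * remaining


# Hypothetical run: 1,000 total steps, 10% of them warm-up.
for step in (0, 50, 100, 500, 1000):
    print(step, learning_rate(step, 2e-5, 1000, 100))
```

With this shape, changing the total number of steps or the warm-up length changes the learning rate at every step, which is one plausible reason the results are so sensitive to these settings.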

@chenghuige agree with you

xwzhong on 13 Nov 2018

Could you guys share a copy of the dataset from https://challenger.ai/competition/fsauor2018 or https://challenger.ai/competition/oqmrc2018? The dataset can no longer be downloaded.

jiqiujia on 13 Nov 2018

Link: https://pan.baidu.com/s/1MRiP1I-SGdUzhQrNctEq4A
Extraction code: jsmk
Enjoy it.

yyht on 14 Nov 2018

@yyht thanks!

jiqiujia on 14 Nov 2018

After more parameter tuning, I was able to get F1 70.08, and it performs well on some classes, but the overall performance is not as good as the RNN models, and it took more epochs to reach 70+: 20+ epochs, while the RNN model only needs 4-5. On this sentiment corpus, the BERT model looks very sensitive to the learning rate and other parameters, so I'm not sure what the best result BERT models can get on this specific corpus is. (I added BERT models to my ensemble and it helps a lot. I think BERT models should perform even better; maybe BERT models pre-trained on Chinese words would do better.)

chenghuige on 14 Nov 2018

👍6

Update on BERT performance on the AI Challenger 2018 sentiment dataset. Without buckets, I just set the max length to 512 (for docs longer than 512, I remove some parts in the middle; about 12% of the training data is longer than 512). Using batch size 6, or batch size 24 on 4 GPUs, with 3 epochs it gets F1 71.1 with cross-entropy loss 0.34 (I did not do any loss weighting or down-/up-sampling of the dataset). That is still not as good as LSTM with ELMo (F1 72, loss 0.322). BERT seems to overfit compared with the RNN (BERT's training loss is much lower than the RNN's, but its validation loss is not as good). If you get a better result using BERT on this corpus, any hints would be much appreciated :)

chenghuige on 8 Dec 2018

👍2
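
For readers who want to try the "remove some parts in the middle" truncation described above, one possible sketch is below. The comment does not say how the kept portion was split between the start and the end of the document, so the even head/tail split, the function name `truncate_middle`, and the `head_len` parameter are all assumptions for illustration.

```python
def truncate_middle(tokens, max_len=512, head_len=256):
    """Keep the first head_len and the last (max_len - head_len) tokens,
    dropping the middle of any sequence longer than max_len."""
    if not 0 <= head_len <= max_len:
        raise ValueError("head_len must be between 0 and max_len")
    if len(tokens) <= max_len:
        return list(tokens)  # short documents are left unchanged
    tail_len = max_len - head_len
    tail_start = len(tokens) - tail_len
    return list(tokens[:head_len]) + list(tokens[tail_start:])


# Hypothetical 600-token document, with integers standing in for token ids.
doc = list(range(600))
kept = truncate_middle(doc)
print(len(kept), kept[255], kept[256])
```

In practice the token budget would need to leave room for any special tokens the model adds, so `max_len` here would be slightly below the model's maximum sequence length.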

Hello, could you show me the configuration of your sentiment model, e.g. the learning rate?

CSLujunyu on 30 Apr 2019

👍1

I get a mean F1 score of 0.712 on the validation set (global step 1,155,000).
Settings: batch_size 2, max seq length 512 (for docs longer than 512, the last part is dropped), lr 2e-5. Because I set num_train_steps = 1000000000, which is very large, with warmup_proportion 0.1, training was always in warm-up (maybe a bug).
Update: after setting num_train_steps to 10 epochs, I got F1 0.7252485481231555.
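
The warm-up problem described above can be checked with a little arithmetic. The sketch below assumes that the warm-up length is num_train_steps times warmup_proportion and that the learning rate grows linearly during warm-up; the helper names `schedule_steps` and `warmup_scale` and the dataset size in the last line are hypothetical, and decay after warm-up is ignored.

```python
def schedule_steps(num_examples, batch_size, num_epochs, warmup_proportion):
    """Derive total and warm-up step counts from the dataset size."""
    num_train_steps = int(num_examples / batch_size * num_epochs)
    num_warmup_steps = int(num_train_steps * warmup_proportion)
    return num_train_steps, num_warmup_steps


def warmup_scale(step, num_warmup_steps):
    """Fraction of the base learning rate applied during linear warm-up."""
    if num_warmup_steps <= 0:
        return 1.0
    return min(1.0, step / num_warmup_steps)


# Values reported in the comment above.
base_lr = 2e-5
reported_step = 1_155_000
huge_warmup = int(1_000_000_000 * 0.1)  # warm-up length for the oversized run
effective_lr = base_lr * warmup_scale(reported_step, huge_warmup)
print(f"warm-up steps: {huge_warmup}, effective lr: {effective_lr:.3e}")

# Hypothetical dataset size, for illustration only.
print(schedule_steps(num_examples=100_000, batch_size=2, num_epochs=10,
                     warmup_proportion=0.1))
```

Under these assumptions, at global step 1,155,000 the oversized run is only about 1% of the way through warm-up, so the effective learning rate is roughly 2.3e-7 rather than 2e-5. Deriving num_train_steps from the number of epochs keeps the warm-up a small part of the run.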