BERT: For the NER task, do I need to add a CRF, or is softmax at the end of the model enough?

Created on 18 Jul 2019 · 12 Comments · Source: google-research/bert

I know that the original paper uses softmax at the end of the model, but I wonder whether using a CRF would improve performance. Thanks.
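
For readers weighing the two options, the difference can be sketched in a few lines of plain Python (an illustration only, not code from this thread or from the BERT repository). A softmax head picks the best tag at each token independently, while a CRF decodes with Viterbi over emission plus transition scores, so it can rule out invalid BIO sequences such as `O` followed by `I-PER`. The tag set, emission scores, and transition matrix below are made-up values; in a real CRF the transitions would be learned.

```python
# Illustrative sketch: per-token argmax ("softmax") vs. CRF-style Viterbi decoding.
# All scores are made-up numbers, not outputs of a real BERT model.

TAGS = ["O", "B-PER", "I-PER"]

def greedy_decode(emissions):
    """Pick the highest-scoring tag independently at every position."""
    return [TAGS[max(range(len(TAGS)), key=lambda k: row[k])] for row in emissions]

def viterbi_decode(emissions, transitions):
    """Find the tag sequence that maximizes emission + transition scores."""
    n_tags = len(TAGS)
    score = list(emissions[0])      # best score of any path ending in each tag
    backptrs = []                   # backptrs[t][cur] = best previous tag index
    for row in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda p: score[p] + transitions[p][cur])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][cur] + row[cur])
        score = new_score
        backptrs.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda k: score[k])
    path = [best]
    for pointers in reversed(backptrs):
        best = pointers[best]
        path.append(best)
    return [TAGS[k] for k in reversed(path)]

# transitions[prev][cur]; a large negative score forbids O -> I-PER (assumed constraint).
NEG = -1e4
transitions = [
    [0.0, 0.0, NEG],   # from O
    [0.0, 0.0, 0.0],   # from B-PER
    [0.0, 0.0, 0.0],   # from I-PER
]

# One row of tag scores per token, in the order of TAGS.
emissions = [
    [2.0, 0.5, 0.1],   # token 1: clearly O
    [0.2, 1.0, 1.2],   # token 2: I-PER narrowly beats B-PER locally
    [0.1, 0.3, 2.0],   # token 3: clearly I-PER
]

greedy = greedy_decode(emissions)
viterbi = viterbi_decode(emissions, transitions)
print(greedy)
print(viterbi)
```

With these numbers, greedy decoding yields `O, I-PER, I-PER` (an invalid sequence), while Viterbi yields `O, B-PER, I-PER`. Whether that kind of correction translates into a measurable F1 gain on top of BERT is exactly what the comments below discuss.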

RoderickGu

All 12 comments

A CRF should be better.

LiangYuHai on 20 Jul 2019

@LiangYuHai Thanks for your answer. Have you tried such a comparison before?

RoderickGu on 21 Jul 2019

CRF is the best model for NER tasks. I guess it will definitely improve the performance. I'm trying that as well.

geyingli on 22 Jul 2019

@geyingli I think a CRF is a good decoder for sequence tasks. But BERT is very powerful and already captures sequence information. Are you getting good results with the CRF now? Besides, I find that fine-tuning BERT on the NER task helps a lot.

RoderickGu on 23 Jul 2019

Hi! Could you share some details of your training, such as the learning rate, batch size, or other tricks?
I fine-tuned BERT on a Chinese NER dataset and didn't get better results than a traditional BiLSTM-CRF model, so it would be helpful if you could share some details. Thanks~

lrs1353281004 on 26 Jul 2019

There aren't many tricks to BERT fine-tuning. I can share some details if that helps: the batch size is 32, and the optimizer is SGD instead of Adam. The model is fine-tuned on the NER task.

RoderickGu on 26 Jul 2019

In my case, I got the best result (92.1 ~ 92.23 F1 score by conlleval) on the CoNLL (English) data with the following settings:

batch size : 16
learning rate : 2e-5
BERT model : large
optimizer : AdamWeightDecayOptimizer
- warmup : 2 epochs
- exponential decay : 2000 steps
hidden size of BiLSTM on top of the BERT layer : 200
CRF on top of the BiLSTM : used
BERT dropout : 0.1
other dropout : 0.1
data shuffle : used
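
The warmup and decay settings above could correspond to a schedule like the following sketch. It assumes linear warmup followed by exponential decay counted from the end of warmup; `warmup_steps=1000` and `decay_rate=0.9` are hypothetical values, since this comment gives only the base learning rate, a 2-epoch warmup, and a 2000-step decay interval.

```python
def learning_rate(step, base_lr=2e-5, warmup_steps=1000,
                  decay_steps=2000, decay_rate=0.9):
    """Linear warmup to base_lr, then smooth exponential decay.

    warmup_steps and decay_rate are assumed values for illustration;
    base_lr and decay_steps follow the settings listed in the comment.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr over the warmup period.
        return base_lr * step / warmup_steps
    # Multiply by decay_rate once per decay_steps after warmup ends.
    return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)


for s in (0, 500, 1000, 3000, 5000):
    print(s, learning_rate(s))
```

To express the warmup in epochs, `warmup_steps` would be set to twice the number of batches per epoch for the dataset in use.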

Without the CRF, the F1 scores are in the range 91.2 ~ 91.8.

But 92.23 is not the average, and it is still behind the score reported in the BERT paper.

I think ELMo + GloVe embeddings are more powerful for NER
(92.5 ~ 92.8 F1 score).
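
Since the scores above come from conlleval, which scores whole entity spans rather than individual tokens, here is a minimal sketch of entity-level F1 over BIO tags. It is a simplified approximation written for illustration, not the conlleval script itself; `extract_spans` and `entity_f1` are hypothetical helper names, and it handles a single sentence, whereas a corpus-level score would accumulate counts over all sentences.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))           # end index is exclusive
            start, etype = None, None
        # A B- tag always opens a span; a stray I- tag opens one too.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """F1 over exact span matches (same boundaries and same entity type)."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "O", "O", "B-LOC"]   # person span cut short
print(entity_f1(gold, pred))
```

In this example three of the four tokens are tagged correctly, but only one of the two gold entities is matched exactly, so the entity-level F1 is 0.5. That strictness is one reason a decoder that fixes span boundaries can move the score even when few tokens change.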

dsindex on 26 Jul 2019

@dsindex I noticed that you added an LSTM layer on top of BERT. Do you think it performs better than without the LSTM? Thanks.

RoderickGu on 25 Aug 2019

@RoderickGu

I think the difference is not that significant, but it is better to use it.
My experiments show that the LSTM gives a 0.1 ~ 0.2% gain over fine-tuned BERT alone.

dsindex on 25 Aug 2019

@dsindex Thanks for your suggestions

RoderickGu on 25 Aug 2019

For all my NER tasks, an LSTM on top of BERT consistently boosts performance.

anjani-dhrangadhariya on 28 Aug 2020

Those could be interesting results.

RoderickGu on 29 Aug 2020
