BERT: For an NER task, do I need to add a CRF, or is softmax at the end of the model enough?

Created on 18 Jul 2019 · 12 Comments · Source: google-research/bert

I know the original paper uses softmax at the end of the model, but I wonder whether using a CRF would improve performance. Thanks

All 12 comments

A CRF should be better.

@LiangYuHai Thanks for your answer. Have you tried such a comparison before?

I think a CRF is the best decoder for NER tasks, and I expect it to improve performance. I'm trying that as well.

@geyingli I think a CRF is a good decoder for sequence tasks. But BERT is very powerful and already encodes sequence information. Have you gotten good results with a CRF yet? Besides, I find that fine-tuning BERT on the NER task helps a lot.
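
To make the softmax-versus-CRF question concrete, here is a minimal sketch of the two decoding strategies on a toy example. Everything in it is an assumption for illustration: the tag set, the emission scores, and the hand-written transition penalties are invented, whereas a real CRF layer learns its transition scores during training. It only shows why a transition-aware decoder can avoid an invalid BIO sequence that per-token argmax produces.

```python
# Toy comparison of per-token softmax (argmax) decoding and CRF-style Viterbi decoding.
# All scores are invented for illustration; they do not come from a trained model.

TAGS = ["O", "B-PER", "I-PER"]
NEG = -1e4  # large penalty standing in for a learned "forbidden" transition


def argmax_decode(emissions):
    """Pick the best tag at each position independently (softmax-style decoding)."""
    return [TAGS[max(range(len(TAGS)), key=lambda t: row[t])] for row in emissions]


def viterbi_decode(emissions, transitions, start):
    """Return the highest-scoring tag sequence under emission + transition scores."""
    n_tags = len(TAGS)
    # score[t] = best score of any path that ends in tag t at the current position
    score = [start[t] + emissions[0][t] for t in range(n_tags)]
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n_tags):
            prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][cur])
            new_score.append(score[prev] + transitions[prev][cur] + row[cur])
            pointers.append(prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda t: score[t])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    return [TAGS[t] for t in reversed(path)]


# transitions[prev][cur]; O -> I-PER is illegal in the BIO scheme.
TRANSITIONS = [
    [0.0, 0.0, NEG],  # from O
    [0.0, 0.0, 0.0],  # from B-PER
    [0.0, 0.0, 0.0],  # from I-PER
]
START = [0.0, 0.0, NEG]  # a sequence cannot start with I-PER

# Emission scores for three tokens; token 1 slightly prefers I-PER over B-PER.
EMISSIONS = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.2],
    [0.1, 0.3, 1.5],
]

softmax_tags = argmax_decode(EMISSIONS)                   # independent per-token choice
crf_tags = viterbi_decode(EMISSIONS, TRANSITIONS, START)  # jointly best sequence
print(softmax_tags)
print(crf_tags)
```

On this toy input, argmax decoding emits an `I-PER` directly after `O`, while Viterbi decoding switches the second token to `B-PER`. Whether this matters in practice depends on how often the fine-tuned encoder makes such boundary errors, which is exactly what the thread is trying to measure.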

Hi! Could you share some details of your training, such as learning rate, batch size, or other tricks?
I fine-tuned BERT on a Chinese NER dataset and didn't get better results than a traditional BiLSTM-CRF model. It would be helpful if you could share some details. Thanks~

There aren't many tricks to BERT fine-tuning. I can share some details if that helps: the batch size is 32, and the optimizer is SGD instead of Adam. The model is fine-tuned on the NER task.

In my case, I got the best result (92.1 ~ 92.23 F1 score by conlleval) on the CoNLL (English) data with:

  1. batch size : 16
  2. learning rate : 2.00e-05
  3. BERT model : large
  4. optimizer : AdamWeightDecayOptimizer
    • warmup : 2 epochs
    • exponential decay : 2000 steps
  5. hidden size of BiLSTM on top of the BERT layer : 200
  6. CRF on top of BiLSTM : used
  7. BERT dropout : 0.1
  8. other dropout : 0.1
  9. data shuffle : used
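
The warmup and decay settings in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the list does not say whether warmup is linear, what the decay rate is, whether decay starts after warmup, or how many training examples there are, so `DECAY_RATE`, `NUM_TRAIN_EXAMPLES`, and the linear warmup shape are all hypothetical placeholders.

```python
import math

BATCH_SIZE = 16        # from the settings above
BASE_LR = 2e-5         # from the settings above
WARMUP_EPOCHS = 2      # from the settings above
DECAY_STEPS = 2000     # from the settings above
DECAY_RATE = 0.9            # ASSUMPTION: the decay rate is not given in the thread
NUM_TRAIN_EXAMPLES = 14000  # ASSUMPTION: placeholder training-set size


def warmup_steps(num_examples=NUM_TRAIN_EXAMPLES, batch_size=BATCH_SIZE,
                 epochs=WARMUP_EPOCHS):
    """Convert a warmup length given in epochs into optimizer steps."""
    return epochs * math.ceil(num_examples / batch_size)


def learning_rate(step, n_warmup=None):
    """Linear warmup to BASE_LR, then exponential decay (assumed to start after warmup)."""
    if n_warmup is None:
        n_warmup = warmup_steps()
    if step < n_warmup:
        # ASSUMPTION: warmup is linear in the step count
        return BASE_LR * step / n_warmup
    # The rate is multiplied by DECAY_RATE once per DECAY_STEPS steps (continuous form)
    return BASE_LR * DECAY_RATE ** ((step - n_warmup) / DECAY_STEPS)


for s in (0, 875, 1750, 3750):
    print(s, learning_rate(s))
```

With these placeholder values the rate climbs linearly for 1750 steps, peaks at 2e-5, and then shrinks by a factor of 0.9 every 2000 steps.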

Without the CRF, the F1 scores are in the range 91.2 ~ 91.8.

But 92.23 is not the average, and it is still behind the score reported in the BERT paper.

I think ELMo + GloVe embeddings are more powerful for NER
(92.5 ~ 92.8 F1 score).
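
Since the scores above are entity-level F1 values from conlleval, a short sketch of that kind of metric may help when comparing numbers across setups. This is a simplified approximation, not the conlleval script itself: it assumes BIO tags, treats an `I-` tag with no open entity of the same type as the start of a new entity, and counts a prediction as correct only when both the span boundaries and the type match. The function names and the example sentences are hypothetical.

```python
def extract_spans(tags):
    """Return a set of (start, end_exclusive, type) entity spans from BIO tags."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != ent_type
        )
        if start is not None and (tag == "O" or starts_new):
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if starts_new:
            start, ent_type = i, tag[2:]
    return spans


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged precision, recall, and F1 over exact-match entity spans."""
    n_correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
        n_correct += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: the person span matches, the last entity has the wrong type.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
print(entity_f1(gold, pred))
```

Because a wrongly typed or wrongly bounded entity counts as both a false positive and a false negative, small differences such as 91.8 versus 92.2 can come from a handful of boundary errors, which is where a CRF or BiLSTM layer would be expected to help, if at all.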

@dsindex I noticed that you added an LSTM layer on top of BERT. Do you think it performs better than without the LSTM? Thanks

@RoderickGu

I think the difference is not that significant, but it is better to use it.
My experiments show that the LSTM gives a 0.1 ~ 0.2% gain over fine-tuned BERT alone.

@dsindex Thanks for your suggestions

For all my NER tasks, an LSTM on top of BERT consistently boosts performance.

Those could be interesting results.
