Is a CRF needed for NER? In BertForTokenClassification, only a Linear layer is used to predict the tags. If a CRF is not needed, why not?
A CRF improves NER F1 scores in some cases, but not in all. The BERT paper does not use a CRF, so this repository does not include one either. I'd presume the BERT authors tested both with and without a CRF and found that the CRF layer gave no improvement, since using a CRF is pretty much the default setting for NER nowadays.
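To make the difference concrete, here is a rough sketch in plain Python of what a CRF layer adds at decoding time. A Linear head picks the best tag for each token independently, while a CRF also scores tag-to-tag transitions and decodes the whole sequence jointly with Viterbi. The tag set, the emission scores, and the transition matrix below are all made up for illustration. In a real CRF the transitions are learned during training, and none of this is code from the repository.

```python
# Toy comparison of two decoding strategies for token classification.
# All scores are hypothetical and hand-picked for illustration.

TAGS = ["O", "B-PER", "I-PER"]


def argmax_decode(emissions):
    """Pick the highest-scoring tag for each token independently
    (what a plain Linear head amounts to at inference time)."""
    return [max(range(len(row)), key=row.__getitem__) for row in emissions]


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence under emission plus
    transition scores. transitions[i][j] scores moving from tag i to tag j."""
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for row in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag to arrive at tag j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + row[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    best = max(range(n_tags), key=score.__getitem__)
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical per-token scores; columns follow TAGS (O, B-PER, I-PER).
emissions = [
    [2.0, 0.5, 0.1],
    [0.4, 0.3, 1.0],  # I-PER wins locally, although it cannot follow O in BIO
    [0.2, 0.1, 1.5],
]

# Hand-set transitions: only O -> I-PER is heavily penalized.
FORBIDDEN = -100.0
transitions = [
    [0.0, 0.0, FORBIDDEN],  # from O
    [0.0, 0.0, 0.0],        # from B-PER
    [0.0, 0.0, 0.0],        # from I-PER
]

print([TAGS[i] for i in argmax_decode(emissions)])
print([TAGS[i] for i in viterbi_decode(emissions, transitions)])
```

With these toy numbers, independent argmax yields O, I-PER, I-PER, which is not a valid BIO sequence, while Viterbi yields O, B-PER, I-PER. Whether this kind of constraint actually improves F1 on top of BERT is the open question here: if the encoder's per-token predictions are already consistent, the transition scores have little left to fix.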
Issue #64 is a good reference for discussion on NER.