I came across an unexpected NER tagging result where a named entity starts with the I-MISC tag.
I had assumed named entities can only start with B (beginning) or S (single token).
The pre-tokenized raw sentence:
Laurel : Walter Asonevich , president of Pennsylvania Highlands Community College , recently was presented with the Boy Scouts of America Laurel Highlands Council 2018 Distinguished Citizen Award .
The tagged result:
Laurel <S-LOC> : Walter <B-PER> Asonevich <E-PER> , president of Pennsylvania <B-ORG> Highlands <I-ORG> Community <I-ORG> College <E-ORG> , recently was presented with the Boy <B-ORG> Scouts <I-ORG> of <I-ORG> America <I-ORG> Laurel <I-ORG> Highlands <I-ORG> Council <E-ORG> 2018 Distinguished Citizen <I-MISC> Award <E-MISC> .
The unexpected results / potential bug
Note the last two tags: Citizen Award gets recognised as a named entity - but the sequence starts with I-MISC (after the previous named entity has ended on E-ORG). The I- prefix should be reserved for inner tokens, or am I mistaken?
This messed up my tag parser since I assumed named entities HAVE to start with B or S type tags.
Did I make a mistake?
Added information:
Thanks for reporting this. This seems strange - I'd like to take a closer look. Are you using the master branch or the latest pip version? How often does such an error occur in, say, 1000 sentences?
@alanakbik It is a rare occurrence (maybe <1 in 1000 sentences). That's why it was so surprising, and why it took me hours to identify as the root cause.
I used latest pip version.
Please assume I am an absolute noob. But given the sentence, you should be able to replicate the error.
Ah ok! Such errors can happen in hopefully very rare cases because the B and S logic is not hard-coded in the decoder. The CRF learns this logic from the training data, but there may be sentences where it still believes that an I- tag following an untagged word is most plausible given the model.
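To make the "B and S logic" concrete, here is a minimal sketch of what a strict BIOES check could look like. It is not Flair's decoder or any part of its API; the helper names (`split_tag`, `is_valid_transition`, `find_violations`) are hypothetical, and the sketch assumes untagged tokens are represented as `"O"` or an empty string. It only flags tags that are not allowed after the preceding tag, and it does not check for an entity left open at the end of the sentence.

```python
def split_tag(tag):
    """Split 'B-ORG' into ('B', 'ORG'); untagged tokens ('' or 'O') become ('O', None)."""
    if not tag or tag == "O":
        return "O", None
    prefix, _, label = tag.partition("-")
    return prefix, label


def is_valid_transition(prev_tag, curr_tag):
    """Return True if curr_tag may follow prev_tag under strict BIOES."""
    prev_prefix, prev_label = split_tag(prev_tag)
    curr_prefix, curr_label = split_tag(curr_tag)
    # An entity is still open after a B- or I- tag.
    prev_open = prev_prefix in ("B", "I")
    if curr_prefix in ("I", "E"):
        # Inner and end tags must continue an open entity of the same type.
        return prev_open and prev_label == curr_label
    # O, B- and S- may only appear when no entity is open.
    return not prev_open


def find_violations(tags):
    """Return the indices whose tag is not allowed after the preceding tag."""
    violations = []
    prev = "O"  # the sentence start behaves like an untagged token
    for i, tag in enumerate(tags):
        if not is_valid_transition(prev, tag):
            violations.append(i)
        prev = tag
    return violations


# Tail of the sentence from this issue: "... Highlands Council 2018 Distinguished Citizen Award ."
tags = ["I-ORG", "E-ORG", "O", "O", "I-MISC", "E-MISC", "O"]
print(find_violations(["B-ORG"] + tags))
```

With a `B-ORG` prepended so the organisation span is well-formed, the only flagged position is the `I-MISC` on "Citizen". A decoder that hard-coded these rules would forbid exactly such transitions instead of learning them from data.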
We'll look into this some more and see if hard-coding BIO or BIOES logic makes a difference!
@blythed: any thoughts?
It is a rather annoying problem to fix for me when I parse tags. So, I would prefer a solution within Flair. It might even increase your F1 score (if your training data contains token-wise labels, like I-MISC).
I would have to check if an "I" tag or "E" tag is used before a "B" tag has been used. I would probably have to conduct further checks, especially when the previous token carries a non-empty NE tag. In the latter case, it may be impossible to understand where a NE starts and where it ends.
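As a stopgap on the parsing side, the checks described above can be sketched as a tolerant span extractor. This is only an illustration under stated assumptions: `extract_spans` is a hypothetical helper, not a Flair function, and it assumes tags arrive as plain strings such as `"I-MISC"`, with `"O"` or an empty string for untagged tokens. An I- or E- tag that does not continue an open span of the same type is simply treated as the start of a new span.

```python
def extract_spans(tokens, tags):
    """Group BIOES-tagged tokens into (text, label) spans, tolerating a missing B- tag."""
    spans = []
    current_tokens, current_label = [], None

    def close():
        # Flush the span that is currently being built, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if not tag or tag == "O":
            close()
            continue
        prefix, _, label = tag.partition("-")
        # Start a new span on B-/S-, or when an I-/E- tag does not
        # continue the open span (no open span, or a different label).
        if prefix in ("B", "S") or label != current_label:
            close()
            current_label = label
        current_tokens.append(token)
        if prefix in ("S", "E"):
            close()
    close()  # flush a span left open at the end of the sentence
    return spans


tokens = "Council 2018 Distinguished Citizen Award .".split()
tags = ["E-ORG", "O", "O", "I-MISC", "E-MISC", "O"]
print(extract_spans(tokens, tags))
```

The ambiguous case mentioned above remains: if the stray I- tag directly follows an unfinished span with the same label, this sketch merges the two, because the tags alone do not say where one entity ends and the next begins.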
Yes, I think we can address this in multiple ways. For the next release, we want to add a convenience method to extract entity spans from text so that you do not have to interpret the B-, I-, E- tags yourself, essentially like you suggested in #54, and also required for us to implement #75. Will be added soon! :)
in release -0.3