Flair: Named entity starts with "I" tag

Created on 29 Aug 2018 · 6 Comments · Source: flairNLP/flair

I came across an unexpected NER tagging result where a named entity starts with the I-MISC tag.
I had assumed named entities can only start with B (beginning) or S (single token).

The pre-tokenized raw sentence:
Laurel : Walter Asonevich , president of Pennsylvania Highlands Community College , recently was presented with the Boy Scouts of America Laurel Highlands Council 2018 Distinguished Citizen Award .

The tagged result:
Laurel <S-LOC> : Walter <B-PER> Asonevich <E-PER> , president of Pennsylvania <B-ORG> Highlands <I-ORG> Community <I-ORG> College <E-ORG> , recently was presented with the Boy <B-ORG> Scouts <I-ORG> of <I-ORG> America <I-ORG> Laurel <I-ORG> Highlands <I-ORG> Council <E-ORG> 2018 Distinguished Citizen <I-MISC> Award <E-MISC> .

The unexpected results / potential bug
Note the last two tags: "Citizen Award" gets recognised as a named entity, but the sequence starts with I-MISC (after the previous named entity has ended on E-ORG). The I- tag should be reserved for inner tokens, or am I mistaken?

This messed up my tag parser since I assumed named entities HAVE to start with B or S type tags.
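To illustrate the assumption, here is a minimal sketch of a check that flags I- or E- tags which do not continue an open entity. The function names `split_tag` and `find_illegal_starts` are hypothetical helpers written for this example and are not part of Flair.

```python
def split_tag(tag):
    """Split 'B-PER' into ('B', 'PER'); untagged tokens become ('O', None)."""
    if tag in ("", "O"):
        return "O", None
    prefix, _, label = tag.partition("-")
    return prefix, label


def find_illegal_starts(tags):
    """Return indices where an I- or E- tag does not continue an open entity."""
    bad = []
    prev_prefix, prev_label = "O", None
    for i, tag in enumerate(tags):
        prefix, label = split_tag(tag)
        if prefix in ("I", "E"):
            # An inner or end tag is only valid right after B-/I- of the same type.
            continues = prev_prefix in ("B", "I") and prev_label == label
            if not continues:
                bad.append(i)
        prev_prefix, prev_label = prefix, label
    return bad


# Tags for "Boy Scouts of America Laurel Highlands Council 2018
# Distinguished Citizen Award ." taken from the tagged result above.
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG",
        "O", "O", "I-MISC", "E-MISC", "O"]
print(find_illegal_starts(tags))  # [9], the position of "Citizen"
```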

Did I make a mistake?

Added information:

basic model: 'ner'

release-0.3

pwichmann

Most helpful comment

Yes, I think we can address this in multiple ways. For the next release, we want to add a convenience method to extract entity spans from text so that you do not have to interpret the B-, I-, E- tags yourself, essentially like you suggested in #54, and also required for us to implement #75. Will be added soon! :)

alanakbik on 3 Sep 2018

❤2 🎉2 👍2

All 6 comments

Thanks for reporting this. This seems strange - I'd like to take a closer look. Are you using the master branch or the latest pip version? How often does such an error occur in, say, 1000 sentences?

alanakbik on 29 Aug 2018

👍1

@alanakbik It is a rare occurrence (maybe <1 in 1000 sentences). That's why it was so surprising and took me hours to identify as the root cause.

I used the latest pip version.

Please assume I am an absolute noob. But given the sentence, you should be able to replicate the error.

pwichmann on 29 Aug 2018

Ah ok! Such errors can happen in hopefully very rare cases because the B and S logic is not hard-coded in the decoder. The CRF learns this logic from the training data, but there may be sentences where it still believes that an I- tag following an untagged word is most plausible given the model.

We'll look into this some more and see if hard-coding BIO or BIOES logic makes a difference!
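As a rough sketch of what hard-coding BIOES logic could mean: a decoder could consult a table of allowed tag transitions and give forbidden ones a very low score before running Viterbi. The code below only builds such a table; it is not Flair's decoder, and `is_allowed` and `transition_mask` are hypothetical names chosen for this example.

```python
def split_tag(tag):
    """Split 'B-PER' into ('B', 'PER'); 'O' becomes ('O', None)."""
    if tag == "O":
        return "O", None
    prefix, _, label = tag.partition("-")
    return prefix, label


def is_allowed(prev_tag, curr_tag):
    """Strict BIOES rule for a transition prev_tag -> curr_tag.

    After B-X or I-X the entity is still open, so only I-X or E-X may follow.
    After O, E-X or S-X nothing is open, so I- and E- tags are forbidden.
    Use "O" as prev_tag for the first token of a sentence.
    """
    prev_prefix, prev_label = split_tag(prev_tag)
    curr_prefix, curr_label = split_tag(curr_tag)
    prev_open = prev_prefix in ("B", "I")
    curr_continues = curr_prefix in ("I", "E")
    if prev_open:
        return curr_continues and prev_label == curr_label
    return not curr_continues


def transition_mask(tagset):
    """Map every (prev, curr) tag pair to True (allowed) or False (forbidden)."""
    return {(p, c): is_allowed(p, c) for p in tagset for c in tagset}


tagset = ["O", "B-MISC", "I-MISC", "E-MISC", "S-MISC",
          "B-ORG", "I-ORG", "E-ORG", "S-ORG"]
mask = transition_mask(tagset)
print(mask[("O", "I-MISC")])      # False: the transition reported in this issue
print(mask[("B-ORG", "E-ORG")])   # True
```

With such a mask the sequence "O, I-MISC" could never be decoded, at the cost of forcing the model to pick the best well-formed alternative instead.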

@blythed: any thoughts?

alanakbik on 29 Aug 2018

This is a rather annoying problem for me to fix when parsing tags, so I would prefer a solution within Flair. It might even increase your F1 score (if your training data contains token-wise labels, like I-MISC).

I would have to check if an "I" tag or "E" tag is used before a "B" tag has been used. I would probably have to conduct further checks, especially when the previous token carries a non-empty NE tag. In the latter case, it may be impossible to understand where a NE starts and where it ends.
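A minimal sketch of such a tolerant parser, assuming one tag string per token: it treats an I- or E- tag with no open entity of the same type as the start of a new span. `extract_spans` is a hypothetical helper for illustration, not a Flair method.

```python
def extract_spans(tokens, tags):
    """Return (text, label) spans, tolerating I-/E- tags without a preceding B-."""
    spans = []
    current, current_label = [], None

    def flush():
        # Close the span that is currently being collected, if any.
        nonlocal current, current_label
        if current:
            spans.append((" ".join(current), current_label))
        current, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix not in ("B", "I", "E", "S"):  # "O" or an empty tag
            flush()
            continue
        # B-/S- always start a new span; so does any change of entity type.
        if prefix in ("B", "S") or label != current_label:
            flush()
        current.append(token)
        current_label = label
        if prefix in ("E", "S"):
            flush()
    flush()
    return spans


tokens = ["Boy", "Scouts", "of", "America", "Laurel", "Highlands", "Council",
          "2018", "Distinguished", "Citizen", "Award", "."]
tags = ["B-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "I-ORG", "E-ORG",
        "O", "O", "I-MISC", "E-MISC", "O"]
print(extract_spans(tokens, tags))
# [('Boy Scouts of America Laurel Highlands Council', 'ORG'), ('Citizen Award', 'MISC')]
```

This does not resolve the ambiguous case described above: if two entities of the same type are adjacent and the second one starts with I- directly after another I- tag, the sketch merges them into one span.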

pwichmann on 29 Aug 2018


in release -0.3