Flair: Named entity starts with "I" tag

Created on 29 Aug 2018 · 6 comments · Source: flairNLP/flair

I came across an unexpected NER tagging result where a named entity starts with the I-MISC tag.
I had assumed named entities can only start with B (beginning) or S (single token).

The pre-tokenized raw sentence:
Laurel : Walter Asonevich , president of Pennsylvania Highlands Community College , recently was presented with the Boy Scouts of America Laurel Highlands Council 2018 Distinguished Citizen Award .

The tagged result:
Laurel <S-LOC> : Walter <B-PER> Asonevich <E-PER> , president of Pennsylvania <B-ORG> Highlands <I-ORG> Community <I-ORG> College <E-ORG> , recently was presented with the Boy <B-ORG> Scouts <I-ORG> of <I-ORG> America <I-ORG> Laurel <I-ORG> Highlands <I-ORG> Council <E-ORG> 2018 Distinguished Citizen <I-MISC> Award <E-MISC> .

The unexpected results / potential bug
Note the last two tags: "Citizen Award" is recognised as a named entity, but the sequence starts with I-MISC (after the previous entity ended on E-ORG). Shouldn't I- be reserved for inner tokens, or am I mistaken?

This broke my tag parser, since I had assumed named entities have to start with B- or S-type tags.

Did I make a mistake?

Added information:

  • basic model: 'ner'
  • release-0.3
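For reference, a lenient span parser can tolerate such sequences. Below is a minimal sketch (not Flair's own API; the function name is hypothetical) that treats a stray I- or E- tag after O, or after a different entity type, as the start of a new span rather than raising an error:

```python
def extract_spans(tags):
    """Return (start, end_exclusive, type) triples from a BIOES tag list.

    Lenient: an I- or E- tag with no open span of the same type is
    treated as a new span start, as in the "Citizen Award" example.
    Tags are "O" or "<PREFIX>-<TYPE>" with PREFIX in {B, I, E, S}.
    """
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            prefix, etype = "O", None
        else:
            prefix, etype = tag.split("-", 1)
        # Close an open span if the type changes or a fresh span begins.
        if start is not None and (etype != current_type or prefix in ("B", "S")):
            spans.append((start, i, current_type))
            start, current_type = None, None
        # Open a span; leniently allow I-/E- to open one.
        if prefix in ("B", "S") or (prefix in ("I", "E") and start is None):
            start, current_type = i, etype
        # S- and E- tags close the span at this token.
        if prefix in ("S", "E") and start is not None:
            spans.append((start, i + 1, current_type))
            start, current_type = None, None
    if start is not None:  # span still open at sentence end
        spans.append((start, len(tags), current_type))
    return spans
```

On the problematic tail of the example sentence, this yields the "Citizen Award" span even though it begins with I-MISC.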


All 6 comments

Thanks for reporting this. This seems strange - I'd like to take a closer look. Are you using the master branch or the latest pip version? How often does such an error occur in, say, 1000 sentences?

@alanakbik It is a rare occurrence (maybe <1 in 1000 sentences). That's why it was so surprising and took me hours to find as the root cause.

I used the latest pip version.

Please assume I am an absolute noob. But given the sentence, you should be able to replicate the error.

Ah ok! Such errors can happen in hopefully very rare cases because the B- and S- logic is not hard-coded into the decoder. The CRF learns this logic from the training data, but there may be sentences for which it still considers an I- tag following an untagged word the most plausible option given the model.

We'll look into this some more and see if hard-coding BIO or BIOES logic makes a difference!
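"Hard-coding BIOES logic" could, for instance, mean masking out illegal tag transitions during Viterbi decoding. A minimal sketch (hypothetical helper, not Flair's decoder) of the legality check such a mask would encode:

```python
def transition_allowed(prev_tag, next_tag):
    """True iff prev_tag -> next_tag is a legal BIOES transition."""
    p = prev_tag.split("-", 1)
    n = next_tag.split("-", 1)
    if p[0] in ("O", "S", "E"):
        # Outside a span (or just after one), only O, B-, or S- may follow.
        return n[0] in ("O", "B", "S")
    if p[0] in ("B", "I"):
        # Inside a span, it may only continue (I-) or end (E-), same type.
        return n[0] in ("I", "E") and n[1] == p[1]
    return False
```

A decoder constrained by this check could never emit the O → I-MISC transition seen in the example above.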

@blythed: any thoughts?

This is a rather annoying problem for me to handle when parsing tags, so I would prefer a solution within Flair. It might even increase your F1 score (if your training data contains token-wise labels like I-MISC).

I would have to check whether an "I" or "E" tag is used before any "B" tag has appeared. Further checks would probably be needed, especially when the previous token carries a non-empty NE tag; in that case it may be impossible to tell where one NE ends and the next begins.
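The check described above can be sketched as follows (a hypothetical helper, not part of Flair): flag every position where an I- or E- tag appears without a matching open B- span of the same type, so the parser can fall back to a lenient interpretation there.

```python
def find_invalid_starts(tags):
    """Return indices where an I-/E- tag starts a span illegally."""
    invalid = []
    open_type = None  # type of the currently open B-/I- span, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            open_type = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "B":
            open_type = etype
        elif prefix == "S":
            open_type = None
        elif prefix in ("I", "E"):
            if open_type != etype:  # no matching B- span is open
                invalid.append(i)
                open_type = etype if prefix == "I" else None
            elif prefix == "E":
                open_type = None
    return invalid
```

Note this only flags the first kind of problem; as the comment above says, when the previous token already carries a different non-empty tag, the intended span boundaries may be genuinely ambiguous.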

Yes, I think we can address this in multiple ways. For the next release, we want to add a convenience method to extract entity spans from text so that you do not have to interpret the B-, I-, E- tags yourself, essentially like you suggested in #54, and also required for us to implement #75. Will be added soon! :)

in release-0.3

