Spacy: Error should be raised if you try to set conflicting doc.ents

Created on 16 Jul 2018 · 11 Comments · Source: explosion/spaCy

How to reproduce the behaviour

import spacy
from spacy.tokens import Span
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Louisiana Office of Conservation")
print(doc.ents[0], doc.ents[0].label_)

entity = Span(doc, 0, 1, label=391L)
print(entity, entity.label_)

doc.ents = list(doc.ents) + [entity]

print(doc.ents[0], doc.ents[0].label_)

Output:

(Louisiana Office of Conservation, u'ORG')
(Louisiana, u'MONEY')
(Louisiana Office of Conservation, u'MONEY')

Expected Behaviour:

I was expecting the MONEY entity to override the ORG entity, with the MONEY span's end offset used after merging. Please let me know if I got this wrong.

Your Environment

Operating System: Windows 7
Python Version Used: 2.7.12
spaCy Version Used: 2.0.7
Environment Information: None

bug feat / doc feat / ner

chandan2495

All 11 comments

Thanks a lot for the great and detailed bug report 👍 I just tried the example and I can confirm that it's also reproducible in v2.1.0 nightly. The current behaviour is definitely a bug.

In terms of the expected behaviour: I suggest that spaCy should actually raise an error here. By definition, a token can only be part of one entity. So when the doc.ents are set and an overlapping entity exists, there's no clear answer for how this should be resolved. Which entity should take precedence?

So it's probably more intuitive and logical if spaCy raised with a message like: "Trying to set conflicting doc.ents: (0, 4, 'ORG') and (0, 1, 'MONEY'). A token can only be part of one entity, so make sure the entities you're setting don't overlap."

ines on 16 Jul 2018

Thanks for the quick reply, Ines. I think the new span could take precedence over the existing one, as this would help users create new derived entities.

chandan2495 on 17 Jul 2018
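
The "new span takes precedence" idea above could be sketched roughly as follows. This is only an illustration on plain (start, end, label) tuples with token offsets and an exclusive end, mirroring Span.start and Span.end; the helper name add_with_precedence is hypothetical and not part of spaCy, and converting the result back into Span objects for doc.ents is left out.

```python
def add_with_precedence(existing, new):
    """Return the entity list with `new` added, dropping any existing
    span that shares at least one token with it.

    Spans are (start, end, label) tuples; `end` is exclusive.
    Hypothetical helper for illustration, not a spaCy API.
    """
    new_start, new_end, _ = new
    kept = [
        (start, end, label)
        for (start, end, label) in existing
        if end <= new_start or start >= new_end  # no shared token
    ]
    # Keep the result ordered by position in the document
    return sorted(kept + [new])


# The case from the report: ORG covers tokens 0-3, MONEY covers token 0
ents = [(0, 4, "ORG")]
print(add_with_precedence(ents, (0, 1, "MONEY")))
# [(0, 1, 'MONEY')]
```

With this rule the older ORG span is dropped entirely rather than shrunk, which is one possible reading of "override"; trimming the older span instead would be a different design choice.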

This is fixed now in spacy-nightly 🎉. See also #2566.

honnibal on 20 Dec 2018

👎2

I am actually finding that raising an error here causes more problems for me than it resolves. I have many cases where I have, for instance, both "Jon" and "Jon Smith" as NAME entities. I do want both tagged as a NAME. In such cases I would prefer to tag whichever entity is longer, as it is likely to be more specific, rather than raising an error. There are many other instances as well, and I do not want to change my setup to exclude overlapping entities.

isaacmg on 21 Dec 2018

I would like spaCy to preserve the longer entity as well.

@isaacmg
"Jon" and "Jon Smith" cannot both be tagged as NAME, since in the first case 'Jon' gets a 'U' tag, whereas in the second it gets a 'B'. The model is bound to be confused.

One way, as already mentioned, is to take care of the overlaps.

Also, there are scenarios where the same token / char span is tagged with two entities. I guess that raises an error now; I haven't tested it.

rbhambriiit on 25 Dec 2018

👍1

"This is fixed now in spacy-nightly. See also #2566."

I still get an error with this code even after installing spacy-nightly:

ValueError: [E103] Trying to set conflicting doc.ents: '(9, 11, u'ANIMAL')' and '(10, 11, u'ANIMAL')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

arezae on 16 Jan 2019

@honnibal I would like an answer about how to override this error at the very least, as this issue is currently blocking me from using the updated version of spaCy. I have tried a variety of solutions, but nothing works (note that for me, eliminating overlapping entities is not an option). I cannot find any easy way to subclass Doc in order to change the line of code that this PR implemented. What is the best way to override this behavior, short of maintaining my own separate spaCy branch?

Thanks.

isaacmg on 17 Jan 2019

👍1

Is there a way to handle the error here?
Indeed, having the entity with the maximum length as the default would be great.

Is there a way to get the ID of the entity causing the error, so the situation can be handled?

rbrtdambrosio on 11 Feb 2019

👍2
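
On finding which entities cause the error: one option is to check the candidate spans for overlaps before assigning them to doc.ents, instead of parsing the E103 message afterwards. The sketch below is an assumption-laden illustration on (start, end, label) tuples with token offsets and an exclusive end; find_conflicts is a hypothetical helper, not something spaCy provides.

```python
def find_conflicts(spans):
    """Return every pair of spans that share at least one token.

    Spans are (start, end, label) tuples; `end` is exclusive.
    Hypothetical helper for illustration, not a spaCy API.
    """
    ordered = sorted(spans)
    conflicts = []
    for i, span_a in enumerate(ordered):
        end_a = span_a[1]
        for span_b in ordered[i + 1:]:
            if span_b[0] >= end_a:
                # Spans are sorted by start, so no later span can
                # overlap span_a either
                break
            conflicts.append((span_a, span_b))
    return conflicts


# The spans from the E103 message quoted earlier, plus an unrelated one
spans = [(9, 11, "ANIMAL"), (10, 11, "ANIMAL"), (0, 2, "PERSON")]
print(find_conflicts(spans))
# [((9, 11, 'ANIMAL'), (10, 11, 'ANIMAL'))]
```

Each reported pair can then be resolved by whatever rule fits the application (keep the longer one, keep the newer one, and so on) before the entities are set.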

I am also running into this as a problem. I have a PhraseMatcher finding entities on top of the NER pipeline, but I would like to be able to link both entities. What's the suggested workaround?

phdowling on 27 Feb 2019

Can someone provide sample code illustrating how to make the longer and/or newer span take precedence? @rbhambriiit, it sounds like you might know how to do this?
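
A minimal sketch of the "longer span wins" rule requested in this thread, assuming spans are represented as (start, end, label) tuples with token offsets and an exclusive end. The function name keep_longest is hypothetical. Later spaCy releases appear to ship a helper with similar longest-first behaviour for Span objects, spacy.util.filter_spans, so it is worth checking the API docs for the installed version before rolling your own.

```python
def keep_longest(spans):
    """Greedy filter: prefer longer spans, then earlier ones, and drop
    any span that overlaps a span that was already kept.

    Spans are (start, end, label) tuples; `end` is exclusive.
    Hypothetical helper for illustration, not a spaCy API.
    """
    # Longest first; ties broken by earliest start
    by_priority = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    taken = set()  # token indices already covered by a kept span
    kept = []
    for start, end, label in by_priority:
        tokens = range(start, end)
        if any(i in taken for i in tokens):
            continue  # overlaps a longer (or earlier) span
        taken.update(tokens)
        kept.append((start, end, label))
    # Return in document order
    return sorted(kept)


# "Jon" (token 0) and "Jon Smith" (tokens 0-1) both proposed as NAME
candidates = [(0, 1, "NAME"), (0, 2, "NAME"), (5, 6, "ORG")]
print(keep_longest(candidates))
# [(0, 2, 'NAME'), (5, 6, 'ORG')]
```

The filtered tuples no longer overlap, so turning them back into Span objects and assigning them to doc.ents should not trigger the conflicting-entities error. To make the newer span win instead, the sort key could be replaced by the order in which the spans were added.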