import spacy
from spacy.tokens import Span
nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Louisiana Office of Conservation")
print(doc.ents[0], doc.ents[0].label_)
entity = Span(doc, 0, 1, label=391L)
print(entity, entity.label_)
doc.ents = list(doc.ents) + [entity]
print(doc.ents[0], doc.ents[0].label_)
(Louisiana Office of Conservation, u'ORG')
(Louisiana, u'MONEY')
(Louisiana Office of Conservation, u'MONEY')
I was expecting the MONEY entity to override the ORG entity, and the end offset of MONEY to be used after merging. Please let me know if I got this wrong.
Thanks a lot for the great and detailed bug report 👍 I just tried the example and I can confirm that it's also reproducible in v2.1.0 nightly. The current behaviour is definitely a bug.
In terms of the expected behaviour: I suggest that spaCy should actually raise an error here. By definition, a token can only be part of one entity. So when the doc.ents are set and an overlapping entity exists, there's no clear answer for how this should be resolved. Which entity should take precedence?
So it's probably more intuitive and logical if spaCy raised an error with a message like: "Trying to set conflicting doc.ents: (0, 4, 'ORG') and (0, 1, 'MONEY'). A token can only be part of one entity, so make sure the entities you're setting don't overlap."
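To make the "one token, one entity" rule concrete, here is a minimal sketch of how such a conflict check could work. It is not spaCy's implementation: `find_conflicts` and `check_ents` are hypothetical helper names, and spans are assumed to be plain `(start, end, label)` tuples of token offsets with an exclusive end, as in the proposed message.

```python
def find_conflicts(spans):
    """Return pairs of spans that share at least one token.

    Each span is a (start, end, label) tuple with an exclusive end offset.
    Hypothetical helper for illustration, not part of spaCy.
    """
    conflicts = []
    ordered = sorted(spans, key=lambda s: (s[0], s[1]))
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[0] >= first[1]:
                # Sorted by start, so no later span can overlap `first`.
                break
            conflicts.append((first, second))
    return conflicts


def check_ents(spans):
    """Raise ValueError on the first overlapping pair, else return the spans."""
    conflicts = find_conflicts(spans)
    if conflicts:
        first, second = conflicts[0]
        raise ValueError(
            "Trying to set conflicting doc.ents: {} and {}. A token can only "
            "be part of one entity.".format(first, second)
        )
    return list(spans)
```

Because `find_conflicts` returns the offending pairs rather than only raising, a caller could inspect them and decide which span to drop before setting `doc.ents`.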
Thanks for the quick reply, Ines. I think the new span could take precedence over the existing one, as this would help users create new derived entities.
This is fixed now in spacy-nightly 🎉. See also #2566.
I am actually finding that raising an error here causes more problems for me than it resolves. I have many cases where, for instance, both "Jon" and "Jon Smith" are NAME entities. I do want both tagged as NAME. In such cases I would prefer to tag whichever entity is longer, as it is likely to be more specific, rather than raising an error. There are many other instances as well, and I do not want to change my setup to exclude overlapping entities.
I would like spaCy to preserve the longer entity as well.
@isaacmg
Both "Jon" and "Jon Smith" cannot be tagged as NAME entities at the same time: in the first case the token 'Jon' gets a 'U' tag, whereas in the second case it gets a 'B' tag. The model is bound to be confused.
One way, as already mentioned, is to take care of the overlaps yourself.
Also, there are scenarios where the same token / char span is tagged with two entities. I guess that comes up as an error now. Not tested.
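The 'U' versus 'B' point above refers to the BILUO tagging scheme (Begin, In, Last, Unit, Out). The sketch below illustrates why one token cannot carry both tags. `biluo_tags` is a hypothetical pure-Python helper written for this illustration, not a spaCy function, and it assumes spans are `(start, end, label)` token offsets with an exclusive end.

```python
def biluo_tags(n_tokens, spans):
    """Assign one BILUO tag per token; refuse spans that reuse a token.

    Hypothetical helper for illustration, not part of spaCy.
    """
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end <= start:
            raise ValueError("empty span: {}".format((start, end, label)))
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError("overlapping span: {}".format((start, end, label)))
        if end - start == 1:
            tags[start] = "U-" + label  # single-token entity
        else:
            tags[start] = "B-" + label  # first token
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label  # inner tokens
            tags[end - 1] = "L-" + label  # last token
    return tags


# "Jon Smith" as two tokens:
print(biluo_tags(2, [(0, 1, "NAME")]))  # "Jon" alone
print(biluo_tags(2, [(0, 2, "NAME")]))  # "Jon Smith"
```

In the first call the token "Jon" is tagged `U-NAME`, in the second `B-NAME`. Since each token holds exactly one tag, both annotations cannot be stored at once.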
This is fixed now in spacy-nightly. See also #2566.
I still get an error with this code, even after installing spacy-nightly:
ValueError: [E103] Trying to set conflicting doc.ents: '(9, 11, u'ANIMAL')' and '(10, 11, u'ANIMAL')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.
@honnibal I would like an answer about how to override this error at the very least, as this issue is currently blocking me from using the updated version of spaCy. I have tried a variety of solutions but nothing works (note that for me, eliminating overlapping entities is not an option). I cannot find any easy way to subclass Doc in order to change the line of code that this PR implemented. What is the best way to override this behavior, short of maintaining my own separate spaCy branch?
Thanks.
Is there a way to handle the error here?
Indeed, having the entity with the maximum length as the default would be great.
Is there a way to get the ID of the entity causing the error, so the situation can be handled?
I am also running into this problem. I have a PhraseMatcher finding entities on top of the NER pipeline, but I would like to be able to link both entities. What's the suggested workaround?
Can someone provide sample code illustrating how to make the longer and/or newer span take precedence? @rbhambriiit, it sounds like you might know how to do this?
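As a starting point for the question above, here is a minimal sketch of one way to resolve overlaps before setting `doc.ents`, so that the longer span wins and, on equal length, the newer span wins. It is an illustration under stated assumptions rather than an official workaround: `resolve_overlaps` is a hypothetical helper, and spans are plain `(start, end, label)` tuples of token offsets with an exclusive end.

```python
def resolve_overlaps(existing, new):
    """Merge two lists of (start, end, label) spans into a non-overlapping list.

    Longer spans win; when lengths tie, spans from `new` beat spans from
    `existing`. Hypothetical helper for illustration, not part of spaCy.
    """
    candidates = [(span, 0) for span in existing] + [(span, 1) for span in new]
    # Longest first; among equal lengths, newer first; then leftmost first.
    candidates.sort(key=lambda c: (-(c[0][1] - c[0][0]), -c[1], c[0][0]))
    taken = set()  # token indices already claimed by a kept span
    kept = []
    for (start, end, label), _ in candidates:
        if any(i in taken for i in range(start, end)):
            continue  # loses to a span that was kept earlier
        taken.update(range(start, end))
        kept.append((start, end, label))
    return sorted(kept)


# The conflicting spans from the error message earlier in the thread:
print(resolve_overlaps([(9, 11, "ANIMAL")], [(10, 11, "ANIMAL")]))
```

The resolved tuples could then be turned back into `Span(doc, start, end, label=label)` objects and assigned to `doc.ents`. Later spaCy releases (v2.1.4 and up) also provide `spacy.util.filter_spans`, which filters a list of `Span` objects and prefers the longest span when they overlap.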