Thanks.
What do you mean by a composite entity?
A composite entity is a nested entity type: an entity that has another entity inside it.
E.g. "from [location]" would be one entity that contains another entity, location, inside it.
Sounds like a good feature. I've just encountered a somewhat similar case when nested entities could gladly help.
I'm just looking at how to implement this and am considering using spaCy. I'll likely train the NER on all the base entities, then use custom code to identify the hierarchies and merge the spans, e.g.
starting with base entities:
I'm at Starbucks ORG 789 CARDINAL Mission ST_NAME St ST_TYPE San Fran CITY
↓
Starbucks ORG 789 CARDINAL [Mission St] ST_NAMED San Fran CITY
↓
[Starbucks 789 Mission St] ST_ADDRESS San Fran CITY
↓
[Starbucks 789 Mission St San Fran] GEO_ADDRESS
Then just use the merged span for further processing (e.g. dependencies). It would be helpful if there was some sort of templating annotation system to do this within spaCy.
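For illustration, the bottom-up merge above could be sketched roughly as follows. This is plain Python over (text, label) pairs rather than spaCy objects, and it assumes the base entities are adjacent with nothing in between. The `RULES` table and the helper names are made up for this sketch; only the labels come from the example above.

```python
# Hypothetical merge rules, applied in order: an adjacent run of the labels on
# the left is collapsed into a single span carrying the label on the right.
RULES = [
    (("ST_NAME", "ST_TYPE"), "ST_NAMED"),
    (("ORG", "CARDINAL", "ST_NAMED"), "ST_ADDRESS"),
    (("ST_ADDRESS", "CITY"), "GEO_ADDRESS"),
]


def apply_rule(spans, pattern, new_label):
    """Replace each adjacent run of labels equal to `pattern` with one merged span."""
    out = []
    i = 0
    n = len(pattern)
    while i < len(spans):
        window = spans[i:i + n]
        if tuple(label for _, label in window) == pattern:
            # Join the texts of the matched run and relabel it.
            out.append((" ".join(text for text, _ in window), new_label))
            i += n
        else:
            out.append(spans[i])
            i += 1
    return out


def merge_hierarchy(spans, rules=RULES):
    """Apply every rule once, in order, building larger spans from smaller ones."""
    for pattern, new_label in rules:
        spans = apply_rule(spans, pattern, new_label)
    return spans


# Base entities as they might come out of the NER step.
base = [
    ("Starbucks", "ORG"),
    ("789", "CARDINAL"),
    ("Mission", "ST_NAME"),
    ("St", "ST_TYPE"),
    ("San Fran", "CITY"),
]
merged = merge_hierarchy(base)
print(merged)
```

With real spaCy output the pairs would be built from the recognized entity spans, and the merged result would have to be mapped back to token offsets before any further processing.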
Anybody else working on something like this?
Has this been implemented somewhere already? Any solution yet?
It appears nested (or overlapping) entities were disallowed in spaCy via #2880.
EDIT: Nested entities were never implemented. What was disallowed was setting the entity type attribute of a token span that intersected with a token span already containing a set entity type attribute. Such a use case can be met by utilizing the v2.1 Matcher functionality of patterns with custom token attributes (set to the other entities).
I have several qualms with this:
1) This limits named entity recognition to be a multi-class classification problem as opposed to the more general multi-label classification problem - dependency information between entities (labels) is lost if entities of interest naturally occur inside one another or overlap.
2) This commit has limited the capabilities of using Matcher with entities. For instance, I have an entity mass_unit that I would like to use as part of a pattern for matching a number followed by a unit like so:
[{'LIKE_NUM': True}, {'ENT_TYPE': 'mass_unit'}],
If I wanted to label all Spans that match this pattern as a new entity - I can no longer do this as the new entity will overlap with the existing entity mass_unit thus raising an error.
3) Nested entities arise in many NER applications - implicitly disallowing them limits the capabilities of the package. I wrote a fair amount of code to overcome this in my implementations, but in the end I just chose not to support anything past the version that had this commit, which is not a long-term solution.
From my reading of the issue discussions, it appears this functionality was implemented in part to solve rendering issues with displaCy and to address some span-mangling issues.
Related issues: #2550
@honnibal What are your thoughts?
@AndriyMulyar I agree with this. This new approach made me downgrade spaCy, as I now cannot do basic things like tag both "Dan Johnson" and "Dan" as NAME due to the overlap. In my case I need an option to tag the longest entity in cases of overlap. More generally, though, I think there should be a parameter that lets users specify what should happen in cases of overlap (i.e. raise error, longest, shortest, custom option, etc).
What is the status here? Was there any response pertaining to nested entities (please no hacks!)? It's a very important factor in training models to understand context.
@datascienceteam01 In my current project medaCy I got around this by completely ignoring the entity handling functionality of spaCy and writing my own. It still works fast - even at scale (thousands of documents) - and is able to interface with spaCy models. Although my project and code are engineered for the NLP domain at hand, I hope it can be used as an example of one way around the restriction. Unfortunately, this means either not upgrading past spaCy v2.0.13, where the hard error was introduced, or not using the excellent Matcher functionality. I chose the former route.
@AndriyMulyar I'm confused as to how this ever worked. The entities have always been stored on the tokens using two attributes: ent_iob and ent_type. Each token can only receive one ent_iob value, indicating whether the token begins an entity, is inside one, or is outside any entity, and one ent_type value indicating the type. So the implementation has never had a way of storing nested named entities.
What you should do if you need nested named entities is add a custom attribute, and store them there. In v2.1 you'll also be able to use the Matcher over the values in extension attributes as well.
I'm really not sure how your code is working in v2.0.12.
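To make the storage point concrete, here is a toy model of per-token tagging. It is not spaCy's internal code, just a sketch assuming one (iob, type) pair per token; the `FROM_LOCATION` label and the `layers` dictionary are invented for illustration. It shows why a second span over the same tokens can only overwrite the first, and how keeping the nested layer in a separate structure (in spaCy, that would be a custom attribute) preserves both.

```python
def tag_tokens(n_tokens, spans):
    """Write (start, end, label) spans into per-token IOB tags; `end` is exclusive.

    Each token holds a single (iob, type) pair, so later spans overwrite
    earlier ones wherever they overlap.
    """
    tags = [("O", "")] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            tags[i] = ("B" if i == start else "I", label)
    return tags


tokens = ["from", "San", "Francisco"]
outer = (0, 3, "FROM_LOCATION")  # hypothetical composite label
inner = (1, 3, "LOCATION")

# Writing both spans into one tag sequence truncates the outer entity:
# "San" and "Francisco" end up carrying only LOCATION.
tags = tag_tokens(len(tokens), [outer, inner])
print(tags)

# Keeping the composite layer in its own structure preserves both spans.
layers = {"ents": [inner], "composite_ents": [outer]}
print(layers)
```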
@honnibal The merge I referenced above implemented the throwing of a hard error when attempting to set an entity tag onto a token that already had an entity tag. It appears that what was actually happening was that the entity tag was being overridden (which happened to be the behavior desired) - not that multiple entity tags were being set for a given token.
The referenced improved functionality in v2.1 for Matcher seems like it will provide a sufficient solution to this use case. This thread can probably be closed.
@andriymulyar if you just want to overwrite the entity tag, you can just reconcile the entities as you want them before assigning to doc.ents? If you don't need actual overlap or nesting there should be no problem.
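One possible way to do that reconciliation, keeping the longest span whenever candidates overlap (the "longest" option asked for earlier in the thread), is sketched below. It works on plain (start, end, label) token offsets with an exclusive end; the function name `keep_longest` is made up for this example.

```python
def keep_longest(spans):
    """Return non-overlapping (start, end, label) spans, preferring longer ones.

    `end` is exclusive. Among spans of equal length, the earlier start wins.
    """
    chosen = []
    taken = set()  # token indices already covered by a chosen span
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if not any(i in taken for i in range(start, end)):
            chosen.append(span)
            taken.update(range(start, end))
    return sorted(chosen)


# Tokens: ["Dan", "Johnson", "met", "Dan"]
candidates = [(0, 2, "NAME"), (0, 1, "NAME"), (3, 4, "NAME")]
resolved = keep_longest(candidates)
print(resolved)
```

The resolved offsets could then be turned into spans and assigned to doc.ents without triggering the overlap error. Later spaCy releases also ship a helper, spacy.util.filter_spans, that applies a similar longest-first rule to Span objects.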
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.