spaCy: Coreference

Created on 10 Feb 2017 · 8 comments · Source: explosion/spaCy

Currently, only the Stanford NLP tools offer a good coreference resolution component.
Could you please add this feature to spaCy? Thanks!

enhancement help wanted

All 8 comments

@smartschat is interested in porting his CORT toolkit over to spaCy, which would be really awesome. However there's lots of other things to do, and naturally his priority will be to keep his research moving, especially as it's currently submission season.

So: this will happen, it's just a question of schedules. If anyone reading has a commercial need, you might consider placing a bounty. It's a pretty reliable way to shift priorities :).

The good news is that coreference resolution models are finally getting better. I'm looking forward to seeing how the neural network models we're developing for spaCy 2.0 will work in the CORT search framework.

Quick update: This might be a nice use case for the new custom processing pipeline components and extension attributes introduced in v2.0!
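
For illustration, here is a minimal sketch of what such a component could look like with the v2.x API. The component name `coref` and the `coref_clusters` attribute are invented for this example; a real component would fill the attribute from an actual coreference model.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical attribute to hold the clusters produced by the component.
Doc.set_extension("coref_clusters", default=None)

def coref_component(doc):
    # Placeholder: a real implementation would run a coreference model here
    # and store its clusters on the custom attribute.
    doc._.coref_clusters = []
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(coref_component, name="coref", last=True)  # v2.x add_pipe signature

doc = nlp("John Smith arrived. Mr Smith sat down.")
print(doc._.coref_clusters)
```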

@ines (and @honnibal ) definitely agree. @benhachey and I are porting some rule-based coreference to use components and extensions. One thing we're thinking about at the moment is where to store things and what to call them. Even if we're not able to release the system (i.e. components), we'd like to make sure that the results are stored in sensible places, so it would be possible to swap different systems in and out and access the clusters using the same interface.

Strawman Data Model

We're representing a document's clusters as a forest of shallow trees of Span nodes, e.g.:

  • Antecedent (John Smith)

    • Referent (Mr Smith)

    • ...

    • Referent (Smith)

We're mainly concerned with proper name coreference, but if you want pronominal/nominal, you can add those mentions to Doc.ents.

Storage-wise, it looks like the data needs to live on the Doc (since Spans are created on-demand). A simple version is two structures containing Span locations (i.e. slice(span.start, span.end)):

  • Doc._._ant_to_refs is a dict mapping a slice for each antecedent to a list of slices for its referents.
  • Doc._._ref_to_ants is a dict mapping a slice for each referent to the slice for its antecedent.

Most coreference algorithms will use storage optimised for their own needs, so populating the data should just be a matter of building the above structures. Alternatively, you could have CRUD-style extensions on Spans, which would be nice but annoying to write.
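
To make the strawman concrete, here is a rough sketch of the two structures populated by hand for the John Smith example. `(start, end)` tuples stand in for the proposed slice objects, since slices aren't hashable dict keys; the extension names simply mirror the proposal and are not an existing spaCy API beyond `set_extension`.

```python
import spacy
from spacy.tokens import Doc

# Strawman storage from the proposal, registered as Doc extensions.
Doc.set_extension("_ant_to_refs", default=None)
Doc.set_extension("_ref_to_ants", default=None)

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Smith arrived. Mr Smith sat down. Smith left.")

# Hand-built cluster for illustration; a coref component would produce these.
# (start, end) tuples instead of slices: slices aren't hashable dict keys.
ant = (0, 2)               # "John Smith"
refs = [(4, 6), (9, 10)]   # "Mr Smith", "Smith"

doc._._ant_to_refs = {ant: refs}
doc._._ref_to_ants = {ref: ant for ref in refs}

# Locations can be turned back into Spans on demand.
print([doc[start:end].text for start, end in doc._._ant_to_refs[ant]])
```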

There are a few ways you might access the data:

  • Span objects in Doc.ents get two new extensions, both of which look up slices and find the right spans (I assume we can slice into the Doc, create a Span and add the label from the Token...). This handles the use case of wanting to go from an entity to its cluster:

    • Span._.ant: this is the entity's antecedent, or None if the entity is the antecedent.

    • Span._.refs: this is a list of an entity's referents, or None if the entity is not an antecedent.

  • Doc would get a new extension: Doc._.ants, which yields all antecedents. This handles the use-case of grabbing all clusters and doing something with them.
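
A rough sketch of those getter-based extensions, building on the storage structures above; the names follow the strawman and nothing here is an existing spaCy API beyond `set_extension`:

```python
from spacy.tokens import Doc, Span

def get_ant(span):
    # The span's antecedent, or None if the span is itself an antecedent.
    loc = (span.doc._._ref_to_ants or {}).get((span.start, span.end))
    return span.doc[loc[0]:loc[1]] if loc is not None else None

def get_refs(span):
    # The span's referents, or None if the span is not an antecedent.
    locs = (span.doc._._ant_to_refs or {}).get((span.start, span.end))
    if locs is None:
        return None
    return [span.doc[start:end] for start, end in locs]

def get_ants(doc):
    # One Span per antecedent, i.e. one per cluster.
    for start, end in (doc._._ant_to_refs or {}):
        yield doc[start:end]

Span.set_extension("ant", getter=get_ant)
Span.set_extension("refs", getter=get_refs)
Doc.set_extension("ants", getter=get_ants)

# Usage, continuing the example above:
# for ant in doc._.ants:
#     print(ant.text, "->", [ref.text for ref in ant._.refs])
```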

Outstanding questions

  1. How namespacy should extension names be? Span._.coref.ant, Span._.ant, Span._.coref_ant, ...? The extension docs talk a bit about component names, but is there a convention you had in mind for Doc and Span extensions?
  2. There are plenty of name choices for clusters, antecedents and referents. Are there some spaCy-preferred ones?
  3. Is the above storage sensible? I can also imagine having volatile indices that are not serialised, but allow efficient access (I wonder if Doc.before_to_bytes() and Doc.before_from_bytes() might be useful hooks).
  4. Syncing with entities is tricky, and you might need hooks like Doc.on_ent_add(), etc... However, I feel like the most pragmatic approach is for developers to be careful and re-run coreference if they add a new entity (or repair the links).

Comments welcome!

I see neuralcoref v2.0 has been released today. Is this still loosely supported as a pipeline component, or at least the recommended way to go for coref? I have a number of code examples locally that use it, but I'm having difficulty finding any reference to it in the documentation at the moment, apart from three closed issues here.

Yes, for coref, we definitely recommend using neuralcoref v2.0 in combination with spaCy:
馃 https://github.com/huggingface/neuralcoref

There might also be spaCy pipeline components with custom attributes soon, so you can easily plug one of their models into your spaCy pipeline.
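
For anyone landing here later, the basic integration looks roughly like the sketch below. The exact calls have changed between neuralcoref releases, so treat this as the later add_to_pipe-style API rather than necessarily matching v2.0.

```python
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)  # registers the coref component and its extensions

doc = nlp("My sister has a dog. She loves him.")
print(doc._.has_coref)        # True if any cluster was found
print(doc._.coref_clusters)   # e.g. [My sister: [My sister, She], a dog: [a dog, him]]
```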

I have managed to integrate neuralCoref (single speaker) into the spaCy pipeline by redirecting the doc on this line in neuralcoref from within my Pipe.__call__(doc).

In order to fully integrate neuralCoref (with multiple speakers) into the spaCy pipeline components, Language would need to be able to consume custom attributes that mark the speaker when calling nlp(text). Something along the lines of adding an extension like text_holder._.speaker = {...} before nlp(text_holder).
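
Until something like that exists, one hedged workaround under spaCy v2 is to build the Doc first, attach the speaker metadata as a custom extension, and then apply the pipeline components manually; `speaker` is just an illustrative attribute name here, not something spaCy or neuralcoref reads on its own.

```python
import spacy
from spacy.tokens import Doc

Doc.set_extension("speaker", default=None)  # illustrative attribute name

nlp = spacy.load("en_core_web_sm")

def pipe_with_speaker(text, speaker):
    # Tokenize only, attach the metadata, then run each component by hand.
    doc = nlp.make_doc(text)
    doc._.speaker = speaker
    for name, proc in nlp.pipeline:
        doc = proc(doc)
    return doc

doc = pipe_with_speaker("I think he agrees with me.", {"I": "Alice", "me": "Alice"})
print(doc._.speaker)
```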

Since OntoNotes 5, the speaker is annotated, so this could be useful for future CoNLL tasks.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
