spaCy: (re-opening bug) thinc.extra.search.Beam.advance raises an assertion error with custom entity labels

Created on 20 Sep 2019 · 8 comments · Source: explosion/spaCy

I couldn't figure out how to re-open an issue, so I am creating a new one with a link to the previous one: https://github.com/explosion/spaCy/issues/3047.

The behavior is still present.

Basically, here is the line where the assertion is triggered.

https://github.com/explosion/thinc/blob/master/thinc/extra/search.pyx#L149

I can't quite figure out from the code what the size variable is meant to represent there. Perhaps I can work around it, but I need to understand the root cause.

Also, if there is a different recipe for getting confidence scores on the extracted entities, I am happy to consider it (I haven't found an alternative).

bug 🔮 thinc

All 8 comments

Hi @sshegheva, thanks for the report! I can't immediately reproduce the bug. Could you provide a minimal working (crashing) code snippet (with toy data), and could you also provide your system information by pasting the output of python -m spacy info --markdown?

spaCy's markdown info:

Info about spaCy

  • spaCy version: 2.1.8
  • Platform: Darwin-18.6.0-x86_64-i386-64bit
  • Python version: 3.6.7

Here is a minimal code snippet that reproduces the issue:

from collections import defaultdict

import spacy
from spacy_lookup import Entity
nlp = spacy.load("en_core_web_md")
entity = Entity(keywords_list=["gradient", "neural network"], label="ML")
text = "you have to be aware of a vanishing gradient when training a neural network"
nlp.add_pipe(entity, last=True)

docs = list(nlp.pipe([text], disable=["ner"]))
beams, _ = nlp.entity.beam_parse(docs,
                                 beam_width=3,
                                 beam_density=0.001)
entity_scores = defaultdict(float)
for doc, beam in zip(docs, beams):
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            ent = doc[start:end]
            if ent.text:  # do not write an empty entity
                entity_scores[(ent.text.lower(), label)] += score

And this is the assertion error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-12-f8fac7fe1cd4> in <module>
      2 beams, _ = nlp.entity.beam_parse(docs,
      3                                       beam_width=3,
----> 4                                       beam_density=0.001)
      5 entity_scores = defaultdict(float)
      6 for doc, beam in zip(docs, beams):

nn_parser.pyx in spacy.syntax.nn_parser.Parser.beam_parse()

nn_parser.pyx in spacy.syntax.nn_parser.Parser.transition_beams()

search.pyx in thinc.extra.search.Beam.advance()

AssertionError: 

Thanks! With your text snippet I can reproduce the problem. Just a small comment: the result of nlp.entity.beam_parse is not a tuple (anymore), so you don't need beams, _ but just beams.

Anyway, it looks like something goes wrong while resizing the NER. You run nlp() on your docs, after which some entities have the label "ML"; the NER should account for those labels, but perhaps the update leaves some internal state inconsistent. We'll try and get to the bottom of this.
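As a quick illustration of that mismatch (continuing from the reproduction snippet above; the labels property on the NER component and the expected output are assumptions, not verified against these exact versions):

# Sketch: the Entity pipe has already written "ML" spans onto the Doc,
# but "ML" is not among the labels the statistical NER knows about.
print([(ent.text, ent.label_) for ent in docs[0].ents])
# expected something like: [('gradient', 'ML'), ('neural network', 'ML')]
print("ML" in nlp.entity.labels)
# expected: False, until add_label("ML") is called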

As a workaround, you could add the Entity pipe only after everything with the beam search is done: disable all components except entity, and run nlp() again so your docs get augmented with the dictionary entities only at the very end, as sketched below.
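A sketch of that workaround, assuming the default en_core_web_md pipeline (tagger, parser, ner) plus the Entity pipe from spacy_lookup:

from collections import defaultdict

import spacy
from spacy_lookup import Entity

nlp = spacy.load("en_core_web_md")
text = "you have to be aware of a vanishing gradient when training a neural network"

# 1) Run the beam parse first, before the dictionary-based Entity pipe is added,
#    so the Doc objects don't carry any labels the NER hasn't registered.
docs = list(nlp.pipe([text], disable=["ner"]))
beams = nlp.entity.beam_parse(docs, beam_width=3, beam_density=0.001)

entity_scores = defaultdict(float)
for doc, beam in zip(docs, beams):
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            ent = doc[start:end]
            if ent.text:  # do not write an empty entity
                entity_scores[(ent.text.lower(), label)] += score

# 2) Only now add the Entity pipe and re-run the pipeline with everything else
#    disabled, so the docs are augmented with the dictionary entities at the end.
entity = Entity(keywords_list=["gradient", "neural network"], label="ML")
nlp.add_pipe(entity, last=True)
docs = list(nlp.pipe([doc.text for doc in docs],
                     disable=["tagger", "parser", "ner"]))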

I think I see the problem here. When we're setting up the beam, we have:

for doc in docs:
    beam = Beam(self.n_moves, beam_width, min_density=beam_density)
    beam.initialize(self.init_beam_state, doc.length, doc.c)

But inside beam.initialize, we check whether the Doc object has any NER labels we need to add. If it does have those labels, we'll have a mismatch between the number of actions we've told the Beam object about, and the number of actions actually in the parser.

We can't simply update the beam.nr_class variable, as we would also need to resize various buffers within the beam that were set up when it was created. It also wouldn't be correct, because more classes could be added later.

It's sort of dumb, but I propose adding this at the start of TransitionSystem.init_beams:


# Doc objects might contain labels that we need to register actions for. We need to check for that
# *before* we create any Beam objects, because the Beam object needs the correct number of
# actions. It's sort of dumb, but the best way is to just call init_batch() -- that triggers the additions,
# and it doesn't matter that we create and discard the state objects.
self.init_batch(docs)

@sshegheva In the meantime you should be able to work around the problem by calling nlp.entity.add_label("ML") before you make your call to beam_parse.
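In the reproduction snippet above, that would look roughly like this (a sketch, reusing the same imports and text):

entity = Entity(keywords_list=["gradient", "neural network"], label="ML")
nlp.add_pipe(entity, last=True)

# Register the custom label with the statistical NER before beam parsing,
# so the Beam objects get created with the right number of actions.
nlp.entity.add_label("ML")

docs = list(nlp.pipe([text], disable=["ner"]))
beams = nlp.entity.beam_parse(docs, beam_width=3, beam_density=0.001)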

That does fix it @honnibal :-)

Adding a label to the nlp.entity pipeline works, but I am wondering if that is desired.

defaultdict(float,
            {('you', 'ML'): 0.6666666666666666,
             ('have', 'ML'): 0.6666666666666666,
             ('to be', 'ML'): 0.6666666666666666,
             ('aware of', 'ML'): 0.6666666666666666,
             ('a vanishing', 'ML'): 0.6666666666666666,
             ('gradient', 'ML'): 1.0,
             ('when', 'ML'): 0.6666666666666666,
             ('training', 'ML'): 0.3333333333333333,
             ('a', 'ML'): 0.6666666666666666,
             ('neural network', 'ML'): 1.0,
             ('training a', 'ML'): 0.3333333333333333,
             ('you have to', 'ML'): 0.3333333333333333,
             ('be aware', 'ML'): 0.3333333333333333,
             ('of a', 'ML'): 0.3333333333333333,
             ('vanishing', 'ML'): 0.3333333333333333,
             ('when training', 'ML'): 0.3333333333333333})

Look at the output: now the beam parse thinks that every span is part of an ML entity, while the only terms in the dictionary are gradient and neural network.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
