I have been getting the following error after adding new components to the pipeline. A new entity was also added. Any help is much appreciated!
I also tried running the example in the GitHub clone and it worked fine. The model I am using was trained (using Prodigy) on the new label, on top of en_core_web_lg.
```
    doc = nlp(text)
  File "/home/sandeep/anaconda2/lib/python2.7/site-packages/spacy/language.py", line 333, in __call__
    doc = proc(doc)
  File "nn_parser.pyx", line 331, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 762, in spacy.syntax.nn_parser.Parser.set_annotations
  File "doc.pyx", line 846, in spacy.tokens.doc.Doc.extend_tensor
  File "/home/sandeep/anaconda2/lib/python2.7/site-packages/numpy/core/shape_base.py", line 288, in hstack
    return _nx.concatenate(arrs, 1)
ValueError: all the input array dimensions except for the concatenation axis must match exactly
```
My component class is as follows. I have one component class for each of the 4 labels (one of the labels is the new one).
```python
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class GazzeteerComponent(object):
    def __init__(self, nlp, entityname):
        self.vocab = nlp.vocab
        self.entityname = entityname
        self.label = nlp.vocab.strings[entityname]
        self.__name__ = entityname + 'GazzeteerComponent'
        self.matcher = PhraseMatcher(nlp.vocab)
        self.nlp = nlp
        patterns = self.getpatterns(entityname)
        exceptioncount = 0
        for pattern in patterns:
            try:
                self.matcher.add(entityname, None, pattern)
            except ValueError as e:
                exceptioncount += 1
                # LOGGER_Generic.error("Exception {} for pattern {} for entity {}".format(e, pattern, entityname))

    def getpatterns(self, entityname):
        fname = 'patterns_{}.list'.format(entityname)
        with open(fname) as entityf:
            alldict = [ensureUtf(text.strip()) for text in entityf.readlines()]
        patterns = [self.nlp.make_doc(pattern) for pattern in alldict]
        return patterns

    def __call__(self, doc):
        matches = self.matcher(doc)
        spans = []
        try:
            for _, start, end in matches:
                entity = Span(doc, start, end, label=self.label)
                spans.append(entity)
                doc.ents = list(doc.ents) + [entity]
            for span in spans:
                span.merge()
        except Exception as e:
            LOGGER_Generic.info("Exception in GazzeteerComponent {}".format(e))
        return doc
```
Platform: Linux-4.4.0-104-generic-x86_64-with-debian-jessie-sid
spaCy version: 2.0.2
I found that this happens if the ner component comes after the PhraseMatcher component. If I put the PhraseMatchers after the ner component, it works fine. Is this a bug that needs a fix?
Thanks.
Yes, this is a bug: during the pipeline there's a doc.tensor attribute that has one row per token. The .merge() method should be modifying this tensor, but isn't. Subsequent pipeline components then can't extend the tensor, because the number of tokens is wrong.
To work around the bug, you could run your matcher component first in the pipeline, or last. You could also have your component set doc.tensor = None, to avoid the problem.
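For illustration, the shape mismatch can be reproduced with plain NumPy (hypothetical row counts and a hypothetical 128-dimensional tensor): merging tokens shrinks the token count, so a later component produces fewer rows than the stored tensor has, and the `hstack` inside `extend_tensor` fails exactly as in the traceback above.

```python
import numpy as np

# doc.tensor before the merge: one 128-dim row per token (6 tokens).
old_tensor = np.zeros((6, 128), dtype="float32")

# After span.merge() the doc has only 5 tokens, so the next pipeline
# component produces a 5-row array to append as extra columns.
new_columns = np.zeros((5, 128), dtype="float32")

try:
    np.hstack([old_tensor, new_columns])  # mimics Doc.extend_tensor
except ValueError as e:
    print(e)  # the row counts (6 vs 5) differ on the non-concatenation axis
```

With matching row counts, the same `hstack` would succeed and simply widen the tensor.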
Thanks @honnibal. The workaround works. But how do I overwrite an entity assigned by a previous component? The example code is about adding to the existing entities. I am using the PhraseMatcher to boost confidence and reduce false positives from the ner component or previous PhraseMatchers.
```python
# Overwrite doc.ents and add entity – be careful not to replace!
doc.ents = list(doc.ents) + [entity]
```
@SandeepNaidu If you actually want to replace the entities assigned by previous components, you can also just write to `doc.ents` and replace it with your own list. The main reason we've added the "be careful not to replace!" comment is that in most cases, users don't actually want that – so setting `doc.ents = [entity]` is a common mistake that can easily lead to confusing results.
Thanks @ines. That assumes we compute the full list of entities in the current component. However, I might have ent1 (label1) and ent2 (label1) detected in previous components, and ent2 (label2) detected in the current component for a different label. I want to retain ent1 (label1) and overwrite ent2 (label1) with ent2 (label2). With the above method, I don't think that happens. I wrote some code and it does not seem to work that way; I am still debugging, but I think this is what is happening.
@SandeepNaidu The existing entities in doc.ents are also regular Spans, so you could filter them within your custom component? For example, you could keep all entities whose text isn't equal to the text of the entity your new component just detected, and add your new entity instead:
```python
# the new entity detected by your custom component
entity = Span(doc, start, end, label=self.label)
new_ents = [ent for ent in doc.ents if ent.text != entity.text] + [entity]
doc.ents = new_ents
```
You could also check the ent.label_ and compare that to the new label – for example, to only overwrite entities which your custom component assigned a different label to. (I hope I understood your question correctly btw!)
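A minimal sketch of that label-based overwrite, using a hypothetical `keep_or_replace` helper and plain `(text, label)` tuples in place of real `Span` objects for brevity:

```python
def keep_or_replace(existing_ents, new_ent):
    """Drop any existing entity with the same text as new_ent but a
    different label, then append new_ent. Entities are (text, label)
    pairs here; with real spaCy Spans you would compare ent.text and
    ent.label_ instead."""
    kept = [ent for ent in existing_ents
            if not (ent[0] == new_ent[0] and ent[1] != new_ent[1])]
    return kept + [new_ent]

ents = [("Acme", "ORG"), ("Python", "PRODUCT")]
print(keep_or_replace(ents, ("Python", "SKILL")))
# [('Acme', 'ORG'), ('Python', 'SKILL')]
```

Here "Python"/PRODUCT is overwritten by "Python"/SKILL, while "Acme"/ORG survives untouched.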
@ines Got it, thanks for the answer. I am using the code below to add a filtering component that excludes unwanted labels and gives more control over priority assignment. Pasting it here.
```python
import pandas as pd


class FilterEntsComponent(object):
    def __call__(self, doc):
        allents = list()
        entitytaglist = ['ORG', 'PRODUCT', 'SKILL', 'PERSON']
        excludelist = ['WORK_OF_ART']
        for ent in doc.ents:
            if ent.label_ in excludelist:
                continue
            if ent.label_ not in entitytaglist:
                entitytaglist.append(ent.label_)
            allents.append((ent.start, ent.end, ent.label_,
                            entitytaglist.index(ent.label_), ent))
        allents = pd.DataFrame(allents)
        allents.sort_values([0, 3], inplace=True)   # sort by start offset, then label priority
        allents.drop_duplicates(0, inplace=True)    # keep one entity per start offset
        filteredents = allents[4].tolist()
        doc.ents = filteredents
        return doc
```
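The sort-and-dedup step above can be seen on toy data (hypothetical offsets, labels, and span placeholders): after sorting by (start, priority index), `drop_duplicates` keeps the first row per start offset, so the label earliest in the priority list wins.

```python
import pandas as pd

# Columns mirror the tuples above: start, end, label, priority index, span.
rows = [
    (0, 2, "PRODUCT", 1, "span_a"),   # same start offset as the SKILL entity
    (0, 2, "SKILL", 2, "span_b"),
    (5, 6, "PERSON", 3, "span_c"),
]
df = pd.DataFrame(rows)
df.sort_values([0, 3], inplace=True)   # sort by start, then priority index
df.drop_duplicates(0, inplace=True)    # keep the highest-priority span per start
print(df[4].tolist())  # ['span_a', 'span_c']
```

One caveat with this approach: it only deduplicates entities that share the same start offset, so overlapping spans with different starts would still collide when written back to `doc.ents`.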
Hi @honnibal and @ines, thank you very much for your great work, it's really appreciated!
I'm facing the very same problem discussed here. However, I cannot put my component first or last in the pipeline: I'm trying to put it right after the tagger but before the parser, because my component needs tag and pos annotations to perform some checks before constructing the dependency tree.
Can you suggest a workaround if the problem is not solved yet?
Thank you!
@alanramponi Thanks! What exactly does your custom component do? Are you also matching spans and merging tokens? The error mentioned above is caused by the tensor not being modified correctly by the merge() method. Does the following work for you?
> You could also have your component set `doc.tensor = None`, to avoid the problem.
Hi @ines, my custom component merges tokens using the Matcher over linguistic features, including POS tags. It is a kind of chunker based on all the annotations available before the parser and ner modules; this choice helps a lot in my use case by producing better dependency parses.
If I set `doc.tensor = None`, I get the following error:
```
  File "/[MY_LIB_PATH]/python3.6/site-packages/spacy/language.py", line 341, in __call__
    doc = proc(doc)
  File "nn_parser.pyx", line 338, in spacy.syntax.nn_parser.Parser.__call__
  File "nn_parser.pyx", line 786, in spacy.syntax.nn_parser.Parser.set_annotations
  File "doc.pyx", line 875, in spacy.tokens.doc.Doc.extend_tensor
AttributeError: 'NoneType' object has no attribute 'size'
```
However, I found a temporary workaround, i.e., setting `doc.tensor = numpy.zeros((0,), dtype='float32')` instead of `doc.tensor = None`.
Hope it can help!
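A rough sketch of why the zero-size array works (a simplified stand-in for `Doc.extend_tensor`, not the actual Cython source): an empty tensor is simply replaced, while a populated one is widened with `hstack`, and `None` fails as soon as `.size` is touched.

```python
import numpy as np

def extend_tensor_sketch(tensor, new_columns):
    # Simplified stand-in for Doc.extend_tensor. An empty tensor is
    # replaced outright; a populated one is widened with hstack, which
    # is where the row-count mismatch blows up after merging tokens.
    if tensor is None:
        raise AttributeError("'NoneType' object has no attribute 'size'")
    if tensor.size == 0:
        return new_columns
    return np.hstack([tensor, new_columns])

empty = np.zeros((0,), dtype="float32")
parser_output = np.ones((5, 64), dtype="float32")
print(extend_tensor_sketch(empty, parser_output).shape)  # (5, 64)
```

This is why `numpy.zeros((0,))` sails through while `None` raises the `AttributeError` shown above.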
Ah, yes, this makes sense – thanks a lot for sharing your workaround!
The code for the retokenization lives in `spacy/tokens/_retokenize.pyx`. It should be pretty easy to collapse the tensor rows for the merged regions. We probably want to just use the last row in the merged region as the row for the whole region.
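In NumPy terms, that row-collapsing idea might look like the following (a hypothetical `collapse_rows` helper, assuming a single merged region given as a `[start, end)` token range):

```python
import numpy as np

def collapse_rows(tensor, start, end):
    """Collapse tensor rows [start, end) for a merged token region,
    keeping the last row of the region as the row for the whole
    region, as suggested above."""
    last = tensor[end - 1 : end]              # representative row
    return np.vstack([tensor[:start], last, tensor[end:]])

tensor = np.arange(12, dtype="float32").reshape(6, 2)  # 6 tokens, 2 dims
merged = collapse_rows(tensor, 1, 4)  # tokens 1..3 merged into one
print(merged.shape)        # (4, 2)
print(merged[1].tolist())  # [6.0, 7.0] -> row of token 3, the region's last
```

After the collapse, the tensor's row count matches the new token count again, so subsequent `extend_tensor` calls can concatenate cleanly.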
Fixed!