Spacy: Not able to run textcat.pipe() after loading TextCategorizer from disk

Created on 17 Nov 2020 · 8 Comments · Source: explosion/spaCy

Issue

I'm trying to load a self-trained TextCategorizer model and run it on already pre-processed documents (e.g. by using textcat.pipe(docs)).
I started with this tutorial: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py

The training and evaluation work just fine, but when I try to reload the pipeline/model I get stuck.
I cannot store the model on its own, only the whole pipeline (workaround: disable the other components). This shouldn't be an issue: I just load the pipeline and then select the textcat component. But when I try to apply it to documents, I receive this error:

  File "<..>\Python37\lib\site-packages\thinc\api.py", line 295, in begin_update
    X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
  File "ops.pyx", line 150, in thinc.neural.ops.Ops.flatten
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

My class where I try to load the trained TextCategorizer:

import spacy
from spacy.pipeline import TextCategorizer

class SentenceSplitter:

    def __init__(self, path):
        self.loaded_pipeline = spacy.load(path)
        self.categorizer = self.loaded_pipeline.get_pipe(TextCategorizer.name)

    def __call__(self, doc):
        for sentence in doc.sents:
            self.categorizer.pipe([sentence.as_doc()])  # This is not possible/fails
        return doc

More or less (e.g. the model is created and not loaded), the exact same thing is done during training and there it works.
Any ideas?

Environment

spaCy version: 2.2.4
Platform: Windows-10-10.0.19041-SP0
Python version: 3.7.9

feat / serialize feat / textcat

Source

afftek

All 8 comments

Hi, can you provide a minimal working example that shows this error? Here's a test case that works fine for me locally with v2.2.4, with a model trained with the example train_textcat.py script saved with -o /tmp/textcat:

import spacy

class MyComponent:
    def __init__(self, textcat):
        self.textcat = textcat

    def __call__(self, doc):
        for sent_doc in self.textcat.pipe([sent.as_doc() for sent in doc.sents]):
            print(sent_doc.text, sent_doc.cats)
        return doc

nlp = spacy.load("/tmp/textcat")
my_component = MyComponent(nlp.get_pipe("textcat"))
nlp.remove_pipe("textcat")
nlp.add_pipe(nlp.create_pipe("sentencizer"))
nlp.add_pipe(my_component)

text = "This is one sentence. This is another. This movie was terrible."

doc = nlp(text)

Output (I only trained the textcat model for one iteration, so the performance is not good):

This is one sentence.  {'POSITIVE': 0.5069777369499207, 'NEGATIVE': 0.49302226305007935}
This is another.  {'POSITIVE': 0.5093674063682556, 'NEGATIVE': 0.4906326234340668}
This movie was terrible. {'POSITIVE': 0.4363318383693695, 'NEGATIVE': 0.5636681914329529}

It's possible you're running into problems related to the vocab, although I wouldn't have expected that particular kind of thinc error in this case. When you use textcat(doc) or textcat.pipe(docs), make sure that the textcat component and the doc share the same vocab. Either the textcat component and the main pipeline share the same vocab (as in the example above), or you apply a whole separate pipeline to the text in your component, something more like:

class MyComponent:
    def __init__(self, path):
        self.nlp = spacy.load(path)

    def __call__(self, doc):
        for sent_doc in self.nlp.pipe([sent.text for sent in doc.sents]):
            print(sent_doc.text, sent_doc.cats)
        return doc
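
The shared-vocab condition can be checked by comparing the Vocab objects by identity. This is only a minimal sketch, not part of the answer above: two blank English pipelines stand in for a model that was loaded twice, and shares_vocab is a hypothetical helper name.

```python
import spacy


def shares_vocab(doc, component_or_pipeline):
    """Return True if the doc was created with the very same Vocab object."""
    return doc.vocab is component_or_pipeline.vocab


nlp_a = spacy.blank("en")
nlp_b = spacy.blank("en")  # independent pipeline, so it has its own Vocab

doc = nlp_a.make_doc("This movie was terrible.")
same = shares_vocab(doc, nlp_a)       # doc came from nlp_a
different = shares_vocab(doc, nlp_b)  # nlp_b was created separately
print(same, different)
```

The same comparison should work for a component such as textcat, assuming it exposes a vocab attribute the way a pipeline does.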

textcat.pipe() is only faster if you can provide multiple docs as input so it can process them in batches. If you only have one doc at a time, you can just use nlp.get_pipe("textcat")(doc) instead (__call__ instead of pipe).
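
A rough sketch of that choice, under stated assumptions: categorize is a hypothetical helper, and FakeTextcat is a stand-in that only mimics the call shape of a component (plain strings take the place of docs), so the snippet runs without a trained model. It assumes pipe() is lazy, so the result is wrapped in list() to make sure the work actually happens.

```python
def categorize(component, docs):
    """Apply a component: __call__ for a single doc, pipe() for several."""
    docs = list(docs)
    if len(docs) == 1:
        return [component(docs[0])]
    # pipe() is treated as a lazy generator; list() forces it to run.
    return list(component.pipe(docs))


class FakeTextcat:
    """Hypothetical stand-in with the same call shape as a component."""

    def __init__(self):
        self.calls = []

    def __call__(self, doc):
        self.calls.append("call")
        return doc

    def pipe(self, docs):
        self.calls.append("pipe")
        for doc in docs:  # nothing here runs until the generator is iterated
            yield doc


fake = FakeTextcat()
single = categorize(fake, ["one sentence"])
many = categorize(fake, ["first", "second", "third"])
print(fake.calls)
```

With a real component, the first argument would be nlp.get_pipe("textcat") and the inputs would be Doc objects that share its vocab.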

adrianeboyd on 17 Nov 2020

@adrianeboyd Thanks a lot for your answer and the time you invested!
Your hint about the vocab was the right thought: for some reason I load the model (in my case 'de_core_news_md') at least twice, which leads to a renaming and the error message.
I tried it out with the output of the original train_textcat.py script and with that I don't get any error messages and can run it. Now I only need to debug why the model is loaded multiple times. Again, thank you! I didn't make the connection to the vocab based on the error message.
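
One possible way to guard against loading a model more than once is to cache the loader. This is only a sketch under assumptions, not the fix that was actually applied: get_pipeline is a hypothetical name, and spacy.blank() stands in for spacy.load() so the snippet runs without an installed model.

```python
from functools import lru_cache

import spacy


@lru_cache(maxsize=None)
def get_pipeline(lang):
    """Build the pipeline once per language code and reuse it afterwards."""
    # spacy.blank() is a stand-in so the sketch runs without a trained model;
    # for a saved pipeline, spacy.load(path) would take its place.
    return spacy.blank(lang)


first = get_pipeline("en")
second = get_pipeline("en")  # served from the cache, no second load
print(first is second, first.vocab is second.vocab)
```

Because every caller receives the same pipeline object, all docs and components obtained through it share a single vocab.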

afftek on 17 Nov 2020

Followup if anyone has similar issues:
Besides the vocabulary issue I had, there is also another issue: when using the ensemble architecture, I receive the same error message as in my first post. With both bow and simple_cnn it works perfectly fine.

afftek on 18 Nov 2020

Hi @aftek, do I understand it correctly that you're still experiencing a (different) issue with the ensemble architecture? If so - could you post a minimal code snippet that reproduces the error?

svlandeg on 7 Dec 2020

> Hi @aftek, do I understand it correctly that you're still experiencing a (different) issue with the ensemble architecture? If so - could you post a minimal code snippet that reproduces the error?

Wrong aftek ;)

@afftek

aftek on 7 Dec 2020

Ooooh, apologies!

svlandeg on 7 Dec 2020

> Hi @aftek, do I understand it correctly that you're still experiencing a (different) issue with the ensemble architecture? If so - could you post a minimal code snippet that reproduces the error?

Yes, I still have issues with ensemble: I get the same error message, but I'm not sure what the underlying problem is.
I will try to find a minimal example that fails and post it as soon as I have it.

afftek on 9 Dec 2020

As there hasn't been any follow-up, I'll close this issue in the meantime, as the original issue was solved. If there are still other problems, please feel free to open a new issue and include a reproducible snippet, thanks!

svlandeg on 22 Dec 2020
