I'm trying to load a self-trained TextCategorizer model and run it on already pre-processed documents (e.g. by using textcat.pipe(docs)).
I started with this tutorial: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py
Training and evaluation work just fine, but when I try to reload the pipeline/model I get stuck.
I cannot save the model on its own, only the whole pipeline (workaround: disable the other components). This shouldn't be an issue: I just load the pipeline and then select the textcat component. But when I try to apply it to documents, I get this error:
File "<..>\Python37\lib\site-packages\thinc\api.py", line 295, in begin_update
X, bp_layer = layer.begin_update(layer.ops.flatten(seqs_in, pad=pad), drop=drop)
File "ops.pyx", line 150, in thinc.neural.ops.Ops.flatten
File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate
My class where I try to load the trained TextCategorizer:
    import spacy
    from spacy.pipeline import TextCategorizer

    class SentenceSplitter:
        def __init__(self, path):
            self.loaded_pipeline = spacy.load(path)
            self.categorizer = self.loaded_pipeline.get_pipe(TextCategorizer.name)

        def __call__(self, doc):
            for sentence in doc.sents:
                self.categorizer.pipe([sentence.as_doc()])  # This is not possible/fails
More or less (e.g. the model is created and not loaded), the exact same thing is done during training and there it works.
Any ideas?
Hi, can you provide a minimal working example that shows this error? Here's a test case that works fine for me locally with v2.2.4, with a model trained with the example train_textcat.py script saved with -o /tmp/textcat:
    import spacy

    class MyComponent:
        def __init__(self, textcat):
            self.textcat = textcat

        def __call__(self, doc):
            for sent_doc in self.textcat.pipe([sent.as_doc() for sent in doc.sents]):
                print(sent_doc.text, sent_doc.cats)
            return doc

    nlp = spacy.load("/tmp/textcat")
    my_component = MyComponent(nlp.get_pipe("textcat"))
    nlp.remove_pipe("textcat")
    nlp.add_pipe(nlp.create_pipe("sentencizer"))
    nlp.add_pipe(my_component)
    text = "This is one sentence. This is another. This movie was terrible."
    doc = nlp(text)
Output (I only trained the textcat model for one iteration, so the performance is not good):
This is one sentence. {'POSITIVE': 0.5069777369499207, 'NEGATIVE': 0.49302226305007935}
This is another. {'POSITIVE': 0.5093674063682556, 'NEGATIVE': 0.4906326234340668}
This movie was terrible. {'POSITIVE': 0.4363318383693695, 'NEGATIVE': 0.5636681914329529}
It's possible you're running into problems related to the vocab, although I wouldn't have expected that particular kind of thinc error in this case. You want to make sure that when you use textcat(doc) or textcat.pipe(docs) that the textcat component and the doc share the same vocab. Either you need to make sure both the textcat component and the main pipeline share the same vocab (as in the example above), or you want to apply a whole separate pipeline to the text in your component, something more like:
    class MyComponent:
        def __init__(self, path):
            self.nlp = spacy.load(path)

        def __call__(self, doc):
            for sent_doc in self.nlp.pipe([sent.text for sent in doc.sents]):
                print(sent_doc.text, sent_doc.cats)
            return doc
textcat.pipe() is only faster if you can provide multiple docs as input so it can process them in batches. If you only have one doc at a time, you can just use nlp.get_pipe("textcat")(doc) instead (__call__ instead of pipe).
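The choice between pipe and __call__ described above can be wrapped in a small helper. This is only a sketch: `categorize` and `StubTextcat` are hypothetical names that are not part of spaCy, and the stub stands in for a trained textcat component so the snippet runs without a model. Returning early on an empty list is a defensive choice of this sketch, not a diagnosis of the error reported in this thread.

```python
def categorize(textcat, docs):
    """Apply a textcat-like component to docs and return them as a list.

    Uses __call__ for a single doc and pipe() for several, so that
    batching is only used when there is something to batch.
    """
    docs = list(docs)
    if not docs:
        # Nothing to score: skip the component entirely.
        return []
    if len(docs) == 1:
        return [textcat(docs[0])]
    return list(textcat.pipe(docs))


class StubTextcat:
    """Hypothetical stand-in that mimics the __call__/pipe interface."""

    def __init__(self):
        self.calls = []

    def __call__(self, doc):
        self.calls.append("call")
        doc["cats"] = {"POSITIVE": 0.5, "NEGATIVE": 0.5}
        return doc

    def pipe(self, docs):
        self.calls.append("pipe")
        for doc in docs:
            doc["cats"] = {"POSITIVE": 0.5, "NEGATIVE": 0.5}
            yield doc


stub = StubTextcat()
categorize(stub, [{"text": "This is one sentence."}])
categorize(stub, [{"text": "This is another."}, {"text": "This movie was terrible."}])
print(stub.calls)  # ['call', 'pipe']
```

With a real pipeline, the first argument would be nlp.get_pipe("textcat") and the docs would be Doc objects created with the same nlp.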
@adrianeboyd Thanks a lot for your answer and the time you invested!
Your hint about the vocab was right: for some reason I load the model (in my case 'de_core_news_md') at least twice, which leads to a renaming and to the error message.
I tried it out with the output of the original train_textcat.py script and with that I don't get any error messages and can run it. Now I only need to debug why the model is loaded multiple times. Again, thank you! I didn't make the connection to the vocab based on the error message.
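For anyone debugging the same thing: as far as I understand, each spacy.load call builds its own pipeline with its own vocab, so a component taken from one load and docs produced by another do not share a vocab. A quick identity check can make that visible before the component runs. This is a minimal sketch under assumptions: `shares_vocab` and `require_shared_vocab` are hypothetical helpers, not spaCy functions, and they only assume that both the component and the doc expose a `.vocab` attribute. The stand-in objects let the snippet run without spaCy or a trained model.

```python
from types import SimpleNamespace


def shares_vocab(component, doc):
    """Return True if the component and the doc hold the very same vocab object."""
    return component.vocab is doc.vocab


def require_shared_vocab(component, doc):
    """Raise a readable error instead of letting the component fail later."""
    if not shares_vocab(component, doc):
        raise ValueError(
            "component and doc use different vocab objects; "
            "was the model loaded more than once?"
        )
    return doc


# Stand-ins (assumption: real components and docs expose .vocab the same way).
vocab_a, vocab_b = object(), object()
textcat_stub = SimpleNamespace(vocab=vocab_a)
doc_same = SimpleNamespace(vocab=vocab_a)    # created by the same pipeline
doc_other = SimpleNamespace(vocab=vocab_b)   # created by a second load

print(shares_vocab(textcat_stub, doc_same))   # True
print(shares_vocab(textcat_stub, doc_other))  # False
```

In a real pipeline the call would look like require_shared_vocab(nlp.get_pipe("textcat"), doc), placed just before the component is applied.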
Followup if anyone has similar issues:
Besides the vocabulary issue I had, there is another issue: when using the ensemble architecture I get the same error message as in my first post. With both bow and simple_cnn it works perfectly fine.
Hi @aftek, do I understand it correctly that you're still experiencing a (different) issue with the ensemble architecture? If so - could you post a minimal code snippet that reproduces the error?
Wrong aftek ;)
@afftek
Ooooh, apologies!
Yes, I do have issues with ensemble: same error message, but I'm not sure what the underlying issue is.
I will try to find the minimal code which fails and post it as soon as I have it.
As there hasn't been any follow-up, I'll close this issue in the meantime, as the original issue was solved. If there are still other problems, please feel free to open a new issue and include a reproducible snippet, thanks!