I am trying to process multiple files in parallel (with a single NLP instantiation) and do sentence segmentation on them. Every process reads a file; every line in that file is a JSON string, and the JSON contains a text field, which I want to segment.
This seems to work fine with small and medium models, but for the large spaCy model I get the error
OSError: [E050] Can't find model 'en_model.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Looking at the English models, I'd expect the small model to be the one without vectors, not the large one. On top of that, I'm not sure why vectors would even be required for this operation. So what is the problem?
I have wondered whether this is actually a memory problem, but running this with 3 threads and 16 GB of RAM, I don't think it should be: even if the model is loaded three times, that should still fit in memory.
Finally, if there is a faster way to do only sentence segmentation than the following, please do let me know. I'm also not sure whether having only one nlp instance is good practice in a multiprocessing context. Should it be copied?
docs = list(self.nlp.pipe(lines))  # lines: the extracted text fields
sents = [sent for doc in docs for sent in doc.sents]  # flatten sentences across documents
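Roughly, the setup is something like this (a stripped-down sketch, not my real code: the file names and the function name segment_file are placeholders; the JSON text field, the three workers, and the single shared nlp come from the description above):

import json
import multiprocessing
import spacy

nlp = spacy.load('en_core_web_lg')  # single shared instance, loaded once in the parent

def segment_file(path):
    # every line of the file is a JSON string with a "text" field
    with open(path) as fh:
        lines = [json.loads(line)['text'] for line in fh]
    docs = list(nlp.pipe(lines))
    return [sent.text for doc in docs for sent in doc.sents]

if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        results = pool.map(segment_file, ['file1.jsonl', 'file2.jsonl', 'file3.jsonl'])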
python debug.py data/raw/ data/articles/ -s en_core_web_lg
That should fail. Use the en_core_web_sm model, and it should run just fine.
I believe this issue is caused by the way the thinc and spacy packages are connected.
In thinc, "load_nlp.py" relies on the global variables VECTORS and SPACY_MODELS to obtain the language models, and these globals are set from within spaCy (the "_ml.py" module sets VECTORS).
Under Spark or other parallel frameworks, global variables are not carried over from the main process into each map/reduce worker, so in the workers these variables end up empty.
You can overcome this by moving the nlp = spacy.load() call into the map/reduce functions. First, keep nlp = spacy.load('en_core_web_lg') in global scope. Inside the map/reduce functions, check whether the 'thinc.extra.load_nlp.VECTORS' dict is empty (this requires importing thinc). If it is empty and you are using a model that contains vectors, call nlp = spacy.load('en_core_web_lg') again; thinc's global variables should then be set again, and this time they persist within each worker.
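In code, the work-around looks roughly like this (just a sketch; the map-function name segment_partition is made up for illustration):

import spacy
import thinc.extra.load_nlp

nlp = spacy.load('en_core_web_lg')  # loaded once in the main process

def segment_partition(lines):
    global nlp
    # inside a worker, thinc's module-level VECTORS dict may come up empty,
    # so load the model again there to repopulate it
    if not thinc.extra.load_nlp.VECTORS:
        nlp = spacy.load('en_core_web_lg')
    return [sent.text for doc in nlp.pipe(lines) for sent in doc.sents]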
Not optimal, but should fix the problem for now.....
@wang159 Thanks for the work-around. I think your analysis is correct.
The way this works really sucks, and I'm annoyed that I've had so much trouble getting a good solution to this seemingly-simple problem.
For future reference when we're working on this, it's worth noting that we should think about both the multiprocessing and the multi-threading cases. I recently fixed a similar problem in Thinc with the operator overloading, which touches a global variable; that introduced a race condition under multi-threading, and I fixed it by making the variable thread-local.
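For reference, the thread-local pattern I mean is roughly this (generic Python, not the actual Thinc code):

import threading

_local = threading.local()

def get_operators():
    # each thread sees its own dict, so concurrent overrides cannot race with each other
    if not hasattr(_local, 'operators'):
        _local.operators = {}
    return _local.operators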
Any updates on the fix for this issue?
@wang159: instead of loading the model again within a worker, is it possible to add the vectors to the empty 'thinc.extra.load_nlp.VECTORS' dict within the worker? Loading the model many times (if there are a lot of workers) will take up a lot of memory, I assume?
Merging with #4349. Thanks for your patience on this, I agree that it's a frustrating bug.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.