I am trying to process multiple files in parallel (with a single NLP instantiation) and do sentence segmentation on them. Every process reads a file; every line in that file is a JSON string, and the JSON contains a text field, which I want to segment.
This seems to work fine with small and medium models, but for the large spaCy model I get the error
OSError: [E050] Can't find model 'en_model.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Looking at the English models, I'd expect the small model to be the one without vectors, not the large one. On top of that, I'm not sure why vectors would even be required for this operation. So what is the problem?
I have wondered whether this is actually a memory problem, but running this with 3 threads and 16 GB of RAM, I don't think it should be: even if the model is loaded three times, that should still fit in memory.
Finally, if there is a faster way to do only sentence segmentation than the following, please do let me know. I'm also not sure whether having only one nlp instance is good practice in a multiprocessing context. Should it be copied?
docs = list(self.nlp.pipe(lines))  # lines: the extracted text fields
sents = [sent for doc in docs for sent in doc.sents]  # flatten sentences across documents
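Roughly, the setup is something like this (a stripped-down sketch, not my real code: the file names and the function name segment_file are placeholders; the JSON text field, the three workers, and the single shared nlp come from the description above):

import json
import multiprocessing
import spacy

nlp = spacy.load('en_core_web_lg')  # single shared instance, loaded once in the parent

def segment_file(path):
    # every line of the file is a JSON string with a "text" field
    with open(path) as fh:
        lines = [json.loads(line)['text'] for line in fh]
    docs = list(nlp.pipe(lines))
    return [sent.text for doc in docs for sent in doc.sents]

if __name__ == '__main__':
    with multiprocessing.Pool(3) as pool:
        results = pool.map(segment_file, ['file1.jsonl', 'file2.jsonl', 'file3.jsonl'])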
python debug.py data/raw/ data/articles/ -s en_core_web_lg
That should fail. Use the en_core_web_sm model, and it should run just fine.
I believe this issue is caused by the way the thinc and spacy packages are connected.
In thinc, "load_nlp.py" relies on the global variables VECTORS and SPACY_MODELS to obtain the language models, and these globals are set from within spaCy (the "_ml.py" module sets VECTORS).
Under Spark or other parallel frameworks, global variables are not carried over from the main process into each map/reduce worker, so in the workers these variables end up empty.
You can overcome this by moving the nlp = spacy.load() call into the map/reduce functions. First, keep nlp = spacy.load('en_core_web_lg') in global scope. Inside the map/reduce functions, check whether the 'thinc.extra.load_nlp.VECTORS' dict is empty (this requires importing thinc). If it is empty and you are using a model that contains vectors, call nlp = spacy.load('en_core_web_lg') again; thinc's global variables should then be set again, and this time they persist within each worker.
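In code, the work-around looks roughly like this (just a sketch; the map-function name segment_partition is made up for illustration):

import spacy
import thinc.extra.load_nlp

nlp = spacy.load('en_core_web_lg')  # loaded once in the main process

def segment_partition(lines):
    global nlp
    # inside a worker, thinc's module-level VECTORS dict may come up empty,
    # so load the model again there to repopulate it
    if not thinc.extra.load_nlp.VECTORS:
        nlp = spacy.load('en_core_web_lg')
    return [sent.text for doc in nlp.pipe(lines) for sent in doc.sents]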
Not optimal, but should fix the problem for now.....
@wang159 Thanks for the work-around. I think your analysis is correct.
The way this works really sucks, and I'm annoyed that I've had so much trouble getting a good solution to this seemingly-simple problem.
For future reference when we're working on this, it's worth noting that we should think about both the multiprocessing and the multi-threading cases. I recently fixed a similar problem in Thinc with the operator overloading, which touches a global variable; that introduced a race condition under multi-threading, and I fixed it by making the variable thread-local.
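For reference, the thread-local pattern I mean is roughly this (generic Python, not the actual Thinc code):

import threading

_local = threading.local()

def get_operators():
    # each thread sees its own dict, so concurrent overrides cannot race with each other
    if not hasattr(_local, 'operators'):
        _local.operators = {}
    return _local.operators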
Any updates on the fix for this issue?
@wang159: instead of loading the model again within a worker, is it possible to add the vectors to the empty 'thinc.extra.load_nlp.VECTORS' dict within the worker? Loading the model many times (if there are a lot of workers) will take up a lot of memory, I assume?
Merging with #4349. Thanks for your patience on this, I agree that it's a frustrating bug.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.