For my application, I'm loading 8 language models simultaneously. The models are loaded as follows:
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import List

def init_nlp_provider(language):
    logging.debug(f'Language with "{language.language_code}": Loading...')
    language.init_nlp_provider()
    logging.debug(f'Language with "{language.language_code}": Done')

def initialize_languages() -> List[ALanguage]:
    languages = [
        DutchLanguage(),
        EnglishLanguage(),
        FrenchLanguage(),
        GermanLanguage(),
        GreekLanguage(),
        ItalianLanguage(),
        PortugueseLanguage(),
        SpanishLanguage(),
    ]

    with ThreadPoolExecutor(max_workers=1) as executor:
        for language in languages:
            executor.submit(init_nlp_provider, language)

    return languages
Each of the language wrappers above simply delegates to spacy.load(...), as illustrated by the implementation of the EnglishLanguage class below:
def init_nlp_provider(self):
    self._nlp_provider = spacy.load('en_core_web_sm')
This all works well when max_workers=1; however, it takes almost 10 minutes to load all of the language models.
The real problem starts when I set max_workers to 2 or more, which results in the following exception:
Undefined operator: >>
Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x000001BAFD3CCD08>, <thinc.neural._classes.feed_forward.FeedForward object at 0x000001BAFD3833C8>)
Available:
Traceback:
├─ <lambda> [782] in C:\Python37\lib\site-packages\spacy\language.py
├─── from_disk [611] in C:\Python37\lib\site-packages\spacy\util.py
└───── build_tagger_model [511] in C:\Python37\lib\site-packages\spacy\_ml.py
>>> pretrained_vectors=pretrained_vectors,
Should be fixed by v2.1.8. The issue was that Thinc used a global variable in a context manager when defining its models. I added some logic to make that a thread-local variable.
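For illustration, here is a minimal sketch of that failure mode, using a hypothetical module-level operator table rather than thinc's actual code:

# Hypothetical shared operator table (NOT thinc's real internals, just the pattern at play).
_operators = {}

class define_operators:
    def __init__(self, ops):
        self.ops = ops

    def __enter__(self):
        self.saved = dict(_operators)   # save the current table
        _operators.update(self.ops)     # install operators for this block

    def __exit__(self, *exc):
        _operators.clear()              # restore the saved table
        _operators.update(self.saved)

If thread A is still building a model inside the `with define_operators({...}):` block when thread B's __exit__ runs, B clears the shared table out from under A, and A then fails with errors like "Undefined operator: >>" or "Undefined operator: |". Making that state thread-local removes the race.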
I'm sorry to say that I still believe this to be a problem on v2.1.8.
In addition to the setup above, I have now prepared a main.py that you can simply run.
import time
from concurrent.futures import ThreadPoolExecutor

import spacy

MAX_WORKERS = None

model_names = [
    'nl_core_news_sm',
    'en_core_web_sm',
    'fr_core_news_sm',
    'de_core_news_sm',
    'el_core_news_sm',
    'it_core_news_sm',
    'pt_core_news_sm',
    'es_core_news_sm',
]

models = [None] * len(model_names)

def get_time() -> str:
    return time.strftime("%H:%M:%S:%MS", time.localtime())

def init_nlp_provider(index: int, model_name: str):
    print(f'\n{get_time()}: Language with "{model_name}": Loading...', flush=True)
    models[index] = spacy.load(model_name)
    print(f'\n{get_time()}: Language with "{model_name}": Done', flush=True)

def main():
    print(spacy.info())
    print(f"{get_time()}: Loading models...")

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for index, model_name in enumerate(model_names):
            executor.submit(init_nlp_provider, index, model_name)

    print(f"{get_time()}: Loaded all models.")

if __name__ == "__main__":
    main()
This produces the following output:
python .\main.py
============================== Info about spaCy ==============================
spaCy version 2.1.8
Location C:\Python37\lib\site-packages\spacy
Platform Windows-10-10.0.18990-SP0
Python version 3.7.4
Models
{'spaCy version': '2.1.8', 'Location': 'C:\\Python37\\lib\\site-packages\\spacy', 'Platform': 'Windows-10-10.0.18990-SP0', 'Python version': '3.7.4', 'Models': ''}
12:59:55:59S: Loading models...
12:59:55:59S: Language with "nl_core_news_sm": Loading...
12:59:55:59S: Language with "en_core_web_sm": Loading...
12:59:55:59S: Language with "fr_core_news_sm": Loading...
12:59:55:59S: Language with "de_core_news_sm": Loading...
12:59:55:59S: Language with "el_core_news_sm": Loading...
12:59:55:59S: Language with "it_core_news_sm": Loading...
12:59:55:59S: Language with "pt_core_news_sm": Loading...
12:59:55:59S: Language with "es_core_news_sm": Loading...
13:00:02:00S: Language with "it_core_news_sm": Done
13:00:02:00S: Language with "pt_core_news_sm": Done
13:00:02:00S: Language with "nl_core_news_sm": Done
13:00:02:00S: Language with "es_core_news_sm": Done
13:00:02:00S: Language with "el_core_news_sm": Done
13:00:02:00S: Language with "fr_core_news_sm": Done
13:00:02:00S: Loaded all models.
You'll notice that while 8 models are indicated as "Loading...", only 6 are marked "Done".
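As an aside, executor.submit() only surfaces a worker's exception when result() is called on the returned future, which is why the two failing loads disappear silently in the script above. A small variation (reusing the same init_nlp_provider and model_names) would re-raise those exceptions in the main thread:

from concurrent.futures import ThreadPoolExecutor, as_completed

# Same loop as in main(), but collecting the futures and calling result(),
# which re-raises any exception thrown inside init_nlp_provider.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [
        executor.submit(init_nlp_provider, index, model_name)
        for index, model_name in enumerate(model_names)
    ]
    for future in as_completed(futures):
        future.result()  # raises here if the corresponding load failed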
When running from my actual application, (not the test script I provided), I receive the following stack trace:
File "C:\Python37\lib\site-packages\spacy\__init__.py", line 27, in load
return util.load_model(name, **overrides)
File "C:\Python37\lib\site-packages\spacy\util.py", line 134, in load_model
return load_model_from_package(name, **overrides)
File "C:\Python37\lib\site-packages\spacy\util.py", line 155, in load_model_from_package
return cls.load(**overrides)
File "C:\Python37\lib\site-packages\nl_core_news_sm\__init__.py", line 12, in load
return load_model_from_init_py(__file__, **overrides)
File "C:\Python37\lib\site-packages\spacy\util.py", line 196, in load_model_from_init_py
return load_model_from_path(data_path, meta, **overrides)
File "C:\Python37\lib\site-packages\spacy\util.py", line 179, in load_model_from_path
return nlp.from_disk(model_path)
File "C:\Python37\lib\site-packages\spacy\language.py", line 836, in from_disk
util.from_disk(path, deserializers, exclude)
File "C:\Python37\lib\site-packages\spacy\util.py", line 636, in from_disk
reader(path / key)
File "C:\Python37\lib\site-packages\spacy\language.py", line 831, in <lambda>
p, exclude=["vocab"]
File "pipes.pyx", line 641, in spacy.pipeline.pipes.Tagger.from_disk
File "C:\Python37\lib\site-packages\spacy\util.py", line 636, in from_disk
reader(path / key)
File "pipes.pyx", line 620, in spacy.pipeline.pipes.Tagger.from_disk.load_model
File "pipes.pyx", line 530, in spacy.pipeline.pipes.Tagger.Model
File "C:\Python37\lib\site-packages\spacy\_ml.py", line 523, in build_tagger_model
pretrained_vectors=pretrained_vectors,
File "C:\Python37\lib\site-packages\spacy\_ml.py", line 361, in Tok2Vec
(norm | prefix | suffix | shape)
File "C:\Python37\lib\site-packages\thinc\check.py", line 129, in checker
raise UndefinedOperatorError(op, instance, args[0], instance._operators)
thinc.exceptions.UndefinedOperatorError:
Undefined operator: |
Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x000001A6847EE148>, <thinc.neural._classes.hash_embed.HashEmbed object at 0x000001A68CAF5988>)
Available:
Traceback:
├─ <lambda> [831] in C:\Python37\lib\site-packages\spacy\language.py
├─── from_disk [636] in C:\Python37\lib\site-packages\spacy\util.py
└───── build_tagger_model [523] in C:\Python37\lib\site-packages\spacy\_ml.py
>>> pretrained_vectors=pretrained_vectors,
Would you please run the test script to confirm the problem?
I believe the problem does not manifest with MAX_WORKERS=1.
I think this is a duplicate of #3690 (the same issue Matt mentioned above). I can replicate it just by loading multiple en models.
~FYI: I just upgraded to 2.2 and this issue no longer appears to be a problem.~
Sorry. I spoke too soon.
No, I tested with 2.2, too. It's clearly not deterministic.
Cross-referencing #3552 here as well, as the two issues should probably be solved together.
I don't think it's a particularly good solution, but I think a quick way to fix this at least temporarily is to make the saved copy of the operators thread-local, too. Otherwise it looks like it's getting clobbered.
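A rough sketch of the idea, again using a hypothetical operator table rather than thinc's real internals: both the active table and the saved copy live in thread-local storage, so one thread's __exit__ can no longer clobber another thread's operators.

import threading

_state = threading.local()  # hypothetical per-thread storage, not thinc's real API

def _get_operators():
    if not hasattr(_state, "operators"):
        _state.operators = {}
    return _state.operators

class define_operators:
    def __init__(self, ops):
        self.ops = ops

    def __enter__(self):
        current = _get_operators()
        _state.saved = dict(current)   # the saved copy is per-thread too
        current.update(self.ops)

    def __exit__(self, *exc):
        _state.operators = _state.saved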
Hey guys,
I am reporting the same issue as @ericmclachlan.
I am running an ML API with gunicorn, and spaCy is called during preprocessing. Unfortunately, I get an error when loading spaCy with the number of gunicorn workers set to more than one. The following simple code leads to the same result:
Code:
import spacy
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/", methods=['GET', 'POST'])
def predictor():
    text = request.form['text']
    nlp = spacy.load('xx_ent_wiki_sm')
    doc = nlp(text)
    return jsonify([str(e) for e in doc.ents if e.label_ == 'PER'])
However, this error does not occur anymore when the number of gunicorn workers is set to 1.
Config:
python 3.7.5
gunicorn==19.9.0
spacy==2.2.3
thinc==7.3.1
Error:
Undefined operator: >>
Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x3e84bf6cee90>, <thinc.neural._classes.feed_forward.FeedForward object at 0x3e84bf2d9610>)
Available:
Traceback:
├─ from_disk [654] in /usr/local/lib/python3.7/site-packages/spacy/util.py
├─── <lambda> [936] in /usr/local/lib/python3.7/site-packages/spacy/language.py
└───── Tok2Vec [323] in /usr/local/lib/python3.7/site-packages/spacy/_ml.py
The way I found to get around this is to set the gunicorn worker class to sync.
This is a bug related to multithreading in thinc and should hopefully be fixed soon by this PR: https://github.com/explosion/thinc/pull/124
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.