Spacy: Exception When Loading SpaCy Models in Parallel Threads

Created on 1 Oct 2019 · 10 comments · Source: explosion/spaCy

How to Reproduce the Behavior

For my application, I'm loading 8 language models simultaneously. The models are loaded as follows:

import logging
from concurrent.futures import ThreadPoolExecutor
from typing import List

def init_nlp_provider(language):
    logging.debug(f'Language with "{language.language_code}": Loading...')
    language.init_nlp_provider()
    logging.debug(f'Language with "{language.language_code}": Done')

def initialize_languages() -> List[ALanguage]:
    # One wrapper per language; each wrapper loads its own spaCy model.
    languages = [
        DutchLanguage(),
        EnglishLanguage(),
        FrenchLanguage(),
        GermanLanguage(),
        GreekLanguage(),
        ItalianLanguage(),
        PortugueseLanguage(),
        SpanishLanguage(),
    ]
    # Load all models from a thread pool (max_workers=1 works; >1 fails).
    with ThreadPoolExecutor(max_workers=1) as executor:
        for language in languages:
            executor.submit(init_nlp_provider, language)

    return languages

Each of the language wrappers above quickly delegates to spacy.load(...) as illustrated by the implementation for the EnglishLanguage class below:

    def init_nlp_provider(self):
        self._nlp_provider = spacy.load('en_core_web_sm')

This all works well when max_workers=1; however, it takes almost 10 minutes to load all the language models.

The real problem starts when I set max_workers to 2 or more, which results in the following exception:

Undefined operator: >>
  Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x000001BAFD3CCD08>, <thinc.neural._classes.feed_forward.FeedForward object at 0x000001BAFD3833C8>)
  Available: 

  Traceback:
  ├─ <lambda> [782] in C:\Python37\lib\site-packages\spacy\language.py
  ├─── from_disk [611] in C:\Python37\lib\site-packages\spacy\util.py
  └───── build_tagger_model [511] in C:\Python37\lib\site-packages\spacy\_ml.py
         >>> pretrained_vectors=pretrained_vectors,

Your Environment

  • spaCy version: 2.1.3
  • Platform: Windows-10-10.0.18990-SP0
  • Python version: 3.7.4

Specific Questions

  1. Should it be possible to load all language models concurrently as I've tried to do above?
  2. And, if not, what is the sub-10-minute solution to loading all language models?
Labels: bug, models, perf / speed, scaling

All 10 comments

Should be fixed by v2.1.8. The issue was that Thinc used a global variable in a context manager when defining its models. I added some logic to make that a thread-local variable.
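For readers unfamiliar with the pattern described above, here is a minimal sketch of moving module-level mutable state into thread-local storage so each thread sees its own copy. The names (`_state`, `get_operators`, `define_operators`) are illustrative assumptions and do not reproduce Thinc's actual internals.

import threading
from contextlib import contextmanager

# Each thread gets its own attribute namespace instead of sharing one global.
_state = threading.local()

def get_operators() -> dict:
    # Initialise lazily so threads created later also start with a fresh dict.
    if not hasattr(_state, "operators"):
        _state.operators = {}
    return _state.operators

@contextmanager
def define_operators(operators: dict):
    # Swap in the new table and restore the previous one on exit, all within
    # this thread's own storage, so concurrent threads cannot clobber it.
    saved = get_operators()
    _state.operators = dict(operators)
    try:
        yield
    finally:
        _state.operators = saved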

I'm sorry to say that I still believe this to be a problem on v2.1.8.

In addition to the setup above, I have now prepared a main.py that you can simply run.

import time

from concurrent.futures import ThreadPoolExecutor

import spacy

MAX_WORKERS = None  # None => ThreadPoolExecutor picks its default worker count

model_names = [
    'nl_core_news_sm',
    'en_core_web_sm',
    'fr_core_news_sm',
    'de_core_news_sm',
    'el_core_news_sm',
    'it_core_news_sm',
    'pt_core_news_sm',
    'es_core_news_sm',
]

models = [None] * len(model_names)

def get_time() -> str:
    # Note: "%MS" repeats the minutes, hence timestamps like "12:59:55:59S" below.
    return time.strftime("%H:%M:%S:%MS", time.localtime())

def init_nlp_provider(index: int, model_name: str):
    print(f'\n{get_time()}: Language with "{model_name}": Loading...', flush=True)
    models[index] = spacy.load(model_name)
    print(f'\n{get_time()}: Language with "{model_name}": Done', flush=True)


def main():

    print(spacy.info())

    print(f"{get_time()}: Loading models...")

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        for index, model_name in enumerate(model_names):
            executor.submit(init_nlp_provider, index, model_name)

    print(f"{get_time()}: Loaded all models.")

if __name__ == "__main__":
    main()

This produces the following output:

python .\main.py

============================== Info about spaCy ==============================

spaCy version    2.1.8
Location         C:\Python37\lib\site-packages\spacy
Platform         Windows-10-10.0.18990-SP0
Python version   3.7.4
Models

{'spaCy version': '2.1.8', 'Location': 'C:\\Python37\\lib\\site-packages\\spacy', 'Platform': 'Windows-10-10.0.18990-SP0', 'Python version': '3.7.4', 'Models': ''}
12:59:55:59S: Loading models...
12:59:55:59S: Language with "nl_core_news_sm": Loading...
12:59:55:59S: Language with "en_core_web_sm": Loading...
12:59:55:59S: Language with "fr_core_news_sm": Loading...
12:59:55:59S: Language with "de_core_news_sm": Loading...
12:59:55:59S: Language with "el_core_news_sm": Loading...
12:59:55:59S: Language with "it_core_news_sm": Loading...
12:59:55:59S: Language with "pt_core_news_sm": Loading...
12:59:55:59S: Language with "es_core_news_sm": Loading...
13:00:02:00S: Language with "it_core_news_sm": Done
13:00:02:00S: Language with "pt_core_news_sm": Done
13:00:02:00S: Language with "nl_core_news_sm": Done
13:00:02:00S: Language with "es_core_news_sm": Done
13:00:02:00S: Language with "el_core_news_sm": Done
13:00:02:00S: Language with "fr_core_news_sm": Done
13:00:02:00S: Loaded all models.

You'll notice that while 8 models are indicated as "Loading...", only 6 are marked "Done".

When running from my actual application (not the test script I provided), I receive the following stack trace:

  File "C:\Python37\lib\site-packages\spacy\__init__.py", line 27, in load
    return util.load_model(name, **overrides)
  File "C:\Python37\lib\site-packages\spacy\util.py", line 134, in load_model
    return load_model_from_package(name, **overrides)
  File "C:\Python37\lib\site-packages\spacy\util.py", line 155, in load_model_from_package
    return cls.load(**overrides)
  File "C:\Python37\lib\site-packages\nl_core_news_sm\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "C:\Python37\lib\site-packages\spacy\util.py", line 196, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "C:\Python37\lib\site-packages\spacy\util.py", line 179, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\Python37\lib\site-packages\spacy\language.py", line 836, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\Python37\lib\site-packages\spacy\util.py", line 636, in from_disk
    reader(path / key)
  File "C:\Python37\lib\site-packages\spacy\language.py", line 831, in <lambda>
    p, exclude=["vocab"]
  File "pipes.pyx", line 641, in spacy.pipeline.pipes.Tagger.from_disk
  File "C:\Python37\lib\site-packages\spacy\util.py", line 636, in from_disk
    reader(path / key)
  File "pipes.pyx", line 620, in spacy.pipeline.pipes.Tagger.from_disk.load_model
  File "pipes.pyx", line 530, in spacy.pipeline.pipes.Tagger.Model
  File "C:\Python37\lib\site-packages\spacy\_ml.py", line 523, in build_tagger_model
    pretrained_vectors=pretrained_vectors,
  File "C:\Python37\lib\site-packages\spacy\_ml.py", line 361, in Tok2Vec
    (norm | prefix | suffix | shape)
  File "C:\Python37\lib\site-packages\thinc\check.py", line 129, in checker
    raise UndefinedOperatorError(op, instance, args[0], instance._operators)
thinc.exceptions.UndefinedOperatorError:

  Undefined operator: |
  Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x000001A6847EE148>, <thinc.neural._classes.hash_embed.HashEmbed object at 0x000001A68CAF5988>)
  Available:

  Traceback:
  ├─ <lambda> [831] in C:\Python37\lib\site-packages\spacy\language.py
  ├─── from_disk [636] in C:\Python37\lib\site-packages\spacy\util.py
  └───── build_tagger_model [523] in C:\Python37\lib\site-packages\spacy\_ml.py
         >>> pretrained_vectors=pretrained_vectors,

Would you please run the test script to confirm the problem?

I believe the problem does not manifest with MAX_WORKERS = 1.
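As a stopgap while the underlying issue is open, one option is to keep the thread pool but serialize only the `spacy.load(...)` calls behind a lock. This is an untested sketch that modifies `init_nlp_provider` from the test script above; it is not an official workaround.

import threading

_load_lock = threading.Lock()

def init_nlp_provider(index: int, model_name: str):
    # Only the model-loading step is serialized; any later per-model work
    # submitted to the pool could still run concurrently.
    with _load_lock:
        models[index] = spacy.load(model_name)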

I think this is a duplicate of #3690 (the same issue Matt mentioned above). I can replicate it just loading multiple en models.

~FYI: I just upgraded to 2.2 and this issue no longer appears to be a problem.~
Sorry, I spoke too soon.

No, I tested with 2.2, too. It's clearly not deterministic.

Cross-referencing #3552 here as well, as the two issues should probably be solved together.

I don't think it's a particularly good solution, but a quick way to fix this, at least temporarily, is to make the saved copy of the operators thread-local too; otherwise it looks like it's getting clobbered.

https://github.com/explosion/thinc/blob/3f6220f4cc39c715069f2a4ef0795c218262535f/thinc/neural/_classes/model.py#L37-L57
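To make the clobbering concrete, here is a toy illustration (not Thinc's code; the `Model` class and sleeps are purely for demonstration) of why a shared, class-level operator table and its saved copy break under threads: one thread's `with` block can swap the table out from under another, so an operator that was just defined appears "undefined".

import threading
import time
from contextlib import contextmanager

class Model:
    # Shared, class-level operator table: this is the problem.
    _operators = {}

    @classmethod
    @contextmanager
    def define_operators(cls, operators):
        saved = dict(cls._operators)   # the saved copy is ALSO shared
        cls._operators = operators
        time.sleep(0.1)                # widen the race window for the demo
        try:
            yield
        finally:
            cls._operators = saved     # may restore another thread's snapshot

def worker(name):
    with Model.define_operators({name: True}):
        time.sleep(0.2)
        # Under concurrency, another thread has likely replaced the table,
        # so the operator this thread defined can be missing here.
        print(name, "sees operators:", Model._operators)

threads = [threading.Thread(target=worker, args=(op,)) for op in (">>", "|")]
for t in threads:
    t.start()
for t in threads:
    t.join()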

Hey guys,

I am reporting the same issue as @ericmclachlan.

I am running an ML API with gunicorn, and spaCy is called during preprocessing. Unfortunately, I hit an error when loading spaCy with the number of gunicorn workers set to more than one. The following simple code leads to the same result:

Code:

from flask import Flask, request, jsonify
import spacy

app = Flask(__name__)

@app.route("/", methods=['GET', 'POST'])
def predictor():
    text = request.form['text']
    nlp = spacy.load('xx_ent_wiki_sm')  # loaded on every request
    doc = nlp(text)
    return jsonify([str(e) for e in doc.ents if e.label_ == 'PER'])

However, this error does not occur anymore when the number of gunicorn workers is set to 1.

Config:

python 3.7.5
gunicorn==19.9.0
spacy==2.2.3
thinc==7.3.1

Error:

Undefined operator: >>
  Called by (<thinc.neural._classes.function_layer.FunctionLayer object at 0x3e84bf6cee90>, <thinc.neural._classes.feed_forward.FeedForward object at 0x3e84bf2d9610>)
  Available:

Traceback:
├─ from_disk [654] in /usr/local/lib/python3.7/site-packages/spacy/util.py
├─── <lambda> [936] in /usr/local/lib/python3.7/site-packages/spacy/language.py
└───── Tok2Vec [323] in /usr/local/lib/python3.7/site-packages/spacy/_ml.py

The way I found to get around this is to set the gunicorn worker class to sync.
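For reference, the worker class can be set in a gunicorn config file, which is plain Python; the filename and worker count below are placeholders for your own setup.

# gunicorn.conf.py -- run with: gunicorn -c gunicorn.conf.py app:app
# "sync" workers are single-threaded processes, so each worker uses its own
# spaCy model without sharing it across threads.
worker_class = "sync"
workers = 4  # placeholder: tune to your CPU and memory budget

Independently of the threading issue, loading the model once at worker start-up (rather than inside the request handler, as in the snippet above) avoids paying the spacy.load cost on every request.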

This is a bug related to multithreading in thinc and should hopefully be fixed soon by this PR: https://github.com/explosion/thinc/pull/124

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
