Spacy: Error Using SpaCy in Async Threads ([E050] Can't find model 'en_core_web_md.vectors')

Created on 22 Apr 2020 · 15 comments · Source: explosion/spaCy

The Problem

I am loading the en_core_web_md spaCy model in the main thread and passing it as an argument to async worker threads. When I then call doc = nlp(text) in one of those threads, I get the error message OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

As far as I understand, spaCy should be thread safe, and this error only occurs with models that use word vectors. Indeed, the error does not arise with the en_core_web_sm model, but it persists even when loading the en_core_web_md or en_core_web_lg models with the vectors=False parameter set.

Code Example:

spacy_nlp = spacy.load("en_core_web_md", vectors=False)
for i, file in enumerate(files):
        application.pool.apply_async(
            my_function,
            args=[file, i, spacy_nlp],
            kwds=configs,
            callback=success_callback_factory,
            error_callback=error_callback_factory)

Observations

This error is not present when I try to use en_core_web_sm (which has no word vectors).
However, it still occurs when I load the model with nlp = spacy.load(spacy_model,vectors=False).
I get the same problem when trying to use the large model.

Environment

  • Operating System: Windows 10
  • Python Version Used: 3.7.6
  • spaCy Version Used: ~~2.2.3~~ 2.2.4 (edited: the problem persists even after upgrading spaCy)


All 15 comments

Hi, I think this has been fixed in v2.2.4 (in #5081). Can you try upgrading to see if this fixes the problem?

Hi @adrianeboyd, I've just updated my spaCy to v2.2.4 (using pip, because v2.2.4 doesn't seem to be available on conda yet) and I am still having the issue (with vectors set to both True and False).

Hmm, can you provide a full example that shows the error? (What is application?)

Sure.
The application part is there because we are running a Flask application, so application = Flask(__name__), and spaCy is part of an endpoint @application.route('/extract_spacy_entities', methods=['POST']). What I am trying to do here is deploy a service that uses spaCy for NER.
Something along the lines of:

@application.route('/extract_spacy_entities', methods=['POST'])
def extract_spacy_entities():

    folder = request.form.get('folder', '')

    """
    Some data-extraction related code ...
    """  

    spacy_nlp = spacy.load("en_core_web_md", vectors=False)
    for i, file in enumerate(files):
            application.pool.apply_async(
                my_function,
                args=[file, i, spacy_nlp],
                kwds=configs,
                callback=success_callback_factory,
                error_callback=error_callback_factory)

    return make_response(
        jsonify({
            'message': constant_strings.SUCCESS_MESSAGE,
            'file_keys': files,
            'output_path': get_output_folder(folder)}),
        200)

Could you check whether you have problems running a simpler example in the same environment?

import spacy
nlp = spacy.load("en_core_web_md")
texts = ["This is a sentence."] * 100
for doc in nlp.pipe(texts, n_process=2):
    print(doc[0].vector[:4])

If it doesn't work, could you also try installing spacy with pip in a new, clean virtual environment? It's possible that the upgrade didn't run cleanly with the mix of conda and pip.

If this example works and the flask application still doesn't work, then I'm not sure exactly what's going on and it'd be helpful to see a minimal working flask example so it's easier for us to debug.

@adrianeboyd I've created a new clean conda environment and installed spacy and the en_core_web_md model from pip, all from scratch. Your example does not work. Here is the output I get from it:

(spacy_clean) λ python test.py
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\gsevrodrigues\Desktop\spacyTest\test.py", line 4, in <module>
    for doc in nlp.pipe(texts, n_process=2):
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 819, in pipe
    for doc in docs:
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 865, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\popen_spawn_win32.py", line 46, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    for doc in nlp.pipe(texts, n_process=2):
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 819, in pipe
    for doc in docs:
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 865, in _multiprocessing_pipe
    proc.start()
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\gsevrodrigues\Anaconda3\envs\spacy_clean\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
BrokenPipeError: [Errno 32] Broken pipe

Checking the spaCy version at Python in the same environment:

(spacy_clean) λ python
Python 3.7.7 (default, Apr 15 2020, 05:09:04) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32  
Type "help", "copyright", "credits" or "license" for more information.                                
>>> import spacy                                                                                                                                   
>>> spacy.__version__                                                                                 
'2.2.4'                 

Sorry, I didn't test the example as a standalone script. If you run it this way, you need to follow the structure mentioned in the error message:

import spacy

if __name__ == '__main__':
    nlp = spacy.load("en_core_web_md")
    texts = ["This is a sentence."] * 100
    for doc in nlp.pipe(texts, n_process=2):
        print(doc[0].vector[:4])

@adrianeboyd Thank you.

I managed to find an example that works and one that replicates the error using flask applications.

This code (based on your example) works just fine:

from flask import Flask, make_response, request, jsonify
import spacy

application = Flask(__name__)

@application.route('/spacy_test', methods=['GET'])
def extract_spacy_entities():
    nlp = spacy.load("en_core_web_md")
    texts = ["This is a sentence."] * 100
    for doc in nlp.pipe(texts, n_process=2):
        print(doc[0].vector[:4])
    return make_response("ok", 200)

if __name__ == '__main__':
    application.run()

The following code does not work, and provoked the same error I am experiencing on my system:

import os
import sys
import traceback
from multiprocessing import Pool
from flask import Flask, make_response, request, jsonify
import spacy

application = Flask(__name__)

def format_exception():
    exc_type, exc_value, exc_tb = sys.exc_info()
    return ''.join(traceback.format_exception(exc_type, exc_value, exc_tb))

def extract_print(text, spacy_nlp):
    try:
        doc = spacy_nlp(text)
        print(doc[0].vector[:4])
    except Exception as e:
        with open("error.txt", 'w') as f:
            error = format_exception()
            f.write(error)

@application.route('/spacy_test', methods=['GET'])
def extract_spacy_entities():
    application.pool = Pool(1)

    nlp = spacy.load("en_core_web_md")
    texts = ["This is a sentence."] * 100

    for text in texts:
        application.pool.apply_async(
            extract_print,
            args=[text, nlp])

    return make_response("ok", 200)

if __name__ == '__main__':
    application.run()

(I do receive the "ok" response with code 200, but errors occur in extract_print; they are caught as exceptions and written to the error.txt file.)

Here follows the exception error message saved at error.txt:

Traceback (most recent call last):
  File "C:\Users\gsevrodrigues\Desktop\spacyTest\test_service_async.py", line 18, in extract_print
    doc = spacy_nlp(text)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\language.py", line 439, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 415, in spacy.pipeline.pipes.Tagger.predict
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\model.py", line 167, in __call__
    return self.predict(x)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\feed_forward.py", line 40, in predict
    X = layer(X)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\model.py", line 167, in __call__
    return self.predict(x)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 310, in predict
    X = layer(layer.ops.flatten(seqs_in, pad=pad))
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\model.py", line 167, in __call__
    return self.predict(x)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\feed_forward.py", line 40, in predict
    X = layer(X)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\model.py", line 167, in __call__
    return self.predict(x)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\model.py", line 131, in predict
    y, _ = self.begin_update(X, drop=None)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 379, in uniqued_fwd
    Y_uniq, bp_Y_uniq = layer.begin_update(X_uniq, drop=drop)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\feed_forward.py", line 46, in begin_update
    X, inc_layer_grad = layer.begin_update(X, drop=drop)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in begin_update
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 163, in <listcomp>
    values = [fwd(X, *a, **k) for fwd in forward]
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\api.py", line 256, in wrap
    output = func(*args, **kwargs)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\static_vectors.py", line 60, in begin_update
    vector_table = self.get_vectors()
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\neural\_classes\static_vectors.py", line 55, in get_vectors
    return get_vectors(self.ops, self.lang)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\extra\load_nlp.py", line 26, in get_vectors
    nlp = get_spacy(lang)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\thinc\extra\load_nlp.py", line 14, in get_spacy
    SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\__init__.py", line 30, in load
    return util.load_model(name, **overrides)
  File "C:\Users\gsevrodrigues\AppData\Roaming\Python\Python37\site-packages\spacy\util.py", line 169, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_md.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Ah, I realized what I overlooked initially. We fixed this for nlp.pipe(), but if you're using nlp with your own multiprocessing that uses spawn, it's still not going to work. You'll need to basically do the same thing as in that patch in your own method: pass load_nlp.VECTORS and restore it in the method with load_nlp.VECTORS = vectors:

https://github.com/explosion/spaCy/pull/5081/files

(Be aware that multiprocessing with spawn and larger spacy models is probably going to be rather slow.)

Thank you for your support on that issue @adrianeboyd.

The workaround with load_nlp.VECTORS worked, now the threads are extracting entities with no apparent issues.
But you are right regarding the performance, even with a medium model the multiprocessing with spawn is far from being the most efficient.

Do you think this bug will be dealt with in the short term? Using nlp.pipe() is certainly an option, but it doesn't quite fit the architecture I had in mind for our project.

In terms of speed, spawn is the problem more than spaCy. There are some known issues with the vocab/vectors (which you just worked around), but otherwise it's just very slow to start child processes with spawn, which is the only option on Windows. On Linux the default is fork, which is much faster to start and doesn't have the vectors issue, because more of the global state is shared with the child processes. Once the child processes have started, though, I think the differences between fork and spawn may not be that large. Try it out with a longer-running scenario to see how well it works?

nlp.pipe() will be much faster if you're processing multiple texts in one request. If it's just one text at a time, then nlp() won't be any different from nlp.pipe(), though. It depends on the model size and how long your texts / batches are, so I'd run some timing tests there, too, to see what works best.

For the global vector state issue, I don't think there's going to be a better solution in spacy v2 than the load_nlp workaround above, but there should be improvements in v3.

Hi, I am using en_core_web_lg 2.3.1 and got the same problem. Can you share your solutions with me?
Thanks,


Hey, what worked for me was that solution from @adrianeboyd:

Ah, I realized what I overlooked initially. We fixed this for nlp.pipe(), but if you're using nlp with your own multiprocessing that uses spawn, it's still not going to work. You'll need to basically do the same thing as in that patch in your own method: pass load_nlp.VECTORS and restore it in the method with load_nlp.VECTORS = vectors:

https://github.com/explosion/spaCy/pull/5081/files

(Be aware that multiprocessing with spawn and larger spacy models is probably going to be rather slow.)

Basically, I had to pass load_nlp.VECTORS from the place where the original nlp is loaded to the newly spawned function, and restore it there with load_nlp.VECTORS = vectors. The linked code shows how to do it.

Can you share your code? I am not very sure about the "spawned function".

We're facing this issue when running spaCy inside a FastAPI web service. Since the multiprocessing is happening under the hood of FastAPI, implementing the above workaround isn't straightforward. Any advice? And is this issue resolved in spaCy 3.0? Thanks.
