It would be great if spaCy had functions to list all installed models and to check if a particular model is installed.
My use case: when the user of my script specifies a model that spacy.load doesn't find, I try to install the model via the cli module. I am using multiprocessing, and since I do not want every process to download the model, I run this code before I start the pool. As it stands, however, I load the model twice (once in the main process and once in the worker processes).
It would be much nicer if I could just check in the main process if the model is installed, download it if not, and only load the model in the worker processes. Alternatively, I would argue that just listing the models available is a useful feature.
I know that the cli module has some functions that do something similar; validate(), for instance, lists the models. However, it prints the results to stdout instead of returning data structures we could work with in Python. I believe that separating the model and the view would make for a better user (programmer) experience.
spaCy models are regular Python packages, so assuming you're not relying on shortcut links (which I wouldn't recommend for your use case), you can use the same methods you'd use to check if any other packages are installed (see this Stack Overflow thread for examples).
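As a minimal sketch of that generic approach (standard library only, nothing spaCy-specific): importlib.util.find_spec reports whether a top-level package can be imported without actually importing it, so no model gets loaded. The helper name is_installed is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found on the import path.

    Hypothetical helper: it only locates the package and does not import
    or load it, so checking a model package this way stays cheap.
    """
    return importlib.util.find_spec(package_name) is not None


print(is_installed("json"))          # standard library module
print(is_installed("cdkfodfkoskd"))  # made-up name
```

This only tells you that a package with that name exists, not that it is a spaCy model or that it is compatible with the installed spaCy version.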
spaCy also has an internal utility function is_package:
assert spacy.util.is_package("en_core_web_sm")
assert not spacy.util.is_package("cdkfodfkoskd")
Listing all models is a bit trickier, because you can't necessarily know whether an installed package is a spaCy model or not (especially not if you're working with custom models). One thing you could do is fetch the compatibility table, get the current spaCy version (spacy.__version__) and then check which installed packages are in the keys.
The download command has a helper for this that you could experiment with:
from spacy.cli.download import get_compatibility
compat_current_version = get_compatibility()
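A rough sketch of the "check which installed packages are in the keys" step, assuming the compatibility table for the current version is a dict keyed by model package name (which is what get_compatibility() appears to return, but treat that as an assumption). The function name installed_models_from_compat is hypothetical, and the installed packages are listed with the standard library's importlib.metadata rather than any spaCy API.

```python
from importlib.metadata import distributions


def installed_models_from_compat(compat):
    """Map installed package names that appear in `compat` to their versions.

    `compat` is assumed to be a dict keyed by model package name,
    e.g. {"en_core_web_sm": ["2.3.1", "2.3.0"]}.
    """
    installed = {}
    for dist in distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:
            # Distribution names may use hyphens; model packages use underscores
            installed[name.replace("-", "_").lower()] = dist.version
    return {name: ver for name, ver in installed.items() if name in compat}


# Example with a hand-written table instead of a network call
print(installed_models_from_compat({"en_core_web_sm": ["2.3.1"]}))
```

Custom models will not show up this way, since they are not listed in the official compatibility table.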
I'm interested in this issue because a lot of end users have problems with spaCy models not being linked, or being installed with the wrong Python.
From reading validate.py, it looks like it could return the printed variables (as spacy.cli.info does). But for programmatic use, I think the exits=0 should not exit the program.
I wrote some functions to list the available models (those that can be imported) and the linked ones.
I took a different approach from spacy.cli.validate: I scan through the Python path looking for spaCy models (relying on the meta.json file).
The use case is telling users to link unlinked models.
import json
import os
import sys

import spacy


def list_linked_models():
    """Read SPACY/data and return a dictionary {model_path: link_name}."""
    spacy_data = os.path.join(spacy.info(silent=True)['Location'], 'data')
    linked = [os.path.join(spacy_data, d) for d in os.listdir(spacy_data)]
    linked = {os.readlink(d): os.path.basename(d) for d in linked if os.path.islink(d)}
    return linked


def list_available_models():
    """Scan sys.path to find spaCy models."""
    models = []
    # For each directory on the import path
    paths = [p for p in sys.path if os.path.isdir(p)]
    for site_package_dir in paths:
        # For each module
        modules = [os.path.join(site_package_dir, m) for m in os.listdir(site_package_dir)]
        modules = [m for m in modules if os.path.isdir(m)]
        for module_dir in modules:
            if 'meta.json' in os.listdir(module_dir):
                # Ensure the package we're in is a spaCy model
                meta_path = os.path.join(module_dir, 'meta.json')
                with open(meta_path) as f:
                    meta = json.load(f)
                if meta.get('parent_package', '') == 'spacy':
                    models.append(module_dir)
    return models


def get_spacy_models(filt=None):
    """Return a dictionary {model_path: link_name or None}."""
    linked = list_linked_models()
    models = list_available_models()
    # Give `models` the same shape as `linked` (a dictionary {model_path: None})
    models = {m: None for m in models}
    # Replace None with `link_name` where a link exists
    models.update(linked)
    if filt:
        # Hack to filter models for a specific language: spaCy's model
        # names generally begin with the alpha-2 language code
        models = {m: l for m, l in models.items() if os.path.basename(m)[:2] == filt}
    return models
This feature is already implemented on develop and will be available in v3 🙂 You'll be able to call into util.get_installed_models() to get a list of all installed models, and the validate command will also report this information.
Under the hood it works like this: model packages will "advertise" themselves via a Python entry point that spaCy can read from. So even without loading anything, spaCy will know about all packages that expose an entry point for spacy_models. This will work for both official model packages, as well as any custom model packages created with spacy package.
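To illustrate the entry point mechanism described above, here is a small sketch that reads the spacy_models entry point group directly with the standard library. This is not spaCy's own implementation, only the same idea; the function name entry_point_names is made up, and the hasattr check is there because importlib.metadata.entry_points() returns different types across Python versions.

```python
from importlib.metadata import entry_points


def entry_point_names(group="spacy_models"):
    """Return the sorted names of all entry points advertised under `group`.

    Nothing is imported or loaded; only installed package metadata is read.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable collection of entry points
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict mapping group name to entry points
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


print(entry_point_names())
```

If no installed package exposes an entry point in that group, the result is simply an empty list.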
import subprocess
import sys

import spacy

spacy.prefer_gpu()


def install_spacy_required_packages():
    packages = ['en', 'en_core_web_sm']
    for package_name in packages:
        # Download only the packages that are not installed yet
        if not spacy.util.is_package(package_name):
            subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])


install_spacy_required_packages()
import contextlib
import io
import subprocess
import sys

import spacy
import spacy.cli

spacy.prefer_gpu()


def install_spacy_required_packages():
    packages = ['en', 'en_core_web_sm']
    # validate() prints its table instead of returning it, so capture stdout
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        spacy.cli.validate()
    spacy_validation_log = buffer.getvalue()
    print(spacy_validation_log)
    for package_name in packages:
        # A name surrounded by spaces in the printed table counts as installed
        if f' {package_name} ' not in spacy_validation_log:
            subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])


install_spacy_required_packages()
Instead of using the command line, you could do something like this:
import spacy
spacy_model_name = 'de_core_news_sm'
if not spacy.util.is_package(spacy_model_name):
    spacy.cli.download(spacy_model_name)
nlp = spacy.load(spacy_model_name)
spacy.util.is_package(spacy_model_name)
If I'm not mistaken, "spacy.util.is_package(spacy_model_name)" raises an exception if the model is not installed.
I resolved this problem by following the official documentation: I added the model to requirements.txt.
spacy>=2.2.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz#egg=en_core_web_sm
Regards.