Spacy: Function to list installed models or to check if a model is installed

Created on 5 Nov 2019 · 7 Comments · Source: explosion/spaCy

Feature description

It would be great if spaCy had functions to list all installed models and to check if a particular model is installed.

My use case: when the user of my script specifies a model that spacy.load doesn't find, I try to install the model via the cli module. I am using multiprocessing, and since I do not want every process to download the model, I run this code before I fire up the pool. As it stands, however, I load the model twice (once in the main process and once in the worker processes).

It would be much nicer if I could just check in the main process if the model is installed, download it if not, and only load the model in the worker processes. Alternatively, I would argue that just listing the models available is a useful feature.

I know that the cli module has some functions that do something similar; validate(), for instance, lists the models. However, it prints the results to stdout instead of returning data structures we could work with in Python. I believe that separating the model and the view would make for a better user (programmer) experience.
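
The workflow described above (check once in the main process, download only if missing, load only in the workers) could look roughly like the sketch below. The names ensure_model and init_worker are made up for illustration, and the check, download and load steps are passed in as callables so the sketch runs without spaCy installed; with spaCy you would presumably pass spacy.util.is_package, spacy.cli.download and spacy.load.

```python
from typing import Callable, Optional

_NLP: Optional[object] = None  # per-process model handle, set by init_worker


def ensure_model(name: str,
                 is_installed: Callable[[str], bool],
                 download: Callable[[str], None]) -> bool:
    """Download `name` only if it is missing; return True if a download ran."""
    if is_installed(name):
        return False
    download(name)
    return True


def init_worker(name: str, load: Callable[[str], object]) -> None:
    """Pool initializer: each worker loads the already-installed model once."""
    global _NLP
    _NLP = load(name)


# Hypothetical wiring in the main process (assumes spaCy is importable):
#   ensure_model(name, spacy.util.is_package, spacy.cli.download)
#   pool = multiprocessing.Pool(initializer=init_worker,
#                               initargs=(name, spacy.load))
```

The main process never loads the model here; it only checks and, if needed, downloads.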

enhancement models usage

All 7 comments

spaCy models are regular Python packages, so assuming you're not relying on shortcut links (which I wouldn't recommend for your use case), you can use the same methods you'd use to check if any other packages are installed (see this Stack Overflow thread for examples).
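
One generic way to do that check with only the standard library is importlib.util.find_spec, which looks a top-level package up on the import path without importing it. This is a minimal sketch; the helper name is_installed is hypothetical and not part of spaCy.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    # find_spec returns None when no such top-level module/package exists.
    return importlib.util.find_spec(package_name) is not None
```

For a model package you would call it with the package name, for example is_installed("en_core_web_sm").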

spaCy also has an internal utility function is_package:

import spacy

assert spacy.util.is_package("en_core_web_sm")
assert not spacy.util.is_package("cdkfodfkoskd")

Listing all models is a bit trickier, because you can't necessarily know whether an installed package is a spaCy model or not (especially not if you're working with custom models). One thing you could do is fetch the compatibility table, get the current spaCy version (spacy.__version__) and then check which installed packages are in the keys.

The download command has a helper for this that you could experiment with:

from spacy.cli.download import get_compatibility
compat_current_version = get_compatibility()
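
The comparison step itself (which installed packages appear among the table's keys) can be sketched independently of how the table is fetched. The sketch assumes the table is a mapping whose keys are model names, as described above; the function names are made up for illustration, and hyphens are normalized to underscores because distribution names may be reported either way.

```python
from importlib import metadata
from typing import Iterable, List, Mapping, Optional


def _normalize(name: str) -> str:
    """Compare names case-insensitively, treating '-' and '_' as equal."""
    return name.lower().replace("-", "_")


def installed_distribution_names() -> List[str]:
    """Names of all distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            names.append(name)
    return names


def installed_models(compat_table: Mapping[str, object],
                     installed: Optional[Iterable[str]] = None) -> List[str]:
    """Return the installed package names that are keys of `compat_table`."""
    if installed is None:
        installed = installed_distribution_names()
    known = {_normalize(key) for key in compat_table}
    return sorted({_normalize(name) for name in installed} & known)
```

As noted above, this only finds models that are listed in the table, so custom model packages would be missed.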

I'm interested in this issue, because a lot of end users have problems with spaCy models not being linked, or being installed with the wrong Python.

From reading validate.py, it looks like it could return the variables it prints (as spacy.cli.info does). But for programmatic use, I think the exits=0 should not exit the program.

I made some functions to list the available models (those that can be imported) and the linked ones.
I took a different approach from spacy.cli.validate: I scan through the Python path looking for spaCy models (relying on the meta.json file).

The use case is telling users to link unlinked models.

import json
import os
import sys

import spacy


def list_linked_models():
    """ Read SPACY/data and return a dictionary {model_path: link_name} """
    spacy_data = os.path.join(spacy.info(silent=True)['Location'], 'data')
    linked = [os.path.join(spacy_data, d) for d in os.listdir(spacy_data)]
    linked = {os.readlink(d): os.path.basename(d) for d in linked if os.path.islink(d)}
    return linked


def list_available_models():
    """ Scan the directories on sys.path to find spaCy models """
    models = []
    # For each directory in PYTHONPATH
    paths = [p for p in sys.path if os.path.isdir(p)]
    for site_package_dir in paths:
        # For each module
        modules = [os.path.join(site_package_dir, m) for m in os.listdir(site_package_dir)]
        modules = [m for m in modules if os.path.isdir(m)]
        for module_dir in modules:
            if 'meta.json' in os.listdir(module_dir):
                # Ensure the package we're in is a spacy model
                meta_path = os.path.join(module_dir, 'meta.json')
                with open(meta_path) as f:
                    meta = json.load(f)
                if meta.get('parent_package', '') == 'spacy':
                    models.append(module_dir)
    return models


def get_spacy_models(filt=None):
    """ Return a dictionary {model_path: link_name or None} """
    linked = list_linked_models()
    models = list_available_models()
    # Give `models` the same shape as `linked`: a dictionary {model_path: None}
    models = {m: None for m in models}
    # Replace None with `link_name` where a link exists
    models.update(linked)
    if filt:
        # Hack to keep only the models of one language: spaCy model names
        # generally begin with the two-letter (alpha-2) language code
        models = {m: l for m, l in models.items() if os.path.basename(m)[:2] == filt}
    return models

This feature is already implemented on develop and will be available in v3 🙂 You'll be able to call into util.get_installed_models() to get a list of all installed models, and the validate command will also report this information.

Under the hood it works like this: model packages will "advertise" themselves via a Python entry point that spaCy can read from. So even without loading anything, spaCy will know about all packages that expose an entry point for spacy_models. This will work for both official model packages, as well as any custom model packages created with spacy package.
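
For illustration, here is a minimal sketch of how entry points in a given group can be read with the standard library. The group name spacy_models is taken from the comment above; the helper list_entry_point_names is hypothetical and is not spaCy's own implementation. The branch on select is there because the return type of importlib.metadata.entry_points() differs between Python versions.

```python
from importlib import metadata
from typing import List


def list_entry_point_names(group: str) -> List[str]:
    """Return the sorted names of all entry points registered under `group`."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable interface
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict mapping group name -> entry points
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


# Nothing is imported or loaded here; only package metadata is read.
model_names = list_entry_point_names("spacy_models")
```

An empty list simply means that no installed package advertises itself under that group.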

import spacy
spacy.prefer_gpu()

import subprocess
import sys

def install_spacy_required_packages():
    packages = ['en', 'en_core_web_sm']

    for package_name in packages:
        if not spacy.util.is_package(package_name):
            subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])

install_spacy_required_packages()

# Variant that parses the output of spacy.cli.validate() instead of calling is_package
import spacy
spacy.prefer_gpu()
import contextlib
import io
import subprocess
import sys

def install_spacy_required_packages():
    packages = ['en', 'en_core_web_sm']

    # Capture what validate() prints; redirect_stdout restores sys.stdout
    # even if validate() raises
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        spacy.cli.validate()
    spacy_validation_log = buffer.getvalue()

    print(spacy_validation_log)

    for package_name in packages:
        # Download the package if its name does not appear in the validate output
        if f" {package_name} " not in spacy_validation_log:
            subprocess.check_call([sys.executable, "-m", "spacy", "download", package_name])

install_spacy_required_packages()

Instead of using the command line, you could do something like this:

import spacy

spacy_model_name = 'de_core_news_sm'
if not spacy.util.is_package(spacy_model_name):
    spacy.cli.download(spacy_model_name)
nlp = spacy.load(spacy_model_name)

spacy.util.is_package(spacy_model_name)

If I'm not mistaken, "spacy.util.is_package(spacy_model_name)" raises an exception if the model is not installed.
I resolved this problem by following the official documentation: I added the following to requirements.txt

spacy>=2.2.0,<3.0.0
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz#egg=en_core_web_sm

Regards.
