I use spaCy as a dependency for my own Python package (in case anyone's interested: lytspel). So I've added 'spacy' to the install_requires section of my setup.py. This basically works, but there are two issues:
(1) When I install my package in a venv (virtual environment), installation is somewhat slow and the venv takes up considerable space, mainly because spaCy itself is so big: lib/python*/site-packages/spacy takes up 402 MB, nearly all of it in spacy/lang (386 MB). spacy/lang/en in turn has only 5 MB, and I don't need support for any other language.
(2) After installing spaCy, it's still necessary to download a suitable language model: python3 -m spacy download en. I found out that it's also possible to call this from within a Python script:
```python
from spacy.cli import download
download('en')
```
But I cannot do this during setup since the wheel format (now standard on PyPI) doesn't allow execution of arbitrary code during installation. So instead, in my script, I check whether spacy.load('en') fails, and if it does, I download the model. This works, but it's clumsy and in some cases there may be permission issues.
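A minimal sketch of that fallback (assuming spaCy 2.x, where loading a missing model raises OSError):

```python
import spacy

def load_english():
    """Return the English pipeline, downloading the model on first use."""
    try:
        return spacy.load('en')
    except OSError:
        # The model isn't installed yet; fetch it via spaCy's CLI helper.
        # Note: this can hit permission errors if site-packages isn't writable.
        from spacy.cli import download
        download('en')
        return spacy.load('en')
```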
So my dream would be that I could configure my dependencies to automatically install spaCy for English, including the language model, but not any other languages. This could be accomplished if spaCy had multiple packages on PyPI:
- A compact core package without the contents of the lang subdirectory.
- Per-language packages containing the language data (e.g. lang/en), which could also pull in the default model (en_core_web_sm), making a separate download step unnecessary.

This would make spaCy a much more compact and elegant dependency.
Thanks for the suggestions! The package size is definitely on our radar and something we want to work on – see #3258 for details. Zipping the language data (and transitioning away from lookup tables to rule-based and statistical lemmatizers) should hopefully resolve this.
> (2) After installing spaCy, it's still necessary to download a suitable language model: python3 -m spacy download en. I found out that it's also possible to call this from within a Python script:
>
> But I cannot do this during setup since the wheel format (now standard on PyPI) doesn't allow execution of arbitrary code during installation.
Is there a specific reason you can't install the model directly via pip and its URL? In a real-life production scenario, you probably wouldn't want to be installing 'en' anyway, but rather download and install an exact version of a model. See the section on production use for more details and examples. So in your requirements.txt, you could specify something like:
```
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm
```
Model packages can also get very large (our largest one is currently ~1GB, but we might want to distribute larger ones in the future). They also mostly consist of binary data, which makes them pretty unsuitable for uploading to PyPI. I'm not even 100% sure PyPI would allow packages like this. This is why we've been providing the models attached to GitHub releases as pip-installable packages, with a little convenience wrapper around them available as the spacy download command.
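For illustration, the download command essentially boils down to a direct pip install of the release artifact, using the same URL pattern as above:

```
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
```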
> This could be accomplished if spaCy had multiple packages on PyPI
While this sounds nice in theory, it'd require us to upload and maintain at least 66 (!) separate Python packages: 49 languages, 16 models and core. Not to mention all the other packages we publish that spaCy or spaCy's dependencies depend on. Even if we managed to set up the 50+ library packages as some kind of monorepo thing, it'd still introduce a massive additional maintenance burden that's just not feasible.
@ines Thanks for your quick and insightful response! I wasn't aware of #3258. If that one's implemented, the size of the package might no longer be an issue. Meanwhile, I also looked more closely at the contents of the lang directory, and there are only 11 languages (subdirectories) with more than 5 MB:
```
MB  lang
70  tr
44  nb
43  pt
38  da
37  sv
30  ca
27  fr
25  es
21  de
17  ro
17  it
```
Assuming that only the top five of these (tr through sv) were moved into separate packages, the lang directory would shrink from 386 to 154 MB. A considerable improvement, and six separate packages on PyPI should still be quite manageable. But maybe #3258 is a full solution that makes this unnecessary.
As for installing the model, the problem is that I want my package to be installable in the usual way, by typing pip[3] install <packagename> -- just like spacy and all the thousands of other packages on PyPI. Hence I cannot use a requirements.txt file: that would force users to first download that file and then call pip install -r /path/to/requirements.txt instead of pip install <packagename>, which is too complicated and too much of a deviation from the standard procedure to be feasible. So I need to be able to add the model to my setup.py, but I haven't found a way to do so. Just listing https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm-2.1.0 in the install_requires argument gives an error when trying to build the wheel ("'install_requires' must be a string or list of strings containing valid project/version requirement specifiers").
There is also a dependency_links argument which should, in theory, cover such cases, but I haven't been able to get it to work. If I write
```python
dependency_links=[
    'https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz#egg=en_core_web_sm-2.1.0',
],
install_requires=[
    ...
    'spacy >= 2.0.0',
    'en_core_web_sm >= 2.1.0',
],
```
I get an error when calling pip install <path-to-wheel>.whl:
```
Could not find a version that satisfies the requirement en-core-web-sm>=2.1.0 (from lytspel==0.9.1) (from versions: )
No matching distribution found for en-core-web-sm>=2.1.0 (from lytspel==0.9.1)
```
I also tried omitting the version specifications, but still get "No matching distribution found". If I omit the 'en_core_web_sm' entry in install_requires altogether, the dependency_links entry is simply ignored.
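In case it helps anyone reading along: as far as I can tell, this is expected, because pip only ever honoured dependency_links behind an opt-in flag, which was deprecated for a long time and finally removed in pip 19.0. So a plain pip install will always ignore it:

```
# Only older pip versions processed dependency_links, and only with this flag
# (deprecated, then removed entirely in pip 19.0):
pip install --process-dependency-links <path-to-wheel>.whl
```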
If you or anyone else knows how to make this work, this part of my problem will be solved (without requiring any changes in how spaCy is packaged). But I've been unable to find a solution, and sadly setuptools isn't very well documented when it comes to such things.
Hi,
I think this is an awesome proposal. It's really odd to install models from outside the usual Python packaging tools.
In my case I'm using pipenv. I would love to do something like celery does for its "extras":
pip install "celery[redis]"
pip install "celery[librabbitmq,redis,auth,msgpack]"
They are called bundles. For spaCy, it could look like this from the CLI:
```
pip install spacy[es_core_news_md]
```
In pipenv:
```toml
[packages]
spacy = {extras = ["es_core_news_md"], version = "*"}
```
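For reference, such extras are declared on the package side via setuptools' extras_require. A minimal sketch of what that could look like in spaCy's setup.py (the mapping and version pins here are hypothetical):

```python
# Hypothetical sketch: each extra maps to a pip-installable model package.
extras_require={
    'es_core_news_md': ['es_core_news_md >= 2.1.0'],
    'en_core_web_sm': ['en_core_web_sm >= 2.1.0'],
},
```

The catch is that each extra's requirement still has to be resolvable from a package index, which leads straight back to the hosting problem discussed above.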
To follow this up, I meanwhile found out that @ines's suggestion to add a GitHub-hosted model file as a dependency will never work for packages hosted on PyPI, since pip forbids it. See the answer to this question, which quotes pip's release notes:
> As a security measure, pip will raise an exception when installing packages from PyPI if those packages depend on packages not also hosted on PyPI. In the future, PyPI will block uploading packages with such external URL dependencies directly.
I must say that I don't understand this decision -- if you don't trust a package, you should not install it, and if you do trust it, why not allow it to specify its dependencies as needed? Be that as it may, the decision seems final, so I see only two possibilities:
Either the spaCy team makes at least the most important language models available on PyPI. Size should not be an issue, at least for the standard 'en' model, which has only 10 MB. And I'd think that if, say, the 5 or 10 most popular models were packaged in that way, that would already help most people who want to package spaCy-dependent code.
Or else PyPI-installable packages depending on spaCy will forever need cumbersome workarounds (such as the one I used).
So, there are basically two issues here:
The large size of the current installation. We're working on this: I want to get rid of all these data files, especially once we have proper statistical lemmatization working more consistently. We don't want to fragment into a bunch of packages for this, as that would be really difficult to handle. We just want to remove this data and make the installation smaller.
Packaging the models for pip. I do see the argument for this, but the problem is we can't have all of the models on pip, as some are too large. We're reluctant to have two different mechanisms, but it might be worth it on balance. We're considering it.
> Packaging the models for pip. I do see the argument for this, but the problem is we can't have all of the models on pip, as some are too large. We're reluctant to have two different mechanisms, but it might be worth it on balance. We're considering it.
You can ask PyPI for an exception to raise the file size limit. That is what PyTorch and TensorFlow had to do. Open an issue here to request an exception: https://github.com/pypa/packaging-problems
@mitar Should they give us an exception, though? I'm really not sure they should --- binary data dependencies aren't primarily what pip is for. For TensorFlow and PyTorch, there's no reasonable alternative, so the exception is necessary. I'm not sure that's true here.
You are right. They are probably not bundling large pretrained models with them. Or are they?
@honnibal:
> For TensorFlow and PyTorch, there's no reasonable alternative, so the exception is necessary. I'm not sure that's true here.
But is there a better alternative? My impression is that, should one exist, it hasn't yet been found.
Quick update: For the upcoming v3, we _really_ tried to come up with a solution for this so we could take advantage of some of Python's native features for distributing and managing packages. However, we didn't find a viable solution, mainly for the following reasons:
File size and official PyPI index. The default file size limit for packages uploaded to PyPI is 60 MB. Many of our model packages are going to exceed this limit, many of them even by a lot. Uploading larger packages is possible in theory, but it requires submitting a request for each package (we recently had to do this for spacy-lookups-data), which is then manually approved by a PyPA maintainer. This is a burden we do not want to impose on the PyPA maintainers, and it's not viable as a future-proof distribution strategy.
Index servers, resolving packages and security concerns. While it'd be fairly easy to set up our own index server synced with the models repo release artifacts, this would open up a whole other can of worms. pip gives you two options for customising the index URL that packages are downloaded from (example invocations after the list below):
- --extra-index-url: This adds another index URL that packages are downloaded from, if they're not available from the default PyPI index. For the models, this would mean that pip would first check if spacy_model_en_core_web_sm is available on PyPI, and if not, use our models index specified via --extra-index-url models.spacy.io or something like that. The problem here is that we'd need to namesquat _all our model package names_ on the public PyPI and make them raise an error on install, telling people to use the correct index. Otherwise, someone else could squat them and thousands of download requests would suddenly be routed to _their_ (potentially malicious) package instead. spaCy is downloaded about 20k-50k times a day, so that's a significant attack vector.
- --index-url: Change the _default_ index URL. This means we'd have to ask all our users to change their default pip index to _our_ models index and use the main PyPI index as an extra. This is very confusing, inconvenient and not something anyone should be doing.
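To make the two options above concrete, here's roughly what the invocations would look like (the models index URL and package names are hypothetical, taken from the examples above):

```
# Option 1: PyPI stays the default, our models index is a fallback
pip install spacy_model_en_core_web_sm --extra-index-url https://models.spacy.io/

# Option 2: our models index becomes the default, PyPI the extra
pip install en_core_web_sm --index-url https://models.spacy.io/ --extra-index-url https://pypi.org/simple/
```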
One enhancement proposal that would solve the main problems listed above is namespace support in PyPI, similar to how npm does it. You can read more about the idea here: https://discuss.python.org/t/namespace-support-in-pypi/1609. We could then distribute our models under something like @spacy/en_core_web_sm, via our own index URL, while also having control over the @spacy namespace of the official PyPI index to prevent anyone else from claiming packages under that namespace. However, namespace support is only a proposal, and the latest update is this:
> As far as I am aware there has been no progress. It would require someone to write and champion a PEP, and then someone to implement it assuming it got accepted.
We'd love to revisit this topic in the future, especially if namespaces were to become an official feature. But for now, I'll close this issue, because there's really nothing we can do, unfortunately.