Right now, my package has spaCy as a requirement in requirements.txt, and python -m spacy download en is run as part of the installation process.
According to the documentation, models can be listed in requirements.txt, but no example is given. How can I add a requirements.txt entry to just install the default English models?
And is this enough, or will I then also need to run python -m spacy link or something?
Thanks – and sorry about the confusion. I agree, this should definitely be more clear!
The standard way of installing packages specified in the requirements.txt assumes they're downloadable via a PyPI server (usually pypi.python.org). While model packages are valid pip packages, they can't be uploaded to the official PyPI directory, as they don't meet the requirements (they're too large and consist of mostly binary data). However, a lot of companies run their own internal installations of PyPI – in that case, you can simply upload the model there and point your pip at the internal server.
Alternatively, pip also lets you specify URLs and other sources in the requirements – see here for more info and examples. So instead of only the package name, you can add the URLs of the models you want to install.
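As a rough illustration of what such an entry looks like, here is a small sketch that prints the lines you could paste into requirements.txt. The URL pattern is inferred from the model release URLs quoted in this thread, and model_requirement is a hypothetical helper, not part of spaCy or pip.

```python
# URL pattern inferred from the release URLs quoted in this thread (assumption).
RELEASE_URL = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "{name}-{version}/{name}-{version}.tar.gz"
)


def model_requirement(name, version):
    """Return a requirements.txt line that installs a model from its release URL.

    Hypothetical helper for illustration only.
    """
    return RELEASE_URL.format(name=name, version=version)


if __name__ == "__main__":
    # Each printed line is one requirements.txt entry.
    print("spacy")
    print(model_requirement("en_core_web_sm", "1.2.0"))
```

In practice you would just write the URL directly into requirements.txt, on its own line, next to the regular spacy entry.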
This won't run any spaCy internals like download (which is mostly a convenience wrapper for pip's installer) or link. So you'll either have to create the symlink yourself afterwards, or load the model by importing the package and calling its load() method with no arguments:
import en_core_web_sm
nlp = en_core_web_sm.load()
In general, we do recommend this syntax for larger code bases because it doesn't depend on symlinks, and is cleaner and more "native" – for example, if a model package is not installed, Python will raise an ImportError immediately, instead of failing somewhere down the line when calling spacy.load().
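If you want the failure to be a bit more descriptive, a minimal sketch along these lines could wrap the import. load_model is a hypothetical helper name, and it only assumes that the model package exposes load(), as described above.

```python
import importlib


def load_model(package_name="en_core_web_sm"):
    """Import an installed model package and return its loaded pipeline.

    Hypothetical helper: fails right away with a readable message if the
    model package is missing from the environment.
    """
    try:
        model_package = importlib.import_module(package_name)
    except ImportError as err:
        raise ImportError(
            "Model package '{}' is not installed; add its URL to "
            "requirements.txt and reinstall.".format(package_name)
        ) from err
    # Same call as en_core_web_sm.load() in the snippet above.
    return model_package.load()
```

Calling nlp = load_model() once at application start-up means a missing model shows up at launch, not on the first request that needs it.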
So if specifying models in your requirements.txt is useful for your project, there's a high chance that native model imports will actually be more convenient as well. I hope this helps – will definitely add a section about this to the docs as well 👍
TL;DR Adding the model URL instead of the package name to your requirements.txt and importing the model as a package in your code should do the trick.
Thank you, this is very clear!
So is en_core_web_sm the same package I get when I run python -m spacy download en?
Yes, en and all other shortcuts download the default models, usually the most compact ones – in this case en_core_web_sm. (In the list of available models, the default models are the ones marked with a star. Internally, spaCy resolves the shortcuts by looking them up in this table.)
Very clear indeed! Now I'm wondering if there's a simple equivalent for setup.py
We used a call to spacy.en.download in our setup.py to install the required modules, I believe the practice is deprecated or frowned upon.
@lalvarezguillen I think you might be looking for a solution like this: https://stackoverflow.com/a/3481388/6400719
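Separately from the linked answer, newer pip versions also accept direct URL references ("name @ url", PEP 508) in install_requires. A minimal sketch, assuming your pip supports this; my_package and the version number are placeholders, the model URL is the one quoted elsewhere in this thread, and the __main__ guard is only there so the file can be imported without running setup():

```python
# setup.py (sketch, not a drop-in file)
MODEL_URL = (
    "https://github.com/explosion/spacy-models/releases/download/"
    "en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz"
)

INSTALL_REQUIRES = [
    "spacy",
    # PEP 508 direct reference: package name, then the archive URL.
    "en_core_web_sm @ " + MODEL_URL,
]

if __name__ == "__main__":
    from setuptools import setup

    setup(
        name="my_package",  # placeholder name
        version="0.1.0",  # placeholder version
        install_requires=INSTALL_REQUIRES,
    )
```

As far as I know, PyPI rejects uploads whose dependencies are direct URLs, so this is mainly an option for internal packages.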
We used a call to spacy.en.download in our setup.py to install the required modules, I believe the practice is deprecated or frowned upon.
In theory, you could still use spacy.cli.download for this (spacy.en.download is deprecated since v1.7). I wouldn't say that this practice is frowned upon, but we definitely wouldn't recommend it for production use. If you know which model your application needs, you shouldn't have to do an additional roundtrip and depend on spaCy's downloader just to fetch and pip install a package from a URL. (This was also part of the reason we decided to publish the models on GitHub and not just route all requests via our server. Especially since there's not just one "the model" anymore, but several different ones for different languages and use cases.)
Btw, in spaCy v2.x, another option could be to simply package the models with your application. The new alpha models are only 12 and 15 MB – about the size of the spaCy package, and probably smaller than many other random pip packages.
Edit: Just to clarify, this approach would be mostly for internal production use – not if you're actually distributing your package on PyPI or GitHub. While the model licenses (CC BY-SA) allow redistribution, we don't want to encourage people to reupload and mirror the official spaCy models. After all, they're just binary data and we want to make sure that there's only one official distribution. This makes things safer and less confusing for everyone.
Addressed in 7c4bf99 and live here!
For Python newbies like me, here is how to add a model to a Pipfile:
[packages]
spacy = "*"
de_core_news_sm = { file = 'https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-2.0.0/de_core_news_sm-2.0.0.tar.gz' }
Not sure why but I added the model to my Pipfile, updated the lock file, but spacy doesn't appear to be working. Right now my Pipfile looks like this:
[packages]
spacy = "*"
gunicorn = "*"
flask = "*"
"en_core_web_sm" = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-1.2.0/en_core_web_sm-1.2.0.tar.gz"}
and my import and package loading looks like this:
import en_core_web_sm
print("Loading spacy...")
nlp = en_core_web_sm.load()
print(nlp)
print("Spacy loaded.")
my print statements look like this:
11:45:02 web.1 | Loading spacy something...
11:45:03 web.1 | <spacy.lang.en.English object at 0x10d676e80>
11:45:03 web.1 | Spacy loaded.
but when I actually process text or do anything with the nlp object, nothing happens. It might be tokenizing the text, but not much else. If I pass text in with doc = nlp(text) and run print(doc), I get the text back. But so far any attempts at looking at doc.ents have failed: printing doc.ents returns an empty set. I should mention that this whole thing works when I don't run it through Heroku. If I run it in the local environment using python app.py, it fires up no problem and processes text. However, when I run heroku local web or git push heroku master, I get diddly, despite the fact that it appears to be loading the spaCy model. Any ideas as to what I'm doing wrong?
(Apologies if this is in the wrong place or I should have made a new issue. If so let me know and I'll do so.)