Is it possible to download the data fetched by spacy.en.download into a particular directory and have spaCy load it from there? Would it be bad practice to commit this data to source control?
We are deploying an application to Heroku that relies on spaCy for some NLP work. We are having trouble getting spaCy to work once it is deployed, because the spaCy data needs to be downloaded. We tried opening a shell on Heroku with heroku run bash and then running python -m spacy.en.download all on the remote machine. We confirmed that this installed the data by starting a Python interpreter there and instantiating English() successfully. However, the web app still fails. The Heroku logs show that spaCy throws an exception telling us to run python -m spacy.en.download.
You can make sure the data is downloaded programmatically, by adding this inside your launch script:
import sputnik
import spacy.about
package = sputnik.install('spacy', spacy.about.__version__, spacy.about.__default_model__)
This will install the data the same way that the python -m spacy.en.download command does, which executes spacy.en.download.main(), which makes the call above to our data package manager, sputnik.
You can use sputnik to control where the data is installed and where it's loaded from. I'm wondering whether that's the real issue here though? It sounds to me like your problem might be that when your app is launched Heroku has wiped the disk state and started fresh. Hopefully if you can ensure the data is downloaded as part of your launch process, you can get it running.
Be aware that spaCy requires a fair bit of memory: 2 or 3 GB. I'm not sure it fits within Heroku's free tier, if that's what you're using.
Great, I'll try that.
You are right, this may not be an issue. We just weren't sure how/where to ask for help. I'll close it :-)
I have the same problem! Can you elaborate on what you mean by the launch script? Where exactly should that code go?
I would also be happy to hear from anyone who has figured out how to load the models on Heroku or a similar platform.
@perdix This is addressed and explained here: https://github.com/explosion/spaCy/issues/1099#issuecomment-306053749
Great, did not find that before. Thanks a lot!
I love spaCy's nlp() to death, but I would have to run on the very expensive performance tier to load it. Are there any workarounds or upcoming developments? What if I only want to use the parser and leave word vectors to another package? I'd be happy with even a somewhat poorly performing parser if it still had both tag_ and dep_ capabilities but didn't need so much RAM at startup. Perhaps I'm asking for a miracle.
In v1.8 the en_core_web_sm model is a 50 MB download, while the full model is 500 MB. v2 cuts the model down to 15 MB and uses the context to help assign vectors, so the word vectors table is very small. You can install v2 with pip install spacy-nightly.
Thank you! v2.0.0 solved my problem (mostly, see below). Love the size scaledown. I was getting memory errors or request timeout errors, but now the initial import is good to go on a single Heroku free tier dyno. Am I right that the leanest way to import the model into my script is like this:
from spacy import load
import en_core_web_sm
nlp = en_core_web_sm.load()
Or do I not even need the first line at all? Also, the only issue is that now that I'm using 2.0 I'm running into this bug: https://github.com/explosion/spaCy/issues/1242. But I'm sure you are on it.
You shouldn't need the first line. In fact, you can specify the model as a dependency in your requirements: pip will pull in spacy-nightly, as with any other dependency resolution.
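For a web process, one way to keep startup lean is to import the model package and call its load() only once, then reuse the result. The following is a minimal sketch under stated assumptions, not an official spaCy recipe: get_nlp is a hypothetical helper name, and the model package (en_core_web_sm here, as in the snippet above) is assumed to be installed and to expose load().

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def get_nlp(package_name="en_core_web_sm"):
    """Import the model package by name and load it once per process.

    get_nlp is a hypothetical helper, not part of spaCy. It assumes the
    named package is installed and exposes a load() function.
    """
    model_package = importlib.import_module(package_name)
    # lru_cache stores the result, so later calls reuse the same object
    # instead of loading the model again.
    return model_package.load()
```

Request handlers can then call get_nlp() freely; only the first call pays the loading cost.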
By the way, you might want to check that numpy has been linked to OpenBLAS on your server. Otherwise v2 will be quite slow. Another efficiency tip: in spaCy 2, it's important to use .pipe() if you have a batch of documents, so that the neural network can group the inputs.
If you're serving a single document per request, hopefully you don't need spaCy to be too fast. The network is usually much slower, anyway.
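To illustrate the .pipe() tip, here is a rough sketch of processing a batch of texts in one call rather than calling nlp() on each text separately. tag_batch is a hypothetical helper, and the batch size of 64 is an arbitrary assumption to tune for your workload; nlp is expected to be a loaded spaCy pipeline, for example the result of en_core_web_sm.load().

```python
def tag_batch(nlp, texts, batch_size=64):
    """Return (text, tag_, dep_) triples for every token of every text.

    tag_batch is a hypothetical helper. nlp is assumed to be a loaded
    spaCy pipeline; batch_size=64 is an arbitrary starting point.
    """
    results = []
    # nlp.pipe() consumes the texts as a stream and processes them in
    # batches, which lets the model group its inputs.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(token.text, token.tag_, token.dep_) for token in doc])
    return results
```

A call such as tag_batch(nlp, ["First text.", "Second text."]) returns one list of token triples per input text, in input order.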