Spacy: Deploying to Heroku

Created on 24 Mar 2016 · 11 Comments · Source: explosion/spaCy

Is it possible to download the data from running spacy.en.download in a particular directory and have spacy use that directory? Would it be bad practice to commit this data to source control?

Background Info:

We are deploying an application to Heroku that relies on Spacy for some NLP work. We are having some trouble getting Spacy to work properly once it is deployed to Heroku because the Spacy data needs to be downloaded. We have tried logging into bash on Heroku by running heroku run bash and then running python -m spacy.en.download all from the remote machine. We tested that this actually installed the data by running a python interpreter and we are able to instantiate English() properly. However, when we use our web app it fails. Heroku logs show that it is failing because Spacy throws an exception saying that we should run python -m spacy.en.download.

vijayv

👍1

All 11 comments

You can make sure the data is downloaded programmatically, by adding this inside your launch script:

import sputnik
import spacy.about

package = sputnik.install('spacy', spacy.about.__version__, spacy.about.__default_model__)

This will install the data the same way that the python -m spacy.en.download command does, which executes spacy.en.download.main(), which makes the call above to our data package manager, sputnik.

You can use sputnik to control where the data is installed and where it's loaded from. I'm wondering whether that's the real issue here though? It sounds to me like your problem might be that when your app is launched Heroku has wiped the disk state and started fresh. Hopefully if you can ensure the data is downloaded as part of your launch process, you can get it running.
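A minimal sketch of the "download as part of the launch process" idea, assuming the spaCy version discussed at this point in the thread (the `spacy.en.English` class and the `python -m spacy.en.download all` command). The helper names `load_english`, `download_english` and `ensure_model` are hypothetical and not part of spaCy. The approach is to try loading the model first and run the downloader only if that fails:

```python
import subprocess
import sys


def load_english():
    """Instantiate the English pipeline (old-style spacy.en import)."""
    from spacy.en import English  # imported lazily so this file imports without spaCy
    return English()


def download_english():
    """Run the command from this thread: python -m spacy.en.download all."""
    subprocess.check_call([sys.executable, "-m", "spacy.en.download", "all"])


def ensure_model(load=load_english, download=download_english):
    """Return a loaded model, downloading the data once if the first load fails."""
    try:
        return load()
    except Exception:
        # Assumption: a failed load means the data is missing, for example
        # because the app started from a fresh filesystem.
        download()
        return load()
```

Calling `nlp = ensure_model()` in the web process's own start-up code, rather than in a separate `heroku run bash` session, would mean the download happens in the same environment that later serves requests.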

Be aware that spaCy requires a fair bit of memory --- 2 or 3 gigs. I'm not sure it fits within Heroku's free tier, if that's what you're using.

syllog1sm on 24 Mar 2016

Great, I'll try that.

You are right, this may not be an issue. We just weren't sure how/where to ask for help. I'll close it :-)

vijayv on 25 Mar 2016

I have the same problem! Can you elaborate on what you mean by the launch script? Where exactly should I put that?

I would also be more than happy to hear from anyone else who has figured out how to load the models on Heroku or similar.

perdix on 22 Jul 2017

@perdix This is addressed and explained here: https://github.com/explosion/spaCy/issues/1099#issuecomment-306053749

Jeiwan on 23 Jul 2017

❤3

Great, did not find that before. Thanks a lot!

perdix on 23 Jul 2017

I love spaCy's nlp() to death, but I would have to run on the very expensive performance tier to load it. Are there any workarounds or upcoming developments? What if I only want to use the parser and leave word vectors to another package? I'd love even a somewhat poorly performing parser if it still had both tag_ and dep_ capabilities but didn't need so much RAM at startup. Perhaps I'm asking for a miracle.

superarius on 8 Aug 2017

In v1.8 the en_core_web_sm model is a 50 MB download, while the full model is 500 MB. v2 cuts the model down to 15 MB, and uses the context to help assign vectors, so the word vectors table is very small. You can install v2 with pip install spacy-nightly

honnibal on 8 Aug 2017

Thank you! v2.0.0 solved my problem (mostly, see below). Love the size scaledown. I was getting memory errors or request timeout errors but now the initial import is good to go on a single Heroku free tier dyno. Am I right that the leanest way to import the model into my script is like this:

from spacy import load
import en_core_web_sm
nlp = en_core_web_sm.load()

Or do I not even need the first line at all? Also, the only issue is that now that I'm using 2.0 I'm running into this bug: https://github.com/explosion/spaCy/issues/1242. But I'm sure you are on it.

superarius on 8 Aug 2017

You shouldn't need the first line. And in fact you can specify the model as the dependency in your requirements --- pip will pull in spacy-nightly, as with any other dependency resolution.
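As a rough sketch of the two-line version without `from spacy import load`, assuming the `en_core_web_sm` package is installed via pip as described above: the wrapper `get_nlp` below is a hypothetical helper, not a spaCy function. It loads the model on first use and then reuses the same object, so a small dyno keeps only one copy in memory per process.

```python
import functools


@functools.lru_cache(maxsize=1)
def get_nlp():
    """Load en_core_web_sm once per process and reuse the same object."""
    import en_core_web_sm  # the pip-installed model package; imported on first use
    return en_core_web_sm.load()
```

Request handlers would then call `get_nlp()` instead of loading the model themselves.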

honnibal on 8 Aug 2017

Btw you might want to check that numpy has been linked to OpenBLAS on your server. Otherwise v2 will be quite slow. Another efficiency tip: in spaCy 2 it's pretty important to use .pipe() if you have a batch of documents, so that the neural network can group the inputs.

If you're serving a single document per request, hopefully you don't need spaCy to be too fast. The network is usually much slower, anyway.
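The `.pipe()` advice can be sketched roughly as follows. This assumes `nlp` is an already loaded pipeline (for example from `en_core_web_sm.load()`); the function name `tag_and_parse` and the batch size of 50 are illustrative choices, not recommendations from the thread.

```python
def tag_and_parse(nlp, texts, batch_size=50):
    """Run texts through nlp.pipe() and collect (text, tag_, dep_) per token."""
    results = []
    # nlp.pipe() processes the texts in batches instead of one call per text,
    # which lets the model group its inputs.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        results.append([(tok.text, tok.tag_, tok.dep_) for tok in doc])
    return results
```

For a single document per request, a plain `nlp(text)` call is the simpler choice; batching only pays off when several texts are available at once.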

honnibal on 9 Aug 2017

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.