Spacy: Downloading a model has failed

Created on 23 Aug 2016 · 47 Comments · Source: explosion/spaCy

I'm having the exact same problem as #471. Maybe DNS got messed up again?
Thanks,

install

All 47 comments

ping index.spacy.io
ping: cannot resolve index.spacy.io: Unknown host

I'm having the same issue.

Same here. It was working last night, but I couldn't complete the download then for unrelated reasons. Tonight it's down.

Looks like a DNS issue:

25.0% No such domain (NXDOMAIN) at ns-416.awsdns-52.com (205.251.193.160)
While querying sputnik-production.elasticbeanstalk.com/IN/a
25.0% No such domain (NXDOMAIN) at ns-846.awsdns-41.net (205.251.195.78)
While querying sputnik-production.elasticbeanstalk.com/IN/a
25.0% No such domain (NXDOMAIN) at ns-1235.awsdns-26.org (205.251.196.211)
While querying sputnik-production.elasticbeanstalk.com/IN/a
25.0% No such domain (NXDOMAIN) at ns-1537.awsdns-00.co.uk (205.251.198.1)
While querying sputnik-production.elasticbeanstalk.com/IN/a
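
To run the same check from a script rather than with ping, here is a minimal sketch using only Python's standard library. The helper name `can_resolve` and its injectable `resolver` argument are illustrative assumptions, not part of spaCy or sputnik.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return the resolved IPv4 address, or None if the lookup fails.

    `resolver` is injectable so the function can be exercised without
    network access; by default it performs a real DNS lookup.
    """
    try:
        return resolver(hostname)
    except socket.gaierror:
        # Raised for address-resolution failures such as an unknown host.
        return None
```

Calling `can_resolve("index.spacy.io")` would be expected to return None for as long as the host cannot be resolved, and an IP address string once DNS is working again.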

Damn. Sorry about this folks.

We need to get a new s3 bucket up, with the data models. Unfortunately I don't have a dump of the old s3 bucket, and don't have access to the AWS account. So this may take some forensics. I would appreciate help with this, if you have time.

Currently we need:

1) A workaround, so that even while index.spacy.io is down, there's an easy way to install data from disk. This may take some digging around in the sputnik source code.

2) Piece the data back together, in the form it was originally on the s3 bucket (compressed in the right way, with the right file names)

3) Upload the data to an s3 bucket, repoint index.spacy.io to it.

4) Documentation of the solution

5) Appointment of someone else to help manage the data asset service.

Numbers 4 and 5 are not immediately time-critical, but they are part of a larger agenda of improving the project's "bus factor": we need a better setup, so that if something goes down or there's a problem while I'm, say, asleep or away, other people are able to resolve it.

Current forensics:

The python -m spacy.en.download script ends up executing the following two calls (variable values fished out of spacy/about.py, if you're retracing this yourself)

package = sputnik.install('spacy', '0.101.0', 'en>=1.1.0,<1.2.0')
sputnik.package('spacy', '0.101.0', 'en>=1.1.0,<1.2.0')

Implementations:

sputnik.install: https://github.com/spacy-io/sputnik/blob/master/sputnik/__init__.py#L16
sputnik.package: https://github.com/spacy-io/sputnik/blob/master/sputnik/__init__.py#L151

Okay, so sputnik.package would be better named sputnik.get_package. This function resolves a name string to a resource you've installed locally, and wraps it in a sputnik.package.Package object. The job of sputnik.install is to do the actual installation. It both downloads the data and places it in a directory where sputnik.package will expect to find it.

To complete the trace of how this all works:

  • You call nlp = spacy.load('en'), implemented here: https://github.com/spacy-io/spaCy/blob/master/spacy/__init__.py#L13
  • This calls spacy.util.get_package_by_name('en'), implemented here: https://github.com/spacy-io/spaCy/blob/master/spacy/util.py#L38
  • This results in a call to sputnik.package('spacy', '0.101.0', 'en>=1.1.0,<1.2.0'), implemented here: https://github.com/spacy-io/sputnik/blob/master/sputnik/__init__.py#L151
  • This creates a sputnik.pool.Pool object, implemented here: https://github.com/spacy-io/sputnik/blob/master/sputnik/pool.py
  • The Pool object then creates a Package object via a call pool.get('en>=1.1.0,<1.2.0'), implemented on the parent class PackageList here: https://github.com/spacy-io/sputnik/blob/master/sputnik/package_list.py#L54
  • The PackageList.get() method first calls PackageList.find(self, 'en'), in order to determine whether there are any candidates at all. If at least one candidate is found, it goes on to call PackageList.find(self, 'en>=1.1.0,<1.2.0'). It then sorts the resulting list. So, what's being sorted here, and how?
  • PackageList.find() iterates over the contents of the PackageList._packages member, added during PackageList.load(), which is called by PackageList.__init__. This iterates over the values returned by PackageList.packages(). The yield of this is given by Line 47: yield self.__class__.package_class(path=os.path.join(self.path, path)). It turns out that the relevant class member is hard-coded to be PackageStub.
  • Inside this find method, we're iterating over instances of sputnik.package.PackageStub. Any instances that match the version constraints will be returned. We then sort this list, in order to return the maximum. The comparison methods can be found here: https://github.com/spacy-io/sputnik/blob/master/sputnik/package_stub.py#L91 . This turns out to call into the semver library, which reads and interprets the versioning strings.
  • In summary: the sputnik.package() function returns a sputnik.package_stub.PackageStub instance. Version constraints are consulted, and the highest matching version is returned. The candidates considered for the matching are produced from a directory listing.
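
The lookup summarised in the last bullet (filter a directory listing by version constraints, then take the highest match) can be sketched roughly as follows. This is not sputnik's code: the function names are hypothetical, the `name-x.y.z` directory naming is an assumption, and plain tuple comparison stands in for the semver library.

```python
import re

# Comparison operators this sketch understands.
_OPS = {
    "==": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    "<=": lambda a, b: a <= b,
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
}
_SPEC = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)(.*)$")
# Two-character operators are listed first so ">=" wins over ">".
_CLAUSE = re.compile(r"^(==|>=|<=|>|<)(\d+\.\d+\.\d+)$")


def parse_version(text):
    """Turn '1.1.0' into (1, 1, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_spec(spec):
    """Split 'en>=1.1.0,<1.2.0' into a name and a list of (op, version)."""
    match = _SPEC.match(spec)
    if not match:
        raise ValueError("invalid package spec: %s" % spec)
    name, rest = match.groups()
    clauses = []
    for clause in filter(None, rest.split(",")):
        parsed = _CLAUSE.match(clause)
        if not parsed:
            raise ValueError("invalid constraint: %s" % clause)
        clauses.append((parsed.group(1), parse_version(parsed.group(2))))
    return name, clauses


def best_match(dir_names, spec):
    """Return the highest-versioned entry that satisfies spec, or None."""
    name, clauses = parse_spec(spec)
    candidates = []
    for entry in dir_names:
        entry_name, _, entry_version = entry.rpartition("-")
        if entry_name != name:
            continue
        try:
            version = parse_version(entry_version)
        except ValueError:
            continue  # not a 'name-x.y.z' directory; ignore it
        if all(_OPS[op](version, bound) for op, bound in clauses):
            candidates.append((version, entry))
    return max(candidates)[1] if candidates else None
```

With a listing such as `["en-1.0.0", "en-1.1.0", "en-1.1.3", "en-1.2.0"]`, the spec `en>=1.1.0,<1.2.0` selects `en-1.1.3`.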

I'm still working out the details of how the data path works, for the local directory.

Thanks for your work on this. This is the unfun part of software development, and we don't forget you're doing this as a public service.

+1 to above. We are _very much_ looking forward to this becoming functional again - spaCy is so much faster than NLTK!

@honnibal do you know the name of the old S3 bucket? If it was accessible via http, there could be something done about it.

Also, why not just use sputnik to create a new package to repopulate a new S3 bucket? I've used sputnik with my own S3 bucket before, as a test of whether we could use it to manage our own data in a semver-style system.

+1. This would be great. Is there an ETA for this?

For anyone experiencing this, I have a temporary workaround. I have pulled the downloaded English models from my local spaCy installation. You can download them below. To use the model data, navigate to wherever spaCy is installed, unzip the folder I supplied, and drag the data folder into your spaCy directory.

http://www.filedropper.com/data2
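
The manual unzip-and-drag step can also be scripted. The following is a rough sketch under stated assumptions: the archive contains a folder named `data` somewhere in its tree (its exact layout is not described in this thread), and `install_data_from_zip` is a hypothetical helper, not part of spaCy.

```python
import os
import shutil
import tempfile
import zipfile


def install_data_from_zip(zip_path, package_dir):
    """Copy the first 'data' folder found inside zip_path into package_dir.

    Raises ValueError if the archive has no 'data' folder, and
    FileExistsError (from shutil.copytree) if package_dir/data already
    exists, so an existing installation is never silently overwritten.
    """
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(scratch)
        source = None
        # os.walk is top-down, so the shallowest 'data' folder is found first.
        for root, dirs, _ in os.walk(scratch):
            if "data" in dirs:
                source = os.path.join(root, "data")
                break
        if source is None:
            raise ValueError("no 'data' folder found in %s" % zip_path)
        target = os.path.join(package_dir, "data")
        shutil.copytree(source, target)
    return target
```

The spaCy directory itself can be located with `os.path.dirname(spacy.__file__)` after `import spacy`, and passed as `package_dir`.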

Thank you so much, @RobertChristopher !

@RobertChristopher , I can't unzip the zip file. Does anyone else have this issue as well?
Thanks anyway

Current status:

I've collected most of the models, and have initialised an s3 bucket named index.spacy.io. I've created a CNAME record mapping index.spacy.io to index.spacy.io.s3.amazonaws.com. I've temporarily set full permissions, and I've verified that I can write to it via aws s3 cp.

I'm now trying to upload packages as follows:

sputnik --name spacy --data-path /tmp upload spacy_models/en-1.1.0/

But this command is failing --- I'm getting a 403 error. Here's the part of the traceback within Sputnik:

Traceback (most recent call last):
  File "sputnik/.env/bin/sputnik", line 9, in <module>
    load_entry_point('sputnik==0.9.3', 'console_scripts', 'sputnik')()
  File "sputnik/sputnik/__main__.py", line 12, in main
    args.run(args)
  File "sputnik/sputnik/cli.py", line 124, in run
    repository_url=args.repository_url)
  File "sputnik/sputnik/__init__.py", line 134, in upload
    return index.upload(expand_path(package_path))
  File "sputnik/sputnik/index.py", line 44, in upload
    response = session.open(request, 'utf8')
  File "sputnik/sputnik/session.py", line 43, in open
    r = self.opener.open(request)

@marcotcr I'm sorry about that. Just pushed a fix, should work now. Check the new download link.

@honnibal ah, it's a bit complex. If you look at the upload function, it calls an endpoint on a server, which gives it the name of an S3 bucket that it then uploads to using boto.

https://github.com/spacy-io/sputnik/blob/master/sputnik/index.py#L33

Two things can be done: either bypass that code, upload directly for now, and manually recreate the index; or find the code for the sputnik index server and run it somewhere. I'm not sure where that code is located, which is why I didn't end up using sputnik to manage data.

@viksit Aaah, this was dumb. I don't know how I missed this. Okay published the server repository:

https://github.com/spacy-io/sputnik-server

Hey !
I am trying to download the corpus for the first time and am getting a urllib2.HTTPError: HTTP Error 403: Forbidden. Is it also related to the aforementioned problem?

Looking forward to testing it :)

Note : Thanks a lot @RobertChristopher

Quick status report:

I've pointed the index.spacy.io subdomain to my server, which will run the REST service. Currently sorting out some Python 2/3 and SSL administrivia. Fingers crossed I'll be able to start uploading today.

Hm. This is expecting a DynamoDB service, which I haven't set up yet. Looking into this.

The following should work now.

sputnik --name spacy --repository-url http://index.spacy.io install en

I still have an https problem, which is why the repository URL needs to be specified --- to force http. I also have to upload much of the data back to the server. But for now this should work.
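
If you need to run this from a script (for example during deployment), here is a small sketch that builds the same command line. The flags are exactly the ones in the command above; the helper names `build_install_command` and `install` are hypothetical.

```python
import subprocess


def build_install_command(package_spec,
                          repository_url="http://index.spacy.io",
                          name="spacy"):
    """Return the argument list for the sputnik CLI call shown above.

    The http:// URL is passed explicitly to avoid the default https
    endpoint, which is the workaround described in this comment.
    """
    return [
        "sputnik",
        "--name", name,
        "--repository-url", repository_url,
        "install", package_spec,
    ]


def install(package_spec, **kwargs):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_install_command(package_spec, **kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting issues, which matters for specs that contain characters like `>` or `<`.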

Hey, I'm pretty new to spaCy in general, so this might be a separate issue, but when I try to run this latest fix

sputnik --name spacy --repository-url http://index.spacy.io install en

I end up with this error

Traceback (most recent call last):
  File "/usr/local/bin/sputnik", line 9, in <module>
    load_entry_point('sputnik==0.9.3', 'console_scripts', 'sputnik')()
  File "build/bdist.linux-x86_64/egg/sputnik/__main__.py", line 12, in main
  File "build/bdist.linux-x86_64/egg/sputnik/cli.py", line 45, in run
  File "build/bdist.linux-x86_64/egg/sputnik/__init__.py", line 37, in install
  File "build/bdist.linux-x86_64/egg/sputnik/index.py", line 103, in update
  File "build/bdist.linux-x86_64/egg/sputnik/cache.py", line 38, in update
AssertionError

Thanks for everything so far!

Hm. Try now?

Different error now at least!

Traceback (most recent call last):
  File "/usr/local/bin/sputnik", line 9, in <module>
    load_entry_point('sputnik==0.9.3', 'console_scripts', 'sputnik')()
  File "build/bdist.linux-x86_64/egg/sputnik/__main__.py", line 12, in main
  File "build/bdist.linux-x86_64/egg/sputnik/cli.py", line 45, in run
  File "build/bdist.linux-x86_64/egg/sputnik/__init__.py", line 44, in install
  File "build/bdist.linux-x86_64/egg/sputnik/cache.py", line 57, in fetch
  File "build/bdist.linux-x86_64/egg/sputnik/package_list.py", line 59, in get
  File "build/bdist.linux-x86_64/egg/sputnik/package_list.py", line 68, in find
  File "build/bdist.linux-x86_64/egg/sputnik/util.py", line 47, in constraint_match
sputnik.util.InvalidConstraintException: invalid constraint: -1.1.0

Sputnik did not work for me, but I was able to download en with
python -m spacy.en.download --force
Still,
sputnik --name spacy install en_glove_cc_300_1m_vectors-1.0.0
does not work.
Thanks!

python -m spacy.en.download --force also worked for me, thanks!

O_o
Not sure what's going on with the --force command, but okay!

I don't have a copy of the word vectors =/. Does someone have this backed up?

I have one. Can get it to you in an hour or so.

Thanks! Uploaded.

I'm on a Mac, and neither the sputnik command nor python -m spacy.en.download --force works so far:

sputnik --name spacy --repository-url http://index.spacy.io install en-1.1.0
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/sputnik", line 11, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/__main__.py", line 12, in main
    args.run(args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/cli.py", line 45, in run
    repository_url=args.repository_url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/__init__.py", line 44, in install
    archive = cache.fetch(package_name)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/cache.py", line 57, in fetch
    package = self.get(package_string)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/package_list.py", line 59, in get
    candidates = sorted(self.find(package_string))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/package_list.py", line 68, in find
    if not name or name == package.name and util.constraint_match(constraint, package.version):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/util.py", line 47, in constraint_match
    raise InvalidConstraintException('invalid constraint: %s' % c)
sputnik.util.InvalidConstraintException: invalid constraint: -1.1.0


$ python -m spacy.en.download --force
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/en/download.py", line 13, in <module>
    plac.call(main)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/en/download.py", line 9, in main
    download('en', force)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/spacy/download.py", line 24, in download
    package = sputnik.install(about.__title__, about.version, about.models[lang])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/__init__.py", line 37, in install
    index.update()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/index.py", line 84, in update
    index = json.load(session.open(request, 'utf8'))
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sputnik/session.py", line 43, in open
    r = self.opener.open(request)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1241, in https_open
    context=self._context)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
urllib2.URLError:

Hm. Try

sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0
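
The error message "invalid constraint: -1.1.0" suggests that the spec en-1.1.0 was split into the package name en plus a leftover -1.1.0, which is not a valid comparison, whereas en==1.1.0 carries a proper operator. The sketch below reproduces that behaviour under this assumption. It is not sputnik's parsing code, and `split_spec` is a hypothetical helper.

```python
import re

_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*")
_CONSTRAINT = re.compile(r"^(==|>=|<=|>|<)\d+(\.\d+)*$")


def split_spec(spec):
    """Split a spec such as 'en==1.1.0' into (name, [constraints]).

    Everything after the leading identifier is treated as a
    comma-separated list of constraints, each of which must start
    with a comparison operator.
    """
    match = _NAME.match(spec)
    if not match:
        raise ValueError("invalid package spec: %s" % spec)
    name = match.group(0)
    constraints = [c for c in spec[match.end():].split(",") if c]
    for constraint in constraints:
        if not _CONSTRAINT.match(constraint):
            raise ValueError("invalid constraint: %s" % constraint)
    return name, constraints
```

Under this reading, a hyphen is only a separator in the installed directory name (en-1.1.0), not in the spec passed to install.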

perfect - that worked, thanks Matthew!

$ sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0
Downloading...
Downloaded 532.28MB 100.00% 5.40MB/s eta 0s
archive.gz checksum/md5 OK
INFO:sputnik.pool:install en-1.1.0

Great! python -m spacy.en.download --force now works for me, too.

... does anyone have the German vectors lying around?

Try

sputnik --name spacy --repository-url http://index.spacy.io install de==1.0.0

I am trying to deploy spacy on AWS Elastic Beanstalk, similar to how it is done in displacy.

container_commands:
  install_spacey:
    command: python -m spacy.en.download --force

I get this error:

/Command install_spacey] : Activity execution failed, because: Traceback (most recent call last):
  File "/usr/lib64/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/python/run/venv/local/lib64/python3.4/site-packages/spacy/en/download.py", line 58, in <module>
    plac.call(main)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/plac_core.py", line 309, in call
    cmd, result = parser_from(obj).consume(arglist)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/plac_core.py", line 195, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/opt/python/run/venv/local/lib64/python3.4/site-packages/spacy/en/download.py", line 42, in main
    package = sputnik.install(about.__name__, about.version, about.default_model)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/sputnik/__init__.py", line 44, in install
    archive = cache.fetch(package_name)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/sputnik/cache.py", line 57, in fetch
    package = self.get(package_string)
  File "/opt/python/run/venv/local/lib/python3.4/site-packages/sputnik/package_list.py", line 61, in get
    raise PackageNotFoundException(package_string)
sputnik.package_list.PackageNotFoundException: en_default
(ElasticBeanstalk::ExternalInvocationError)

Does this version of the download command work for you?

sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0

Thanks @honnibal this appears to be working.

I tried to create my own AWS storage that would be compatible with sputnik but it was quite hard to reverse engineer the tool.

Would it be possible to add further information to the sputnik docs on how one could configure one's own storage? Then I could use my own servers (i.e. by specifying --repository-url).

It looks as if this tool would be really helpful for a range of projects but is currently quite focussed on spacy (e.g. the default values).

I've published the server code here: https://github.com/spacy-io/sputnik-server

Docs contributions would be welcome — otherwise, I hope to have some time next week. Basically you just need to run the server, and run a DynamoDB instance.

There are a few small details I'd like to change. In particular, it would be good if the sputnik client could cut the server out by being supplied an argument that specified the data store.

Do you know when it will be possible to use "python3 -m spacy.en.download"?

@honnibal nice idea. It would definitely be beneficial if it was server agnostic (though the etag checks would have to be removed in that case).

@honnibal I seem to have a problem downloading the model. I tried using sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0.
I get HTTP Error 403: Forbidden:

Traceback (most recent call last):
  File "/usr/local/bin/sputnik", line 9, in <module>
    load_entry_point('sputnik==0.9.3', 'console_scripts', 'sputnik')()
  File "build/bdist.linux-x86_64/egg/sputnik/__main__.py", line 12, in main
  File "build/bdist.linux-x86_64/egg/sputnik/cli.py", line 45, in run
  File "build/bdist.linux-x86_64/egg/sputnik/__init__.py", line 37, in install
  File "build/bdist.linux-x86_64/egg/sputnik/index.py", line 84, in update
  File "build/bdist.linux-x86_64/egg/sputnik/session.py", line 43, in open
  File "/usr/lib/python2.7/urllib2.py", line 435, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 548, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 473, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 556, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden

I also tried python -m spacy.en.download --force and python -m spacy.en.download. Again I get urllib2.HTTPError: HTTP Error 403: Forbidden.

@honnibal

I'm using sputnik --name spacy --repository-url http://index.spacy.io install en==1.1.0 on a Mac
still getting the error :

Traceback (most recent call last):
  File "/Users/mohit/anaconda3/envs/py34/bin/sputnik", line 6, in <module>
    sys.exit(sputnik.__main__.main())
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/site-packages/sputnik/__main__.py", line 12, in main
    args.run(args)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/site-packages/sputnik/cli.py", line 45, in run
    repository_url=args.repository_url)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/site-packages/sputnik/__init__.py", line 37, in install
    index.update()
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/site-packages/sputnik/index.py", line 84, in update
    index = json.load(session.open(request, 'utf8'))
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/site-packages/sputnik/session.py", line 43, in open
    r = self.opener.open(request)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/urllib/request.py", line 470, in open
    response = meth(req, response)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/urllib/request.py", line 580, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/urllib/request.py", line 508, in error
    return self._call_chain(*args)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/urllib/request.py", line 442, in _call_chain
    result = func(*args)
  File "/Users/mohit/anaconda3/envs/py34/lib/python3.4/urllib/request.py", line 588, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

Should hopefully be fixed now – just posted this in #507:

We've been moving the site over to new servers and fixing the SSL certificate to finally address the model downloading issues. The problem was that @honnibal was using CloudFlare's "flexible" SSL option. We've now got rid of that and installed a certificate from Let's Encrypt (which is great, btw!).

Make sure to flush your DNS cache before you reload and try again:
In Chrome: chrome://net-internals/#dns
On Mac / OSX 10.11+: sudo dscacheutil -flushcache

Alternatively, you can also use Google Public DNS, which has already updated. That's also what we used for debugging internally for the past few hours. Unfortunately, the DNS is still propagating and seems to be taking forever (see here for the current status).

If you're still having problems, let us know over at #507.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
