spaCy: Question: Difference between spaCy word vectors and Gensim/Google word2vec

Created on 13 Apr 2016 · 8 Comments · Source: explosion/spaCy

Hi,

Thanks a lot for your fantastic tool, keep up the good work!
I would like to ask what the difference is between Google's word2vec library (https://code.google.com/archive/p/word2vec/) and the word vectors you use in spaCy.

Kind regards

All 8 comments

Google's word2vec is a tool that generates word vectors from text. spaCy makes it easy to load these and other word vectors so that you can use them in your NLP tasks.

By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model, but you can also load Google's word2vec or GloVe vectors. Please see this tutorial for more details on how to do that:

https://spacy.io/docs/tutorials/load-new-word-vectors
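To make "word vectors" concrete, here is a minimal sketch that reads the plain-text format the word2vec tool can write (a header line with the vocabulary size and dimension, then one word per line followed by its components) and compares words by cosine similarity. It is not spaCy's loader: the function names `read_word2vec_text` and `cosine` and the toy vectors are made up for illustration.

```python
import io
import math


def read_word2vec_text(handle):
    """Read vectors stored in the plain-text word2vec format.

    The first line gives the vocabulary size and the vector dimension.
    Each following line holds a word and its space-separated components.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only.
toy_file = io.StringIO(
    "3 3\n"
    "king 1.0 2.0 0.0\n"
    "queen 2.0 4.0 0.0\n"
    "apple 0.0 0.0 1.0\n"
)
vectors = read_word2vec_text(toy_file)
print(sorted(vectors))
print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0: same direction
print(cosine(vectors["king"], vectors["apple"]))  # 0.0: orthogonal
```

Real vector files have hundreds of dimensions and a far larger vocabulary, so in practice you would let spaCy or Gensim do the loading rather than a hand-written parser like this one.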

Thanks Yasser

The easiest way to load GloVe vectors is now:

import spacy

nlp = spacy.load('en', vectors='en_glove_cc_300_1m')

This will load a subset of the GloVe Common Crawl vectors: it'll give you vectors for 1 million words. This is a large vocabulary, so you should get high coverage without the crazy memory requirements of the original unpruned data.

This function isn't well documented yet, because we've only recently stabilised the API. I'll fix the blog post.
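"Coverage" here can be read as the share of tokens in your text that have an entry in the vector table. A rough sketch of how one might measure it, assuming the vocabulary is available as a plain Python set (the function name `vector_coverage` and the toy data are hypothetical, not part of spaCy):

```python
def vector_coverage(tokens, vector_vocab):
    """Return the fraction of tokens that have an entry in vector_vocab."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in vector_vocab)
    return known / len(tokens)


# Toy vocabulary standing in for the words that have vectors.
vector_vocab = {"the", "cat", "sat", "on", "mat"}
tokens = "the cat sat on the zyzzyva".split()

# Five of the six tokens are in the vocabulary.
print(vector_coverage(tokens, vector_vocab))
```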

This doesn't work and throws an exception:

name = 'en_glove_cc_300_1m'
def get_lang_class(name):
    lang = re.split('[^a-zA-Z0-9_]', name, 1)[0]
    if lang not in LANGUAGES:
        raise RuntimeError('Language not supported: %s' % lang)
RuntimeError: Language not supported: en_glove_cc_300_1m

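The reason is that the character class '[^a-zA-Z0-9_]' treats the underscore as part of the name, so 'en_glove_cc_300_1m' is never split. The regex should be just '_', which works for both 'en' and 'en_glove_cc_300_1m', returning the desired 'en' in each case.

However, even after fixing the regex, there is another exception: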
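The behaviour of the two patterns can be checked with the standard `re` module alone. This is a small standalone sketch, not the spaCy source: the `LANGUAGES` lookup is left out and the variable names `original` and `proposed` are made up here.

```python
import re

name = 'en_glove_cc_300_1m'

# Original pattern: split on the first character that is NOT a letter,
# digit, or underscore. The name contains no such character, so nothing
# is split and the whole string comes back.
original = re.split('[^a-zA-Z0-9_]', name, 1)[0]

# Proposed pattern: split on the first underscore only.
proposed = re.split('_', name, 1)[0]

print(original)                   # en_glove_cc_300_1m
print(proposed)                   # en
print(re.split('_', 'en', 1)[0])  # en
```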

name = 'en_glove_cc_300_1m', via = None
def get_package_by_name(name=None, via=None):
    if name is None:
        return
    lang = get_lang_class(name)
    try:
        return sputnik.package(about.title, about.version,
                               name, data_path=via)
    except PackageNotFoundException as e:
        raise RuntimeError("Model '%s' not installed. Please run 'python -m "
                           "%s.download' to install latest compatible "
                           "model." % (name, lang.module))
RuntimeError: Model 'en_glove_cc_300_1m' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

Running "python -m spacy.en.download --force all" doesn't help.

I'm running version 0.101.0.
Any thoughts?

I ran into the same issue. Per @aie0's suggestion, I switched lang = re.split('[^a-zA-Z0-9_]', name, 1)[0] to lang = re.split('_', name, 1)[0]. I also used nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors') instead of nlp = spacy.load('en', vectors='en_glove_cc_300_1m'). The extra _vectors suffix did the trick for me.

This should all be cleaned up in 1.0: the GloVe vectors are installed by default, and it's much easier to use different vectors.

I always get this error, even after installing 'en':
ValueError: Word vectors set to length 0. This may be because the data is not installed. If you haven't already, run
python -m spacy.en.download all
to install the data.

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
