Spacy: Question: Difference between spaCy WordVec and Gensim/Google WordVec

Created on 13 Apr 2016 · 8 Comments · Source: explosion/spaCy

Hi,

Thanks a lot for your fantastic tool, keep up the good work!
I'd like to ask what the difference is between Google's word2vec library ( https://code.google.com/archive/p/word2vec/ ) and the word vectors you use in spaCy.

Kind regards

ArdavanA

👍1

All 8 comments

Google's word2vec is a tool that generates word vectors from text. spaCy makes it easy to load these and other word vectors so that you can use them in your NLP tasks.

By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model, but you can also load Google's word2vec or GloVe vectors. Please see this blog post for more details on how to do that:

https://spacy.io/docs/tutorials/load-new-word-vectors
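
For a rough idea of what "loading vectors" involves, here is a minimal sketch in plain Python. It is not spaCy's loader and not the code from the blog post; `read_text_vectors` is a hypothetical helper and the sample data is made up. It assumes the plain-text layout: one word per line followed by its float values, with an optional "<vocab_size> <dimensions>" header line as written by the word2vec tool in text mode (GloVe text files have no header). Binary word2vec files are not handled.

```python
import io


def read_text_vectors(lines):
    """Parse word vectors from an iterable of text lines (hypothetical helper).

    Each line is expected to hold a word followed by space-separated floats.
    A leading "<vocab_size> <dimensions>" header line is skipped if present.
    """
    vectors = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(' ')
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            # Looks like a word2vec text header such as "3000000 300".
            continue
        word, values = parts[0], parts[1:]
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny made-up sample in word2vec text format: 2 words, 3 dimensions.
sample = io.StringIO("2 3\nking 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\n")
vectors = read_text_vectors(sample)
print(sorted(vectors))
```

The resulting dictionary maps each word to a list of floats; how those values are then handed to spaCy depends on the spaCy version, so that step is left out here.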

elyase on 14 Apr 2016

Thanks Yasser

ArdavanA on 14 Apr 2016

Easiest way to load GloVe vectors is now:

import spacy

nlp = spacy.load('en', vectors='en_glove_cc_300_1m')

This will load a subset of the GloVe Common Crawl vectors: it'll give you vectors for 1 million words. This is a large vocabulary and you should get high coverage with it, without the crazy memory requirements of the original unpruned data.

This function isn't well documented yet, because we've only recently stabilised the API. I'll fix the blog post.

honnibal on 14 Apr 2016

This doesn't work and throws an exception:

name = 'en_glove_cc_300_1m'
def get_lang_class(name):
    lang = re.split('[^a-zA-Z0-9_]', name, 1)[0]
    if lang not in LANGUAGES:
        raise RuntimeError('Language not supported: %s' % lang)
RuntimeError: Language not supported: en_glove_cc_300_1m

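The reason is that the pattern '[^a-zA-Z0-9_]' never matches anything in 'en_glove_cc_300_1m', so the whole name is returned as the language. The regex should be just '_', which works for both 'en' and 'en_glove_cc_300_1m', returning the desired 'en' in each case.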
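
To make the difference between the two patterns concrete, here is a small standalone sketch. It is not spaCy code: `lang_from_name` is a hypothetical helper that mirrors only the single `re.split` line from the traceback, and the two patterns and the model name are the ones quoted above.

```python
import re


def lang_from_name(name, pattern):
    # Split at most once on the pattern and keep the first piece,
    # as the line in get_lang_class does.
    return re.split(pattern, name, maxsplit=1)[0]


ORIGINAL = '[^a-zA-Z0-9_]'  # pattern shown in the traceback
PROPOSED = '_'              # pattern suggested in this comment

for name in ('en', 'en_glove_cc_300_1m'):
    print(name, '->', lang_from_name(name, ORIGINAL), '|', lang_from_name(name, PROPOSED))
```

With the original pattern the underscore counts as an allowed character, so the name is never split; with '_' the first piece is the language code.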

However, even after fixing the regex, there is another exception:

name = 'en_glove_cc_300_1m', via = None
def get_package_by_name(name=None, via=None):
    if name is None:
        return
    lang = get_lang_class(name)
    try:
        return sputnik.package(about.title, about.version,
                               name, data_path=via)
    except PackageNotFoundException as e:
        raise RuntimeError("Model '%s' not installed. Please run 'python -m "
                           "%s.download' to install latest compatible "
                           "model." % (name, lang.module))
RuntimeError: Model 'en_glove_cc_300_1m' not installed. Please run 'python -m spacy.en.download' to install latest compatible model.

Running "python -m spacy.en.download --force all" doesn't help.

I'm running version 0.101.0.
Any thoughts?

aie0 on 11 May 2016

👍1

Ran into the same issue. Per @aie0's suggestion, I changed lang = re.split('[^a-zA-Z0-9_]', name, 1)[0] to lang = re.split('_', name, 1)[0]. I also used nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors') instead of nlp = spacy.load('en', vectors='en_glove_cc_300_1m'). The extra _vectors did the trick for me.

daylen on 16 May 2016

👍1

This should all be cleaned up in 1.0: the GloVe vectors are installed by default, and it's much easier to use different vectors.

honnibal on 20 Oct 2016

🎉1

I always get this error, even after installing 'en':
ValueError: Word vectors set to length 0. This may be because the data is not installed. If you haven't already, run
python -m spacy.en.download all
to install the data.