Gensim: Puzzling deprecated warnings

Created on 6 Dec 2018 · 3 Comments · Source: RaRe-Technologies/gensim

I am using word vectors with gensim 3.4.0 and I am genuinely puzzled by the warnings I get when I try to load and use embeddings created with possibly older versions of gensim.
All of the models get loaded using gensim.models.KeyedVectors.load(file, mmap='r')

When I create a completely new instance using model=gensim.models.Word2Vec():

the model does not have a vectors or vectors_norm attribute, and doing "x" in model shows "DeprecationWarning: Call to deprecated __contains__ (Method will be removed in 4.0.0, use self.wv.__contains__() instead)."
accessing model.wv.syn0 reports "DeprecationWarning: Call to deprecated syn0 (Attribute will be removed in 4.0.0, use self.wv.vectors instead)."

So it looks as if I SHOULD always use model.wv and model.wv.vectors

But when I load glove embeddings that have been created probably with an older version of gensim:

when I do model.wv.vectors I get: "DeprecationWarning: Call to deprecated wv (Attribute will be removed in 4.0.0, use self instead)."

So this tells me I should NOT use wv. This model does have model.vectors, while others do not even have that attribute. Some have model.wv.syn0 or model.syn0, while still others don't.

I cannot figure out what the correct way is to make my code always work, and work with all models and not show conflicting deprecation messages ...



johann-petrak

Most helpful comment

A simple model=gensim.models.Word2Vec() won't yet have created the internal model.wv.vectors property, since that requires the vocabulary survey to have completed (as after calling build_vocab()).

A full Word2Vec model is different from just a set-of-word-vectors, and indeed it contains a set-of-word-vectors, in its wv property. So if you're loading either GloVe vectors, or a plain set of word2vec-trained word-vectors, your object will itself be some kind of KeyedVectors, and you wouldn't access any part of it via an extra wv property. It already is the word-vectors. If on the other hand you're loading a full Word2Vec model, you could conceivably do more training, or extract other details about the vocabulary or other internal weights of the model. Those operations would be performed directly on the Word2Vec model, whereas plain word-vector operations can and should go via the wv property (which can be saved separately, and probably should be saved separately if all you want is a set-of-word-vectors).

So I believe all the warnings you're seeing are technically correct, just a bit confusing because of three things: the way things used to be (no distinction between the thicker full model and a mere set-of-word-vectors), a bunch of still-present backward-compatibility options (which can generate warnings), and the convention of using the variable name model whether it's really a full trainable model or just a set-of-word-vectors. (Using contrasting variable names like w2v_model for full models, but just word_vecs for the simpler KeyedVectors instances, might help clear up the uses.)
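As a minimal sketch of that distinction, code that has to accept both kinds of object could normalize them once, up front, and only ever work with the plain word-vectors afterwards. Everything named here is hypothetical: plain_word_vectors is an illustrative helper, and StandInWord2Vec / StandInKeyedVectors are stand-ins so the example runs without gensim installed. The class-name check is an assumption based on the class names mentioned in this thread (KeyedVectors, Word2VecKeyedVectors, WordEmbeddingsKeyedVectors), not something gensim documents.

```python
class StandInKeyedVectors:
    """Hypothetical stand-in for a plain set of word-vectors (NOT gensim's class)."""

    def __init__(self, vectors):
        self.vectors = vectors


class StandInWord2Vec:
    """Hypothetical stand-in for a full model that keeps its word-vectors under `wv`."""

    def __init__(self, vectors):
        self.wv = StandInKeyedVectors(vectors)


def plain_word_vectors(obj):
    """Return the set-of-word-vectors for either a full model or a bare vectors object.

    Assumption: keyed-vectors classes can be recognised by a class name ending in
    "KeyedVectors". The check inspects class names only, so it never touches
    `obj.wv` on an object that already is the word-vectors.
    """
    is_keyed_vectors = any(
        cls.__name__.endswith("KeyedVectors") for cls in type(obj).__mro__
    )
    return obj if is_keyed_vectors else obj.wv


w2v_model = StandInWord2Vec([[0.1, 0.2]])      # "thicker" full model
word_vecs = StandInKeyedVectors([[0.3, 0.4]])  # mere set-of-word-vectors

vecs_from_model = plain_word_vectors(w2v_model)  # unwraps one level: w2v_model.wv
vecs_from_plain = plain_word_vectors(word_vecs)  # returned unchanged
```

With real gensim objects, an isinstance check against the appropriate keyed-vectors class would probably be more robust than matching on names; the name-based test is used here only to keep the sketch self-contained.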

gojomo on 13 Dec 2018


All 3 comments

@johann-petrak
I think you should use "x" in model.wv, as the first warning says ("DeprecationWarning: Call to deprecated __contains__ (Method will be removed in 4.0.0, use self.wv.__contains__() instead).").

The third warning is a strange one. The message "Call to deprecated wv (Attribute will be removed in 4.0.0, use self instead)" relates to the wv attribute of WordEmbeddingsKeyedVectors, not of Word2Vec. WordEmbeddingsKeyedVectors is the parent class of Word2VecKeyedVectors, an instance of which is assigned to the wv attribute of Word2Vec. It is the only deprecation of this kind in all of gensim. In theory this deprecation should be raised when you do something like model.wv.wv.
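To make the three access patterns concrete, here is a rough sketch that records which of them emit a DeprecationWarning. It does not import gensim: DemoKeyedVectors and DemoWord2Vec are hypothetical stand-ins that only imitate the behaviour described in this thread (a deprecated wv on the vectors object that returns the object itself), and deprecations_from is an illustrative helper built on the standard warnings module.

```python
import warnings


class DemoKeyedVectors:
    """Hypothetical stand-in (NOT gensim's class) with a deprecated, self-returning `wv`."""

    def __init__(self):
        self.vectors = [[0.1, 0.2]]

    @property
    def wv(self):
        # Imitates the message quoted in this thread.
        warnings.warn(
            "Call to deprecated `wv` (Attribute will be removed in 4.0.0, use self instead).",
            DeprecationWarning,
            stacklevel=2,
        )
        return self


class DemoWord2Vec:
    """Hypothetical stand-in for a full model; `wv` here is a plain attribute."""

    def __init__(self):
        self.wv = DemoKeyedVectors()


def deprecations_from(access):
    """Call `access()` and return the messages of any DeprecationWarnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure nothing is filtered out
        access()
    return [str(w.message) for w in caught if issubclass(w.category, DeprecationWarning)]


w2v_model = DemoWord2Vec()
word_vecs = DemoKeyedVectors()

full_model_hits = deprecations_from(lambda: w2v_model.wv.vectors)   # one hop on a full model
loaded_vecs_hits = deprecations_from(lambda: word_vecs.wv.vectors)  # `wv` on the vectors themselves
double_hop_hits = deprecations_from(lambda: w2v_model.wv.wv)        # the model.wv.wv case
```

Under these assumptions only the last two accesses are recorded, which matches the situation in the question: an object loaded with KeyedVectors.load is already the word-vectors, so going through wv on it is the equivalent of model.wv.wv.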