After upgrading to 3.3.0, it is no longer possible to get the model's vocabulary via the model.wv.vocab attribute if the model is loaded from a text or binary word2vec file. However, it still works for models saved in the Gensim native format.
I suppose it is related to the redesign of the vector model implementations in #1777. Anyway, it is not good to break compatibility in this way, without even notifying users.
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.KeyedVectors.load_word2vec_format('ANY_MODEL.bin.gz', binary=True)
WORD in model.wv.vocab
Expected result: True or False, as in Gensim 3.2.
Actual result:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'wv'
Linux-4.13.0-32-generic-x86_64-with-LinuxMint-18.2-sonya
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1
@akutuzov thanks for the report! Sorry for this, we did not plan anything to break (but this happens :( ).
CC: @manneshiva
Hi @akutuzov,
Thanks for reporting this issue. This shall be fixed very soon (a couple of hours from now). I tested your code in gensim 3.2.0 and saw that model is model.wv returns True. So, for the time being, you could use model.vocab instead of model.wv.vocab (or any other property).
I seem to have missed the self-referential property for KeyedVectors -- https://github.com/RaRe-Technologies/gensim/blob/3.2.0/gensim/models/keyedvectors.py#L422. Not sure about the purpose of this property. Will add it back for backward compatibility.
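For readers wondering what a "self-referential property" means here, the following is a minimal sketch, not the actual Gensim source. The class name KeyedVectorsSketch is hypothetical and stands in for the real KeyedVectors class; the only point is that a wv property returning self lets model.wv.vocab and model.vocab refer to the same object.

```python
class KeyedVectorsSketch:
    """Hypothetical stand-in for a KeyedVectors-like class (illustration only)."""

    def __init__(self):
        # In Gensim 3.x the vocabulary is a dict-like mapping from word to metadata.
        self.vocab = {}

    @property
    def wv(self):
        # Self-referential property: `obj.wv` is simply `obj`,
        # so code written as `obj.wv.vocab` keeps working.
        return self


kv = KeyedVectorsSketch()
kv.vocab["example"] = 0
print(kv.wv is kv)              # same object
print("example" in kv.wv.vocab)  # membership test through the alias
```

With such a property in place, old code that goes through .wv and new code that accesses the attributes directly would both behave the same on a KeyedVectors-like object.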
@manneshiva thanks!
So, model.wv.vocab is deprecated now, and we should use model.vocab instead, right?
@akutuzov exactly
If model.wv.vocab is deprecated and we should always use model.vocab, why does model.vocab not work for word2vec models saved in the Gensim native format?
model = gensim.models.Word2Vec.load(MODELFILE)
print(len(model.vocab))
AttributeError: 'Word2Vec' object has no attribute 'vocab'
print(len(model.wv.vocab))
237255
I use Gensim 3.4.0 both for training and for loading the models.
The funny thing is that if the same model is saved in word2vec format and loaded via gensim.models.KeyedVectors.load_word2vec_format, then both model.vocab and model.wv.vocab work.
So, is there any recommended way to access the model's vocabulary independent of how the model was loaded?
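One possible workaround, sketched here under assumptions rather than as an officially recommended API, is a small helper that falls back to the object itself when it has no wv attribute. The helper names get_keyed_vectors and get_vocab are hypothetical and not part of Gensim; the sketch only relies on the behaviour reported in this thread (full Word2Vec models expose .wv.vocab, while KeyedVectors objects expose .vocab directly).

```python
def get_keyed_vectors(model):
    """Return the object holding the word vectors and vocabulary.

    Hypothetical helper: a full Word2Vec model keeps its vectors under
    `.wv`; a KeyedVectors object is already that container, so we fall
    back to the object itself when `.wv` is missing.
    """
    return getattr(model, "wv", model)


def get_vocab(model):
    """Return the vocabulary mapping regardless of how the model was loaded."""
    return get_keyed_vectors(model).vocab
```

Assuming the behaviour described above, get_vocab(model) could then be used both after gensim.models.Word2Vec.load(...) and after gensim.models.KeyedVectors.load_word2vec_format(...), for example as WORD in get_vocab(model).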
What if I want to update a model loaded with gensim.models.KeyedVectors.load_word2vec_format using new sentences?
I tried the following, but it raises an error:
model.build_vocab(more_sentences, update=True)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'build_vocab'
@akutuzov Sounds like a (nasty) bug to me. Can you replicate this in 3.5.0?
@menshikh-iv if the bug is still there, should we re-open this issue?
@rachhitgarg see the documentation under https://radimrehurek.com/gensim/models/word2vec.html#usage-examples
@piskvorky Yes, nothing has changed in 3.5.0 in this respect. The bug is still reproducible: for some weird reason model.vocab does not work for _word2vec_ models saved in the Gensim native format.
Thanks @akutuzov. @menshikh-iv I'm re-opening this ticket; this sounds serious to critical. Do we have a unit test for load-after-save?
@rachhitgarg please stop posting this to unrelated issues; I answered you at https://github.com/RaRe-Technologies/gensim/issues/1994#issuecomment-417164089
@piskvorky yes, we have many different ones; just Ctrl+F for Word2vec.load in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_word2vec.py (but the case mentioned by @akutuzov is not covered)