Gensim: Accessing vector model vocabulary broken in Gensim 3.3 when loading from word2vec format

Created on 7 Feb 2018 · 11 Comments · Source: RaRe-Technologies/gensim

After upgrading to 3.3.0, it is no longer possible to get the model's vocabulary via the model.wv.vocab attribute if the model is loaded from a text or binary word2vec file. However, it still works for models saved in the Gensim native format.
I suppose this is related to the redesign of the vector model implementations in #1777. In any case, it is not good to break compatibility this way without even notifying users.

Steps to Reproduce

import logging
import gensim
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.KeyedVectors.load_word2vec_format('ANY_MODEL.bin.gz', binary=True)
WORD = 'example'  # placeholder: any word to look up
WORD in model.wv.vocab

Expected Results

True or False, as it is in Gensim 3.2

Actual Results

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'wv'

Versions

Linux-4.13.0-32-generic-x86_64-with-LinuxMint-18.2-sonya
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 1
Labels: bug, difficulty easy

All 11 comments

@akutuzov thanks for the report! Sorry for this, we did not plan anything to break (but this happens :( ).

CC: @manneshiva

Hi @akutuzov,
Thanks for reporting this issue. This will be fixed very soon (a couple of hours from now). I tested your code in gensim 3.2.0 and saw that the expression model is model.wv evaluates to True there. So, for the time being, you could use model.vocab instead of model.wv.vocab (and likewise for any other property).
I seem to have missed the self-referential property for KeyedVectors -- https://github.com/RaRe-Technologies/gensim/blob/3.2.0/gensim/models/keyedvectors.py#L422. Not sure about the purpose of this property. Will add it back for backward compatibility.
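For readers unfamiliar with the property mentioned above, here is a minimal sketch of what a self-referential wv property looks like. The class is a hypothetical stand-in, not gensim's actual KeyedVectors implementation; it only illustrates why model.wv.vocab and model.vocab resolve to the same object when such a property is present.

```python
class StandInKeyedVectors:
    """Hypothetical stand-in for a vectors-only object (not the gensim class)."""

    def __init__(self, vocab):
        # Maps each word to some per-word record (here just an index).
        self.vocab = dict(vocab)

    @property
    def wv(self):
        # Self-referential: the vectors object simply returns itself,
        # so code written as `model.wv.vocab` keeps working.
        return self


kv = StandInKeyedVectors({"king": 0, "queen": 1})
print(kv.wv is kv)            # both names refer to the same object
print("king" in kv.wv.vocab)  # lookup through .wv
print("king" in kv.vocab)     # direct lookup
```

Without the property, the first lookup style raises AttributeError, which matches the traceback in the report.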

@manneshiva thanks!
So, model.wv.vocab is deprecated now, and we should use model.vocab instead, right?

@akutuzov exactly

If model.wv.vocab is deprecated and we should always use model.vocab, why does model.vocab not work for word2vec models saved in the Gensim native format?

model = gensim.models.Word2Vec.load(MODELFILE)
print(len(model.vocab))
AttributeError: 'Word2Vec' object has no attribute 'vocab'
print(len(model.wv.vocab))
237255

I use Gensim 3.4.0 both for training and for loading the models.

The funny thing is that if the same model is saved in word2vec format and loaded via gensim.models.KeyedVectors.load_word2vec_format, then both model.vocab and model.wv.vocab work.
So, is there any recommended way to access the model's vocabulary independent of how the model was loaded?
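One possible workaround, offered as a sketch rather than an official recommendation, is to resolve the vectors object first and only then read its vocabulary. The helper names get_keyed_vectors and in_vocabulary are hypothetical, and the two classes are stand-ins that mimic the attribute layout described in this thread (a full model exposing .wv only, a vectors-only object exposing .vocab only). It assumes the vectors object has a vocab attribute, as in the 3.x releases discussed here.

```python
def get_keyed_vectors(model):
    """Return model.wv if it exists, otherwise assume model is the vectors object."""
    return getattr(model, "wv", model)


def in_vocabulary(model, word):
    """Check membership without caring how the model was loaded."""
    return word in get_keyed_vectors(model).vocab


class VectorsStandIn:
    """Stand-in for an object loaded from word2vec format: has .vocab, no .wv."""
    def __init__(self, vocab):
        self.vocab = dict(vocab)


class FullModelStandIn:
    """Stand-in for a natively saved model: has .wv, no .vocab of its own."""
    def __init__(self, vocab):
        self.wv = VectorsStandIn(vocab)


words = {"king": 0, "queen": 1}
for loaded in (VectorsStandIn(words), FullModelStandIn(words)):
    print(type(loaded).__name__, in_vocabulary(loaded, "king"))
```

With a real model, the same helper would be called as in_vocabulary(model, WORD) after either loading path.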

What if I want to update a model loaded with gensim.models.KeyedVectors.load_word2vec_format using new sentences?
I tried the following, but it raises an error:

model.build_vocab(more_sentences, update=True)
AttributeError: 'Word2VecKeyedVectors' object has no attribute 'build_vocab'

@akutuzov Sounds like a (nasty) bug to me. Can you replicate this in 3.5.0?

@menshikh-iv if the bug is still there, should we re-open this issue?

@rachhitgarg see the documentation under https://radimrehurek.com/gensim/models/word2vec.html#usage-examples

@piskvorky Yes, nothing has changed in 3.5.0 in this respect. The bug still reproduces: for some weird reason, model.vocab does not work for word2vec models saved in the Gensim native format.

Thanks @akutuzov. @menshikh-iv I'm re-opening this ticket; this sounds serious to critical. Do we have a unit test for load-after-save?

@rachhitgarg please stop posting this to unrelated issues. I answered you at https://github.com/RaRe-Technologies/gensim/issues/1994#issuecomment-417164089

@piskvorky yes, many different ones; just Ctrl+F Word2vec.load in https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_word2vec.py (but the case mentioned by @akutuzov is not covered)
