I get this error while loading wiki.en.vec from the FastText pre-trained word vectors (the wiki.en model published by Facebook).
2017-06-23 16:41:40,834 : INFO : loading Word2Vec object from /Volumes/Dataset/word2vec/wiki.en/wiki.en.vec
Traceback (most recent call last):
File "loadlyricsmodel.py", line 45, in <module>
model = Word2Vec.load( model_filepath )
File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1382, in load
model = super(Word2Vec, cls).load(*args, **kwargs)
File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 271, in load
obj = unpickle(fname)
File "/Users/loretoparisi/Documents/Projects/word2vec/.env/lib/python2.7/site-packages/gensim/utils.py", line 935, in unpickle
return _pickle.loads(f.read())
cPickle.UnpicklingError: unpickling stack underflow
The file was loaded with
model = Word2Vec.load( model_filepath )
I'm using
gensim-2.2.0-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl
Word2Vec.load() only loads models saved from gensim. (It uses Python pickling.)
I believe that .vec file is in the format used by the original Google word2vec.c (and now FastText) for its top-level vectors, so KeyedVectors.load_word2vec_format() may work, perhaps with a binary=False parameter.
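For reference, the text format that .vec files follow (one header line, then one word-plus-floats line per vocabulary entry) can be sketched in plain Python. This is only an illustration of the layout, not gensim's actual loader, and it shows why Word2Vec.load() (which expects a Python pickle) chokes on such a file while load_word2vec_format() can read it:

```python
# Sketch of the word2vec.c / FastText ".vec" text format:
# first line is "<vocab_size> <dimensions>", then one line per word:
# "<word> <float> <float> ...". Illustrative only, not gensim code.
import io

def write_vec(fh, vectors):
    """Write a dict of {word: [floats]} in word2vec text format."""
    dim = len(next(iter(vectors.values())))
    fh.write("%d %d\n" % (len(vectors), dim))
    for word, vec in vectors.items():
        fh.write(word + " " + " ".join("%.4f" % v for v in vec) + "\n")

def read_vec(fh):
    """Parse the word2vec text format back into a dict."""
    n_words, dim = map(int, fh.readline().split())
    vectors = {}
    for _ in range(n_words):
        parts = fh.readline().rstrip().split(" ")
        word, vec = parts[0], [float(x) for x in parts[1:]]
        assert len(vec) == dim
        vectors[word] = vec
    return vectors

buf = io.StringIO()
write_vec(buf, {"hello": [0.1, 0.2], "world": [0.3, 0.4]})
buf.seek(0)
print(read_vec(buf)["hello"])  # [0.1, 0.2]
```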
The method gensim.models.wrappers.fasttext.FastText.load_fasttext_format(), which also brings in the ngrams used for OOV word vector synthesis, may be of interest too... but I'm not sure it's yet doing the right thing in the released gensim, as compared to PR-in-progress #1341.
@jayantj @prakhar2b wdyt?
@gojomo yes, KeyedVectors.load_word2vec_format() will definitely work here, and binary=False is the default parameter anyway.
As for OOV word synthesis, what do you mean by "not sure if it's yet doing the right thing in the released gensim"? I think for OOV we need the n-gram information, which is provided in the .bin file.
As of now, gensim.models.wrappers.fasttext.FastText.load_fasttext_format() loads the complete model using both the .vec and .bin files. With PR #1341, we will need only the .bin file; all other functionality will remain the same, I believe.
cc @jayantj @menshikh-iv
Yes, with the .bin AND the .vec file, you can load the complete model using -
from gensim.models.wrappers.fasttext import FastText
model = FastText.load_fasttext_format('/path/to/model') # without the .bin/.vec extension
With the .vec file, you can load only the word vectors (and not the out-of-vocab word information) using -
from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('/path/to/model.vec') # with the .vec extension
@jayantj Thanks, let me try first with load_fasttext_format and the FastText wrapper.
@prakhar2b My "not sure" comment was regarding some discussion I saw on another issue or PR in progress, perhaps the one that's also discussing whether discarding untrained ngrams is a necessary optimization – I had the impression our calculation might be diverging from the original FB fastText on some (perhaps just OOV) words. (And even if that's defensible, because the untrained ngrams are still just random vectors, it might not be the 'right thing' overall: it may violate the user expectation that, whether a model is loaded into the original FT code or gensim's FT code, OOV words get the same vectors.)
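To make the OOV concern concrete, here is a toy sketch of the ngram-averaging idea behind FastText's out-of-vocabulary vectors, using plain character ngrams with "<"/">" boundary markers. The real fastText hashes ngrams into buckets rather than keying them by string, so this is only an illustration of why dropping some ngram vectors at load time would change the resulting OOV vectors:

```python
# Toy sketch (not gensim/fastText code): an OOV word's vector is
# built from the vectors of its character ngrams, here by simple
# averaging over whichever ngrams are present. If loading discards
# some ngram vectors, the average -- and hence the OOV vector --
# changes, which is the divergence being discussed.
def char_ngrams(word, nmin=3, nmax=6):
    """All character ngrams of the word, with boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(nmin, nmax + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim=2):
    """Average the known ngram vectors for an out-of-vocab word."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return [0.0] * dim  # nothing known: zero vector
    summed = [0.0] * dim
    for g in grams:
        for j, v in enumerate(ngram_vectors[g]):
            summed[j] += v
    return [s / len(grams) for s in summed]

# Hypothetical tiny ngram table, just for illustration.
ngram_vectors = {"<he": [1.0, 0.0], "hel": [0.0, 1.0], "llo": [1.0, 1.0]}
print(oov_vector("hello", ngram_vectors))  # average of the 3 matching ngrams
```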
We definitely want to follow whatever the original FT does -- the path of least surprise for anyone migrating / trying both.