Gensim: Gensim error while loading Hebrew

Created on 3 May 2017 · 9 comments · Source: RaRe-Technologies/gensim

Description

Gensim error while loading Hebrew

Steps/Code/Corpus to Reproduce

from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format('wiki.he')

Expected Results

The model loads without error.

Actual Results

AssertionError                            Traceback (most recent call last)
<ipython-input> in <module>()
      2
      3 #num_dims = 300
----> 4 model = FastText.load_fasttext_format('wiki.he')

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257

/home/vmadmin/anaconda3/lib/python3.5/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
274 (vocab_size, nwords, _) = self.struct_unpack(file_handle, '@3i')
275 # Vocab stored by Dictionary::save
--> 276 assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'
277 assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
278 ntokens, = self.struct_unpack(file_handle, '@q')

AssertionError: mismatch between vocab sizes

Versions

Linux-4.4.0-75-generic-x86_64-with-debian-stretch-sid
Python 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
NumPy 1.11.1
SciPy 0.18.1
gensim 1.0.1
FAST_VERSION 2

All 9 comments

Looks a bit like https://github.com/RaRe-Technologies/gensim/issues/1236. The error message might seem to imply a problem with how fastText produces the data. Although the description above used gensim 1.0.1, the error also reproduces with gensim 2.0.0.

File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 239, in load_fasttext_format
model.load_binary_data('%s.bin' % model_file, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 256, in load_binary_data
self.load_dict(f, encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
assert len(self.wv.vocab) == nwords, 'mismatch between vocab sizes'

Notably this doesn't happen with all pretrained (Hebrew) embeddings created by fasttext.

Has this issue been resolved?
If yes, can you please share the reference?

Hi @kewlcoder This is not an issue with the gensim wrapper but with the trained FastText model: there is a mismatch in vocab between the .bin and .vec files. @prakhar2b, could you please investigate and raise an issue in the FastText repo?

Please use FastText.load_word2vec_format('FILENAME.vec') as a workaround for now.
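A sketch of that workaround, wrapped in a helper so it can be read without gensim installed. The path default reuses the `wiki.he` files from the original report; note that loading only the .vec file discards the subword n-gram data in the .bin, so out-of-vocabulary lookups will not work:

```python
def load_vectors_only(path='wiki.he'):
    """Load only the text-format vectors, skipping the .bin parsing
    that triggers the 'mismatch between vocab sizes' assertion."""
    # Imported here so the sketch itself has no hard gensim dependency.
    from gensim.models.wrappers import FastText
    return FastText.load_word2vec_format('%s.vec' % path)
```

In-vocabulary lookups on the returned vectors work as usual; only the fastText-specific out-of-vocabulary behavior is lost.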

@tmylk @jayantj The mismatch is not only in the pretrained models released by Facebook; we are getting this error for all models trained by fastText, which was not the case earlier. Something might have changed in fastText. I'm looking into the code to see whether the change is intentional; if so, I'll raise an issue in the fastText repo.

@prakhar2b Sounds good.
Also, the assertion to check for mismatch between vocab sizes between the .vec and .bin file was written as part of a defensive approach to make sure there weren't any "silent" bugs.
In case the mismatch doesn't make an actual difference, and it is possible to proceed with loading the model, changing the assert to a warning log would be a decent solution, IMO.
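A minimal sketch of that suggestion, with the hard assert in `load_dict` downgraded to a warning so loading can proceed. The function and variable names here are illustrative stand-ins, not the actual patch to `fasttext.py`:

```python
import logging

logger = logging.getLogger(__name__)

def check_vocab_sizes(vocab_len, nwords, vocab_size):
    """Warn (instead of raising AssertionError) on a vocab-size mismatch.

    vocab_len mirrors len(self.wv.vocab) from the .vec file; nwords and
    vocab_size are the counts read from the .bin file.
    """
    if not (vocab_len == nwords == vocab_size):
        logger.warning(
            "mismatch between vocab sizes: len(wv.vocab)=%d, nwords=%d, "
            "vocab_size=%d; continuing anyway",
            vocab_len, nwords, vocab_size)
```

With this change, a mismatched model would load with a logged warning rather than an AssertionError, at the cost of hiding any genuinely corrupt file.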

About 8 days ago, fastText added two additional int32_t's, "magic" and "version", to the .bin model header. As of right now, fasttext.py does not account for these integers when reading in the model parameters, causing every subsequent read to be off by two integers (8 bytes). Take a look at the checkModel function in https://github.com/facebookresearch/fastText/blob/master/src/fasttext.cc to see what I'm talking about.

EDIT:
Also important to note is the addition of dictionary pruning, which adds an int64_t (pruneidx_size) after the existing int64_t (ntokens) when reading the dictionary from the .bin file. This is not accounted for in the current version of fasttext.py.
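The 8-byte offset can be demonstrated synthetically with `struct`: pack a fake header with two leading int32s (magic, version) in front of the dictionary counts, then read it both the old way and the new way. The magic/version values below are illustrative placeholders, not verified fastText constants:

```python
import struct

fake_magic, fake_version = 793712314, 11   # placeholder header values
vocab_size, nwords, nlabels = 71290, 71290, 0

# New-format header: magic and version precede the dictionary counts.
header = struct.pack('@5i', fake_magic, fake_version,
                     vocab_size, nwords, nlabels)

# Old reader: starts at offset 0, so the first two ints it sees are
# magic/version, and the "counts" it returns are nonsense.
bad = struct.unpack_from('@3i', header, 0)   # (793712314, 11, 71290)

# Fixed reader: skip the two new int32s (8 bytes) before the counts.
good = struct.unpack_from('@3i', header, 8)  # (71290, 71290, 0)
```

This is the same kind of garbage that shows up when the unmodified wrapper parses a new-format .bin file.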

I have made some quick edits to fasttext.py that make it compatible with the latest version of fastText models. All I did was add additional reads so a plain model without quantization/dictionary pruning can be read. Please note that I didn't do any extensive testing so use at your own risk.

fasttext.zip

@dgg5503 @tmylk @jayantj struct_unpack results for models trained by fastText on the text8 data
(len(self.wv.vocab) from the .vec file; nwords and vocab_size from the .bin via struct.unpack):

parameter          | fastText (old) | fastText (new)
------------------ | -------------- | --------------
len(self.wv.vocab) | 71290          | 71290
nwords             | 71290          | 1058682594
vocab_size         | 71290          | -350469331

I'm not sure why we get a negative value for vocab_size (@jayantj, please comment on this). If this is undesirable, we should report it to the fastText repo.

Also, if this is an intentional mismatch (which it seems to be), then there is no point in keeping an assert or even a warning. I think we should use the .vec and .bin files separately for different purposes, assuming that Facebook's fastText code is working fine. This was also discussed in issue #1261 (improving fastText loading time) when making a comparison with salestock's fastText loading mechanism, which uses only the .bin file for loading.

@prakhar2b Are you sure about this? Looking at those values, it seems very likely to me that we are reading the wrong bytes for the values of nwords and vocab_size.

Also, on a different note: the issue raised about the model trained on French wiki is quite old, before the FastText magic or version variables were added. I believe they are probably two different issues.

UPDATE : I've solved this. I'll submit a final PR asap. Thanks @dgg5503 for the suggestions.
