I am trying to fine-tune a pretrained FastText model using gensim, starting from the weights of the official Facebook implementation. Partial loading works fine, but loading the full model results in an AssertionError.
import gensim
model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)
results in:
AssertionError Traceback (most recent call last)
<ipython-input-16-1896fcc1d1cb> in <module>
----> 1 model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
1012
1013 """
-> 1014 return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
1015
1016 def load_binary_data(self, encoding='utf8'):
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
1246 model_file += '.bin'
1247 with smart_open(model_file, 'rb') as fin:
-> 1248 m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
1249
1250 model = FastText(
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
264 else:
265 hidden_output = _load_matrix(fin, new_format=new_format)
--> 266 assert fin.read() == b'', 'expected to reach EOF'
267
268 model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)
AssertionError: expected to reach EOF
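The assertion fires because, after reading the hidden-output matrix, gensim expects to be at the end of the file; any trailing bytes (for instance from a corrupted or mismatched download) trip it. The guard pattern in isolation looks roughly like this (a standalone sketch with a hypothetical `read_record` helper, not gensim's actual loader):

```python
import io
import struct

def read_record(fin):
    """Read one little-endian float32 triple, then assert EOF --
    mirroring the guard gensim applies after the last matrix."""
    values = struct.unpack('<3f', fin.read(12))
    leftover = fin.read()
    assert leftover == b'', 'expected to reach EOF, %d bytes remain' % len(leftover)
    return values

# All bytes consumed: the assertion passes.
buf = io.BytesIO(struct.pack('<3f', 1.0, 2.0, 3.0))
print(read_record(buf))
```

With even one stray byte appended to the buffer, the same call raises `AssertionError`, which is exactly the failure mode in the traceback above.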
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Linux-4.4.0-139-generic-x86_64-with-debian-stretch-sid
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 1
It still fails with the fasttext.load_facebook_model method; however, it works with the French embeddings:
import gensim
model = gensim.models.fasttext.load_facebook_model('/data/cc.fr.300.bin')
model.wv['test']
# array([ 0.03151339, -0.04408491, ... 0.0188015 , 0.032352 ], dtype=float32)
It also works using the Wikipedia English embeddings (wiki.en.bin).
Does this mean that there is something wrong with the format of cc.en.300.bin?
Thank you for reporting this. Could you provide full URLs to the models you are using, so I can try to reproduce this?
Here are all the models I mentioned:
I think gensim 3.7.2 already fixed this problem. Could you please double-check?
(372.env) mpenkov@hetrad2:~/data/2435$ pip freeze | grep gensim
gensim==3.7.2
(372.env) mpenkov@hetrad2:~/data/2435$ cat bug.py
import gensim.models.fasttext
vector = gensim.models.fasttext.load_facebook_vectors('../cc.en.300.bin')
print(vector)
model = gensim.models.fasttext.load_facebook_model('../cc.en.300.bin')
print(model)
(372.env) mpenkov@hetrad2:~/data/2435$ python bug.py
<gensim.models.keyedvectors.FastTextKeyedVectors object at 0x7f815e2005c0>
FastText(vocab=2000000, size=300, alpha=0.025)
(372.env) mpenkov@hetrad2:~/data/2435$
I tried again with gensim 3.7.2 after redownloading the model file from Facebook's FastText page and it seems to work. The md5 checksums of old and new files are not the same, so I guess a corrupted model was the problem.
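For anyone hitting the same EOF assertion: a quick way to rule out a corrupted download is to compare MD5 checksums of the original and freshly re-downloaded files, as done above. A minimal sketch (the paths are placeholders for your local copies):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks
    so multi-gigabyte model files don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: compare the old and re-downloaded model files.
# md5sum('cc.en.300.bin.old') != md5sum('cc.en.300.bin')  => corruption likely
```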