I am trying to fine-tune a pretrained FastText model using gensim, starting from the weights of the official Facebook implementation. Partial loading works fine, but loading the full model results in an AssertionError.
import gensim
model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)
results in:
AssertionError Traceback (most recent call last)
<ipython-input-16-1896fcc1d1cb> in <module>
----> 1 model = gensim.models.FastText.load_fasttext_format('cc.en.300.bin', full_model=True)
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in load_fasttext_format(cls, model_file, encoding, full_model)
1012
1013 """
-> 1014 return _load_fasttext_format(model_file, encoding=encoding, full_model=full_model)
1015
1016 def load_binary_data(self, encoding='utf8'):
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/fasttext.py in _load_fasttext_format(model_file, encoding, full_model)
1246 model_file += '.bin'
1247 with smart_open(model_file, 'rb') as fin:
-> 1248 m = gensim.models._fasttext_bin.load(fin, encoding=encoding, full_model=full_model)
1249
1250 model = FastText(
~/.anaconda3/envs/qs3.6/lib/python3.6/site-packages/gensim/models/_fasttext_bin.py in load(fin, encoding, full_model)
264 else:
265 hidden_output = _load_matrix(fin, new_format=new_format)
--> 266 assert fin.read() == b'', 'expected to reach EOF'
267
268 model.update(vectors_ngrams=vectors_ngrams, hidden_output=hidden_output)
AssertionError: expected to reach EOF
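The assertion fires because, after reading the hidden-output matrix, gensim expects to be at the end of the file; any trailing bytes (for instance from a corrupted or mismatched download) trip it. The guard pattern in isolation looks roughly like this (a standalone sketch with a hypothetical `read_record` helper, not gensim's actual loader):

```python
import io
import struct

def read_record(fin):
    """Read one little-endian float32 triple, then assert EOF --
    mirroring the guard gensim applies after the last matrix."""
    values = struct.unpack('<3f', fin.read(12))
    leftover = fin.read()
    assert leftover == b'', 'expected to reach EOF, %d bytes remain' % len(leftover)
    return values

# All bytes consumed: the assertion passes.
buf = io.BytesIO(struct.pack('<3f', 1.0, 2.0, 3.0))
print(read_record(buf))
```

With even one stray byte appended to the buffer, the same call raises `AssertionError`, which is exactly the failure mode in the traceback above.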
Please provide the output of:
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import gensim; print("gensim", gensim.__version__)
from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
Linux-4.4.0-139-generic-x86_64-with-debian-stretch-sid
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0]
NumPy 1.16.2
SciPy 1.2.1
gensim 3.7.1
FAST_VERSION 1
It still fails with the fasttext.load_facebook_model method; however, it works with the French embeddings:
import gensim
model = gensim.models.fasttext.load_facebook_model('/data/cc.fr.300.bin')
model.wv['test']
# array([ 0.03151339, -0.04408491, ... 0.0188015 , 0.032352 ], dtype=float32)
It also works using the Wikipedia English embeddings (wiki.en.bin).
Does this mean that there is something wrong with the format of cc.en.300.bin?
Thank you for reporting this. Could you provide full URLs to the models you are using, so I can try to reproduce this?
Here are all the models I mentioned:
I think gensim 3.7.2 already fixed this problem. Could you please double-check?
(372.env) mpenkov@hetrad2:~/data/2435$ pip freeze | grep gensim
gensim==3.7.2
(372.env) mpenkov@hetrad2:~/data/2435$ cat bug.py
import gensim.models.fasttext
vector = gensim.models.fasttext.load_facebook_vectors('../cc.en.300.bin')
print(vector)
model = gensim.models.fasttext.load_facebook_model('../cc.en.300.bin')
print(model)
(372.env) mpenkov@hetrad2:~/data/2435$ python bug.py
<gensim.models.keyedvectors.FastTextKeyedVectors object at 0x7f815e2005c0>
FastText(vocab=2000000, size=300, alpha=0.025)
(372.env) mpenkov@hetrad2:~/data/2435$
I tried again with gensim 3.7.2 after redownloading the model file from Facebook's FastText page and it seems to work. The md5 checksums of old and new files are not the same, so I guess a corrupted model was the problem.
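For anyone hitting the same EOF assertion: a quick way to rule out a corrupted download is to compare MD5 checksums of the original and freshly re-downloaded files, as done above. A minimal sketch (the paths are placeholders for your local copies):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in 1 MiB chunks
    so multi-gigabyte model files don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as fin:
        for chunk in iter(lambda: fin.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical paths: compare the old and re-downloaded model files.
# md5sum('cc.en.300.bin.old') != md5sum('cc.en.300.bin')  => corruption likely
```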