Gensim: Gensim error when loading French FastText

Created on 23 Mar 2017 · 14 Comments · Source: RaRe-Technologies/gensim

Hello,

I'm trying to use the FastText wrapper to load the French model (wiki.fr, downloaded in the steps to reproduce further down). Unfortunately, I get the following error:

Traceback (most recent call last):
  File "app.py", line 18, in <module>
    model = FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 238, in load_fasttext_format
    model.load_binary_data('%s.bin' % model_file)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
    self.load_dict(f)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 277, in load_dict
    assert len(self.wv.vocab) == vocab_size, 'mismatch between vocab sizes'
AssertionError: mismatch between vocab sizes

I'm using the following environment:

>>> import platform; print(platform.platform())
Darwin-16.4.0-x86_64-i386-64bit
>>> import sys; print("Python", sys.version)
('Python', '2.7.13 (default, Dec 28 2016, 14:29:07) \n[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)]')
>>> import numpy; print("NumPy", numpy.__version__)
('NumPy', '1.12.0')
>>> import scipy; print("SciPy", scipy.__version__)
('SciPy', '0.19.0')
>>> import gensim; print("gensim", gensim.__version__)
('gensim', '1.0.1')
>>> from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
('FAST_VERSION', 0)

Steps to reproduce the error:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

I don't know whether this is a bug in gensim or a problem with the model itself. Any help would be appreciated.

Thanks in advance.

bug difficulty medium

All 14 comments

Try the develop branch of Gensim, I think #1189 has something to do with your problem.

Unfortunately, I get the exact same error. Here are the steps I followed:

cd /tmp
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.fr.zip
unzip wiki.fr.zip
pip uninstall gensim
git clone https://github.com/RaRe-Technologies/gensim
PYTHONPATH="/tmp/gensim" python -c "import os;from gensim.models.wrappers import FastText;FastText.load_fasttext_format(os.path.abspath('./wiki.fr'))"

Thanks for reporting. The error is different from ValueError: invalid vector on line 12898 fixed by @jayantj in #1189.

It might be incidentally fixed in the #1214 branch; you are welcome to clone that code and try it.

It would be easier to fix if there was some smaller model to reproduce... Unfortunately the download takes many hours.

I tried the same steps as before, this time cloning "https://github.com/jaksmid/gensim" instead, and I still get the exact same error :(

Can you partially load the model with model = FastText.load_word2vec_format('FILENAME.vec')?

The failing part is model.load_binary_data('FILENAME.bin') but you might not need that, depending on your use case.
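To illustrate what the partial load gives you, here is a minimal sketch that reads a .vec file in plain Python, without gensim. It assumes the usual word2vec text layout (a header line with the vocabulary size and dimension, then one word and its vector per line). The name load_vec and its limit parameter are hypothetical, chosen only for this example. Note that a .vec file carries whole-word vectors only, so anything that relies on the subword data in the .bin file would not be available this way.

```python
import io


def load_vec(path, limit=None):
    """Read a word2vec-style text file into a {word: [floats]} dict.

    Hypothetical helper for illustration; not part of gensim.
    Returns (declared_vocab_size, dimension, vectors).
    """
    vectors = {}
    with io.open(path, encoding="utf-8", errors="replace") as f:
        # Header line: "<vocab_size> <dimension>"
        declared_size, dim = (int(x) for x in f.readline().split())
        for line in f:
            if limit is not None and len(vectors) >= limit:
                break
            if not line.strip():
                continue  # skip blank lines
            parts = line.rstrip().split(" ")
            # The last `dim` fields are the vector; whatever precedes is the word.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(x) for x in parts[-dim:]]
    return declared_size, dim, vectors
```

Passing a small limit (for example, load_vec('wiki.fr.vec', limit=1000)) is a quick way to inspect the file without reading all of it into memory.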

I've managed to download the model, looking into the bug.

Thanks for looking into this @jayantj . I will make a new release after this is fixed.

@tmylk your suggestion to use FastText.load_word2vec_format('FILENAME.vec') works.

There is a mismatch in vocab between the .bin and .vec files. We should raise it with the FastText project, which created the model. CC @prakhar2b
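One way to look at the .vec side of such a mismatch is to compare the vocabulary size declared in the file's header with the number of rows and the number of distinct words it actually contains. The sketch below is only a diagnostic under that assumption: it does not read the .bin file, and vec_vocab_stats is a hypothetical name, not a gensim function.

```python
import io


def vec_vocab_stats(path):
    """Compare the declared vocab size of a .vec file with what it contains.

    Hypothetical diagnostic helper; inspects the text file only.
    """
    seen = set()
    rows = 0
    with io.open(path, encoding="utf-8", errors="replace") as f:
        # Header line: "<vocab_size> <dimension>"
        declared, dim = (int(x) for x in f.readline().split())
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            rows += 1
            parts = line.rstrip().split(" ")
            # Everything before the last `dim` fields is treated as the word.
            seen.add(" ".join(parts[:-dim]))
    return {"declared": declared, "rows": rows, "unique": len(seen)}
```

If "unique" comes out smaller than "declared", several rows share the same word after parsing, which is one way a loader could end up with fewer vocabulary entries than the count stored elsewhere.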

Thanks for the update!

Has this issue been resolved?
If yes, can you please share the reference?

@kewlcoder replied to the same question in #1301

The problem with loading the French wiki model is most likely due to a FastText bug, reported here: https://github.com/facebookresearch/fastText/issues/218

The issue with loading the latest FastText models (including the Hebrew model) is due to a change in the way the new models are stored, and will be fixed in #1319
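As a rough way to tell the two storage layouts apart, the sketch below checks whether a .bin file starts with a magic number. This rests on an assumption not stated in this thread: that newer FastText binaries begin with the 32-bit little-endian integer 793712314 followed by a version number, while older files start directly with the model arguments. The helper name has_magic_header is hypothetical.

```python
import struct

# Assumed magic number at the start of newer FastText .bin files.
FASTTEXT_MAGIC = 793712314


def has_magic_header(path):
    """Return True if the file starts with the assumed FastText magic number."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return False  # too short to contain a header
    (value,) = struct.unpack("<i", head)  # little-endian int32
    return value == FASTTEXT_MAGIC
```

A True result would suggest the newer layout; a False result does not by itself prove the file is a valid older model.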

Fixed in #1341 & #1319
