Fasttext: load_word2vec_format Error

Created on 16 Mar 2017 · 13 Comments · Source: facebookresearch/fastText

Hi, I wonder whether this error comes from your pre-trained file, and what I can do about it.
I'm using this code repo to create an embedding file from your pre-trained word vector .bin (in Hebrew) and my dictionary file.

Even if I change the line to
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, encoding='utf-8', unicode_errors='ignore')

I always get this error:

INFO:gensim.models.keyedvectors:loading projection weights from word_emb/wiki.he.bin
Traceback (most recent call last):
  File "convert-wordemb-dict2emb-matrix.py", line 128, in <module>
    embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, encoding='utf-8', unicode_errors='ignore')
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 192, in load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 231, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte

I would really appreciate your help, thanks.
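
The traceback shows gensim failing while it decodes the first line of the file as a text header. A word2vec-format file, text or binary, begins with an ASCII line holding the vocabulary size and the vector dimension, so checking that first line is a quick way to tell whether load_word2vec_format can read a given file. A minimal sketch; the function name has_word2vec_header is hypothetical and not part of gensim:

```python
def has_word2vec_header(path, max_header_bytes=64):
    """Return True if the file starts with a 'vocab_size vector_size' text line."""
    with open(path, "rb") as fin:
        first_line = fin.readline(max_header_bytes)
    try:
        fields = first_line.decode("ascii").split()
    except UnicodeDecodeError:
        # Non-ASCII bytes in the first line: this is not a word2vec header.
        return False
    return len(fields) == 2 and all(field.isdigit() for field in fields)
```

If this returns False for a .bin file, changing encoding or unicode_errors is unlikely to help, because the file is probably not in word2vec format at all.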

All 13 comments

Hi, @dimeldo
I had the same error on fastText pretrained word vector .bin in Russian.

I used the other file in the fastText pack instead, wiki.ru.vec in my case, with the flag binary=False (since it's not a binary file), and the pretrained word vectors loaded successfully.
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName, binary=False)

Though it's not an exact solution to the problem with the .bin file, I hope it helps.

@dimeldo
Hi again!
That option was too slow to use regularly, so I loaded the vectors once, saved them in binary format, and voila, it works:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName, binary=False)
embedding_dict.save_word2vec_format(dictFileName+".bin", binary=True)
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName+".bin", binary=True)

That's my easy way to get a binary model that loads without errors.
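
The load, save, reload idea above can be wrapped so that the slow text parse happens only on the first run. A minimal sketch, assuming gensim is installed; binary_cache_path and load_vectors_cached are hypothetical helper names, and the cached file is a plain word2vec binary with no subword information:

```python
import os


def binary_cache_path(vec_path):
    """Path of the word2vec-binary cache that sits next to the .vec file."""
    return vec_path + ".bin"


def load_vectors_cached(vec_path):
    """Load vectors from the binary cache, building it from the .vec file if missing."""
    # Imported here so the path helper above can be used without gensim.
    from gensim.models import KeyedVectors

    bin_path = binary_cache_path(vec_path)
    if not os.path.exists(bin_path):
        # Slow path: parse the text file once and write the binary cache.
        vectors = KeyedVectors.load_word2vec_format(vec_path, binary=False)
        vectors.save_word2vec_format(bin_path, binary=True)
        return vectors
    # Fast path: the cache already exists.
    return KeyedVectors.load_word2vec_format(bin_path, binary=True)
```

Because the cache name is built from the .vec name (wiki.ru.vec becomes wiki.ru.vec.bin), it does not overwrite the native fastText wiki.ru.bin in the same directory.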

That's cool. I tried the method you described in your first reply, at the suggestion of others, and it worked quite well for me. I'm sure your other method will be valuable to other people facing the same problem.

Thanks!

Hi @dimeldo and @annasandreeva,

The fastText binary format is different from the word2vec binary format used by Gensim (hence the error when trying to load the fastText binary file using Gensim). This is due to the fact that the fastText binary file also contains information from subword units, which can be used to compute word vectors for out-of-vocabulary words, by using

$ echo "some oov words" | ./fasttext print-vectors model.bin

Best
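
To illustrate what "subword units" means here: fastText represents a word by its character n-grams, taken after wrapping the word in < and > boundary markers, so an unseen word still shares n-grams with known words. A simplified sketch of the n-gram extraction only; the function name is hypothetical, the 3 to 6 length range is assumed to mirror fastText's defaults, and the real implementation also hashes the n-grams into buckets and combines their vectors:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# N-grams that an out-of-vocabulary word shares with an in-vocabulary one.
shared = set(char_ngrams("nighto")) & set(char_ngrams("night"))
print(sorted(shared))
```

The shared n-grams are what allow the native model to build a vector for a word it never saw, which a plain word2vec file cannot do.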

I got the same error whether "binary" was set to True or False.

os: ubuntu14.04

python2.7 or python 3.5

from gensim.models.keyedvectors import KeyedVectors
word2vec_model_path = "./wiki.zh.bin"
word_vectors = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False)

But using the fasttext command itself works fine.

Thanks

I came across the same error.
Is there a way to use the .bin model trained by fastText directly in gensim? Thank you.

@lxw0109

you can try the solution from @annasandreeva

Hello,

I tried the solution proposed by @annasandreeva.

import gensim

vec_path = "/project/6008168/tamouze/Python_directory/dataset/wiki.en.vec"
bin_path = "/project/6008168/tamouze/Python_directory/dataset/saved_model_gensim" + ".bin"
# Load the text-format vectors once, then cache them in word2vec binary format.
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(vec_path, binary=False)
embedding_dict.save_word2vec_format(bin_path, binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format(bin_path, binary=True)
# Membership tests: 'night' and 'nights' are in the vocabulary, 'nighto' is not.
print('night' in model)
print('nights' in model)
print('nighto' in model)
# This lookup raises KeyError because 'nighto' is not in the vocabulary.
print(model['nighto'])
print('end')

This all works, but I have a question:

The two words 'night' and 'nights' are in the vocab, but the word 'nighto' is not. When I try to look up its vector, an exception is raised because the word is not in the vocab. How can I get a vector for it anyway?

Thank you

@TamouzeAssi
As far as I can tell, when gensim is used to "load, save, then load again" the fastText vectors, OOV words will NOT work in the resulting model.
I was actually using the pyfasttext package, which works well for me (and loading the model with pyfasttext is much faster than with gensim).
You can refer to my code (some comments are in Chinese; if you don't read Chinese, just ignore the link).

@lxw0109 thank you. I will use your work, but can it load a pretrained model such as wiki.en generated by the original fastText package?

@TamouzeAssi I did NOT test using pyfasttext to load a pretrained model generated by the fastText package, but pyfasttext DOES support loading models trained by the fasttext command-line tool and by pyfasttext itself, whereas the fastText package does NOT support loading models trained by the fasttext command-line tool (so I had to give up on the fastText package).
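
For readers on a more recent gensim: gensim 3.8 and later are understood to provide gensim.models.fasttext.load_facebook_vectors, which reads the native fastText .bin (subword information included) instead of the word2vec format. A minimal sketch under that version assumption; load_native_fasttext and vector_or_none are hypothetical helper names:

```python
def load_native_fasttext(bin_path):
    """Load a native fastText .bin file, keeping its subword information."""
    # Assumes gensim >= 3.8, where load_facebook_vectors is available.
    from gensim.models.fasttext import load_facebook_vectors

    return load_facebook_vectors(bin_path)


def vector_or_none(vectors, word):
    """Return vectors[word], or None when the lookup raises KeyError."""
    try:
        return vectors[word]
    except KeyError:
        return None
```

With vectors loaded from a converted word2vec file, vector_or_none(model, 'nighto') returns None instead of raising; with vectors from load_native_fasttext, the same lookup can return a vector built from the word's character n-grams.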

This worked for me

KeyedVectors.load_word2vec_format(binary_file_path,
binary=True, encoding='utf-8', unicode_errors='ignore')

@girijaravishankar I am still having the error.
'utf8' codec can't decode byte 0xba in position 0: invalid start byte
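
A byte 0xba at position 0 is consistent with a native fastText model rather than a word2vec file. Newer fastText binaries are believed to begin with the little-endian 32-bit magic number 793712314 (0x2F4F16BA), whose first byte on disk is 0xBA. A hedged sketch of that check; the constant is an assumption recalled from the fastText source and should be verified against it, and the function name is hypothetical:

```python
import struct

# Assumed value of fastText's file-format magic number (0x2F4F16BA).
FASTTEXT_MAGIC = 793712314


def starts_with_fasttext_magic(path):
    """Return True if the first four bytes match the assumed fastText magic number."""
    with open(path, "rb") as fin:
        head = fin.read(4)
    if len(head) < 4:
        return False
    # Interpret the bytes as a little-endian signed 32-bit integer.
    (value,) = struct.unpack("<i", head)
    return value == FASTTEXT_MAGIC
```

If this returns True, load_word2vec_format cannot parse the file whatever encoding arguments are passed; the file needs a loader that understands the native fastText format.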
