Fasttext: load_word2vec_format Error

Created on 16 Mar 2017 · 13 Comments · Source: facebookresearch/fastText

Hi, I wonder whether this error comes from your pre-trained file, and what I can do about it.
I'm using this code repo to create an embedding file from your pre-trained word vector .bin (in Hebrew) and my dictionary file.

Even if I change the line to
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, encoding='utf-8', unicode_errors='ignore')

I always get this error:

INFO:gensim.models.keyedvectors:loading projection weights from word_emb/wiki.he.bin
Traceback (most recent call last):
  File "convert-wordemb-dict2emb-matrix.py", line 128, in <module>
    embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, encoding='utf-8', unicode_errors='ignore')
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 192, in load_word2vec_format
    header = utils.to_unicode(fin.readline(), encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 231, in any2unicode
    return unicode(text, encoding, errors=errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte

I would really appreciate your help, thanks.

dimeldo

Most helpful comment

@dimeldo
Hi again!
That option was too slow to use regularly, so I loaded the vectors once, saved them in binary format, and voila, it's working:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName, binary=False)
embedding_dict.save_word2vec_format(dictFileName+".bin", binary=True)
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName+".bin", binary=True)

That's my easy way to get a binary model that loads without errors.

annasandreeva on 15 Apr 2017

👍19 ❤6 🎉1
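The convert-once approach described above can be wrapped in a small helper so the slow text parse only happens on the first run. This is a minimal sketch, assuming gensim is installed and the input is the text-format .vec file; the names `binary_cache_path` and `load_vectors_cached` are hypothetical and not part of gensim.

```python
import os


def binary_cache_path(vec_path):
    """Return the path used for the word2vec-binary copy of a .vec file."""
    return vec_path + ".bin"


def load_vectors_cached(vec_path):
    """Load text-format vectors, caching a word2vec-binary copy for later runs.

    Assumes gensim is installed; it is imported lazily so the path helper
    above can be used without it.
    """
    from gensim.models import KeyedVectors

    cache_path = binary_cache_path(vec_path)
    if os.path.exists(cache_path):
        # Fast path: reuse the binary copy written on an earlier run.
        return KeyedVectors.load_word2vec_format(cache_path, binary=True)

    # Slow path: parse the text file once, then write the binary copy.
    vectors = KeyedVectors.load_word2vec_format(vec_path, binary=False)
    vectors.save_word2vec_format(cache_path, binary=True)
    return vectors
```

Appending ".bin" to the full .vec name (for example wiki.he.vec.bin) keeps the cached file from overwriting fastText's own wiki.he.bin, which is in a different format.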

All 13 comments

Hi, @dimeldo
I had the same error on fastText pretrained word vector .bin in Russian.

I used the other file in the fastText pack, wiki.ru.vec, with the flag binary=False (since it's not a binary file), and the pretrained word vectors were successfully loaded.
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(dictFileName, binary=False)

Although this doesn't solve the problem with the .bin file itself, I hope it helps.

annasandreeva on 4 Apr 2017

👍3

That's cool. I tried the method you described in your first reply, at the suggestion of others, and it worked quite well for me. But I'm sure your other method will be valuable to other people facing the same problem.

Thanks!

dimeldo on 15 Apr 2017

Hi @dimeldo and @annasandreeva,

The fastText binary format is different from the word2vec binary format used by Gensim, hence the error when loading a fastText binary file with Gensim. The fastText binary file also contains subword information, which can be used to compute word vectors for out-of-vocabulary words:

$ echo "some oov words" | ./fasttext print-vectors model.bin

Best

EdouardGrave on 18 Apr 2017

👍2
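The traceback in the original post fails while decoding the first line of the file as a header. A word2vec-format file begins with a text line holding the vocabulary size and vector size, so one quick check is whether that line parses. The sketch below uses only the standard library; `parse_word2vec_header` is a hypothetical helper name, and the 1024-byte read limit is an arbitrary choice to avoid reading a large binary file that has no early newline.

```python
def parse_word2vec_header(path):
    """Return (vocab_size, vector_size) if the file starts with a word2vec
    header line, or None if the first line is not such a header."""
    with open(path, "rb") as fin:
        # Cap the read: a non-word2vec binary may have no newline for a long time.
        first_line = fin.readline(1024)
    try:
        fields = first_line.decode("utf-8").split()
        vocab_size, vector_size = (int(field) for field in fields)
    except ValueError:
        # Covers UnicodeDecodeError (a ValueError subclass), non-integer
        # fields, and a wrong number of fields.
        return None
    return vocab_size, vector_size
```

If this returns None for a .bin file, the file is probably not in word2vec format, and passing unicode_errors='ignore' will not make it loadable with load_word2vec_format.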

I had the same error whether "binary" was set to True or False.

os: ubuntu14.04

python2.7 or python 3.5

from gensim.models.keyedvectors import KeyedVectors
word2vec_model_path = "./wiki.zh.bin"
word_vectors = KeyedVectors.load_word2vec_format(word2vec_model_path, binary=False)

But using the fastText command-line tool works fine.

Thanks

JinmingZhao on 9 Aug 2017

I came across the same error.
Is there a way to directly use a .bin model trained by fastText in gensim? Thank you

lxw0109 on 1 Jan 2018
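Later gensim releases added a loader for fastText's own binary format, which addresses this question directly. The sketch below assumes gensim 3.8 or newer, where `gensim.models.fasttext.load_facebook_vectors` is available; the helper names `choose_loader` and `load_fasttext_vectors` are hypothetical, and treating every .bin file as a native fastText binary is a heuristic that will misfire on word2vec-format .bin files.

```python
import os


def choose_loader(path):
    """Pick a loading strategy from the file extension (a heuristic)."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".vec":
        return "word2vec_text"
    if extension == ".bin":
        # Assumption: .bin here means fastText's native binary format.
        return "fasttext_binary"
    raise ValueError("unrecognized extension: %r" % extension)


def load_fasttext_vectors(path):
    """Load fastText vectors with the loader matching the file type."""
    kind = choose_loader(path)
    if kind == "word2vec_text":
        from gensim.models import KeyedVectors
        return KeyedVectors.load_word2vec_format(path, binary=False)
    # Assumption: gensim 3.8 or later, which provides load_facebook_vectors.
    from gensim.models.fasttext import load_facebook_vectors
    return load_facebook_vectors(path)
```

Vectors loaded from the native .bin keep the subword information, so lookups for out-of-vocabulary words can still return a vector.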

@lxw0109

you can try the solution from @annasandreeva

twmht on 16 Mar 2018

Hello,

I tried the solution proposed by @annasandreeva.

import gensim

vec_path = "/project/6008168/tamouze/Python_directory/dataset/wiki.en.vec"
bin_path = "/project/6008168/tamouze/Python_directory/dataset/saved_model_gensim.bin"

# Convert the text vectors to word2vec binary format once, then reload them.
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(vec_path, binary=False)
embedding_dict.save_word2vec_format(bin_path, binary=True)
model = gensim.models.KeyedVectors.load_word2vec_format(bin_path, binary=True)

print('night' in model)   # True
print('nights' in model)  # True
print('nighto' in model)  # False
print(model['nighto'])    # raises KeyError: 'nighto' is not in the vocabulary
print('end')

Everything works, but I have a question:

The two words 'night' and 'nights' are in the vocab, but the word 'nighto' is not. When I try to look up its vector, an exception is raised because the word is not in the vocab. How can I get a vector for it?

Thank you

ali3assi on 15 Apr 2018

👍1

@TamouzeAssi
As far as I can tell, when you use gensim to "load, then save, then load again" the fastText vectors, OOV words will NOT work in the model generated by gensim.
I actually used the pyfasttext package, which works well for me (and loading the model with pyfasttext is much faster than with gensim).
You can refer to my code (some comments are in Chinese; if you don't read Chinese, just ignore the link).

lxw0109 on 16 Apr 2018
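The reason OOV lookups fail after the round trip is that a word2vec-format file stores one vector per vocabulary word and nothing else, while fastText builds vectors for unseen words from character n-grams stored in its own .bin file. The sketch below only illustrates the n-gram extraction step. The function name `char_ngrams` is hypothetical, the lengths 3 and 6 are assumed to match fastText's default minn and maxn, and the hashing of n-grams into buckets and the combining of their vectors are not shown.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>'."""
    wrapped = "<" + word + ">"
    ngrams = []
    for start in range(len(wrapped)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(wrapped):
                ngrams.append(wrapped[start:start + n])
    return ngrams


# 'nighto' is not a vocabulary word, but it shares n-grams with 'night'.
shared = set(char_ngrams("night")) & set(char_ngrams("nighto"))
print(sorted(shared))
```

Because 'nighto' shares several n-grams with 'night', a model that keeps the subword vectors can produce a vector for it, whereas the converted word2vec file cannot.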

@lxw0109 thank you. I will use your work, but can it load a pretrained model such as wiki.en generated by the original fastText package?

ali3assi on 16 Apr 2018

@TamouzeAssi I did NOT test loading the pretrained models from the fastText package with pyfasttext, but pyfasttext DOES support loading models trained by the fasttext command-line tool and by pyfasttext itself, whereas the fastText Python package does NOT support loading models trained by the command-line tool (so I had to give up on the fastText package).

lxw0109 on 16 Apr 2018

This worked for me

KeyedVectors.load_word2vec_format(binary_file_path,
binary=True, encoding='utf-8', unicode_errors='ignore')

girijaravishankar on 24 Sep 2018

@girijaravishankar I am still having the error.
'utf8' codec can't decode byte 0xba in position 0: invalid start byte

getamu on 11 Oct 2018

😕1
