Facebook recently open-sourced fastText (https://github.com/facebookresearch/fastText), which improves on the word2vec skip-gram model. It uses a similar output format for word-vector key/value pairs, and the similarity calculation is about the same too, but its binary output format differs from the C version's word2vec binary format. Do we want to support loading fastText model output in gensim? Thanks.
Definitely! Reading/writing the fastText word-vector format (the .vec, and perhaps consulting part of the .bin) would be an obvious first step.
(As a later step, the .bin might include enough extra info for models to continue training… though their classification modes might not map directly to the existing gensim output-layer models.)
It appears the .vec output of fastText is already compatible with the original word2vec.c text format, and readable in gensim by load_word2vec_format(filename, binary=False).
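To illustrate the compatibility, the .vec layout is the plain word2vec text format: a header line with vocab size and dimensionality, then one word plus its vector per line. A minimal stdlib-only sketch of parsing it (gensim's load_word2vec_format does the same, with more validation and a numpy-backed result):

```python
import io

def parse_vec(text):
    """Parse the word2vec/fastText text (.vec) format:
    header 'vocab_size dim', then 'word v1 v2 ... v_dim' per line."""
    lines = io.StringIO(text)
    vocab_size, dim = map(int, lines.readline().split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim, "dimensionality mismatch"
        vectors[word] = values
    assert len(vectors) == vocab_size, "vocab size mismatch"
    return vectors

# Tiny hand-made example "file": 2 words, 3 dimensions.
sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"
vecs = parse_vec(sample)
```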
The .bin output, written in parallel (rather than as an alternative format like in word2vec.c), seems to have extra info – such as the vectors for char-ngrams – that wouldn't map directly into gensim models unless/until they're extended with new features. So supporting the load of such info isn't a mere matter of format understanding/translation.
@gojomo thanks. I can confirm the .vec format is compatible with gensim.
See FastText comparison notebook in https://github.com/RaRe-Technologies/gensim/pull/815
@tmylk - we may want to keep this open for the larger issue of doing something with the .bin output. We might be able to map its weights (and word-frequency info) into gensim's objects, to support continued training, as a small translation-of-values patch.
Loading the buckets-of-subword-vectors, and making them usable for OOV prediction, would require a bit more actual functionality... but would still be practical. Maybe at first, the subword-buckets wind up in a different class – even perhaps a KeyedVectors variant/sibling - which would offer both subword vector lookup, by hashed key, and word-vector-reconstruction (incl. OOV words), by composition of subword vectors. (cc @droudy)
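To make the idea concrete, here is a toy sketch (not gensim's actual API; the `ngram_vectors` table is hypothetical, standing in for the buckets-of-subword-vectors) of reconstructing a vector for an OOV word by composing subword vectors:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """fastText-style char ngrams, with '<'/'>' boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known char-ngrams.
    `ngram_vectors` is a plain dict here, purely for clarity."""
    hits = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]
```

fastText itself looks ngrams up by hashed bucket rather than by string key, but the composition step (averaging the subword rows) is the same idea.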
FastText just published pre-trained word vectors for 90 languages trained on Wikipedia. I am trying to load the Spanish, Basque or English models with gensim=1.0.0 and the method FastText.load_fasttext_format but I have the following error:
File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
self.load_dict(f)
File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 282, in load_dict
char = char.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Should I use some other method?
cc @jayantj @tmylk We should investigate these new pre-trained models and why they fail, and also add these model files to the tests.
Thanks for reporting this issue. At first glance, it seems like the code makes an assumption that the characters constituting the vocab words can be decoded as ascii. That would be a dangerous assumption to make. Looking into it further.
And yes, adding these model files (or maybe simply models with non-ascii characters, and possibly even utf-16/utf-32 characters) to tests would be a good idea. Will do as soon as I get to the root of this issue.
Yes, confirming that this is the issue. I'll push a fix for this asap.
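For context on the failure mode: the loader read vocab words one byte at a time and decoded each byte individually, which cannot work for multi-byte UTF-8 characters (0xc3 and 0xe2 above are lead bytes of 2- and 3-byte sequences). A sketch of the general fix, accumulating raw bytes up to the NUL terminator and decoding once (illustrative, not the exact patch):

```python
import io

def read_vocab_word(f):
    """Read a NUL-terminated string from a binary stream, decoding the
    accumulated bytes in one go instead of per byte."""
    buf = bytearray()
    while True:
        ch = f.read(1)
        if ch in (b"\x00", b""):  # terminator or EOF
            break
        buf.extend(ch)
    return buf.decode("utf-8")

# 'é' is two bytes in UTF-8, so per-byte decoding would fail here.
word = read_vocab_word(io.BytesIO("café\x00next".encode("utf-8")))
```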
Fix pushed as part of #1176
I haven't been able to test loading the new pre-trained models yet, since they are rather large (~10 GB) and the download is taking forever.
I've just tested @jayantj fix with Spanish and Basque models and they are properly loaded. Thanks for the quick fix!!
I tried the pre-trained English model, with the following command:
fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
I get the following error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-3-e2cc3eaf9300> in <module>()
1 # We use the FastText wrapper from Gensim.
2 # Download the vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
----> 3 fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
4
5 # Alternatively:
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
280 word = ''
281 char, = self.struct_unpack(file_handle, '@c')
--> 282 char = char.decode()
283 # Read vocab word
284 while char != '\x00':
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
This is with Gensim 1.0.0, freshly installed from PyPi, on OS X. The following code did work (but doesn't load the binary file):
fasttext = Word2Vec.load_word2vec_format('/Users/Emiel/Downloads/wiki.en/wiki.en.vec', binary=False)
Does @jayantj's fix also solve this issue? If so, should I install Gensim from GitHub, or will the patch soon also be on PyPi?
Please install from github for now
Fixed in gensim 1.0.1 available on PyPI
Fixed in #1176
Hi,
Not sure if this is the right place to ask, but here goes.
Does a trained fastText .bin file contain ngram vectors of sizes [3-6] only, or ngram vectors of all sizes? Upon loading the model with gensim and iterating through the ngrams, I found that ngrams of all sizes are present.
If ngrams of all sizes are present, my other doubt is which ngrams are used to build the vector of an out-of-vocabulary word: those of sizes [3-6], or all of them?
Hi @already-taken-m17
It depends on the hyperparameters the model was trained with - the default values of min_n and max_n are 3 and 6.
Which model are you loading, and how exactly are you iterating through ngrams?
For out-of-vocabulary words, again, ngrams of sizes [min_n, max_n] are used (and only those ngrams that were present in the ngram vocabulary of the training data).
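As a concrete illustration of that [min_n, max_n] range, a stdlib-only sketch of the char-ngrams extracted for one word (with the '<'/'>' boundary markers fastText adds):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character ngrams of the bracketed word, sizes min_n..max_n."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams("where")
# "<where>" yields "<wh", "whe", "her", ..., "<where", "where>";
# every ngram length stays within [3, 6].
```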
Hi @jayantj , thanks for the reply.
I trained the model with the original C++ fastText implementation (fastText GitHub repo).
I used following command to train:
./fasttext skipgram -input data.txt -output model
This should use the default parameters and store ngrams of sizes [3,6].
For iterating through ngrams, I am using model.wv.ngrams after loading the model using fasttext wrapper of gensim.
@jayantj I'm getting the same error when trying to load fasttext pre-trained "wiki.he.bin" using this command:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, unicode_errors='ignore')
I'm getting this error:
return unicode(text, encoding, errors=errors)
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte
@jayantj @tmylk we have a new 10TB disk on h2 -- feel free to download these "large fastText files" there, for testing.
The .bin file is not in word2vec binary format. Use the .vec file and load it with the flag binary=False. Or use the FastText wrapper:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format(filename_without_extension)
According to the source, this is shorthand for:
from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('FILENAME.vec')
model.load_binary_data('FILENAME.bin')
The binary data is specific to the FastText algorithm.
Edit: in response to @dimeldo.
@evanmiltenburg will definitely try that, thanks.
Edit: It worked well, thanks.
Thanks for the update, good to hear it worked.
I tried the load_fasttext_format function, which takes forever to read the .vec file. fastText itself manages to load all vectors by reading just the .bin file, which is much faster. Would it be possible to skip reading the .vec file when a .bin file is present? I'm only asking because I'd prefer to use gensim for both word2vec and fastText instead of adopting another library. Thanks.
@tmylk @jayantj how does SaleStock load its data? We definitely don't want to be slower than other Python tools/wrappers.
SaleStock is using C++ code closer to original FastText. Created #1261 wishlist issue.
If it's really that annoyingly slow, we could read the code in C (Cython-compiled) -- seems easy enough.
@piskvorky @tmylk Cythonizing would be fine, but as of now, we first read info from the .vec file and then, while reading the .bin file, use an assert statement to confirm there is no mismatch with the info obtained from the .vec file. This seems unnecessary.
Yes, it should be possible to load the model only from the .bin file without having to read the .vec file.
As of now, the .vec file is used to initialize the KeyedVectors instance, including the vocabulary and the word vectors for in-vocab words.
The .bin file contains the in-vocab words too (loaded in FastText.load_dict); however, their word vectors would have to be initialized from the char-ngram vectors, as they are not directly present in the .bin file.
Changing this would require non-trivial changes to the FastText class. It could be useful to do some quick profiling to see whether loading the .vec file takes up a significant portion of time before expending effort on changing this behaviour (ideally, for models of different sizes - say 50 MB, 500 MB, 5 GB)
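A quick way to do that profiling with the stdlib alone (the commented-out loader call and path are hypothetical usage against a real model):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage against a real model:
# from gensim.models import KeyedVectors
# _, vec_seconds = timed(KeyedVectors.load_word2vec_format, "wiki.en.vec")
# print("loading .vec took %.1fs" % vec_seconds)
```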
I am having some issues reading fasttext files.
from gensim.models import KeyedVectors
no_model = KeyedVectors.load_word2vec_format('wiki.no/wiki.no.vec')
The above code works, but with it I'm not able to get OOV words.
from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('wiki.no/wiki.no.vec')
model.load_binary_data('wiki.no/wiki.no.bin')
With the above code I get the error:
AttributeError: 'FastTextKeyedVectors' object has no attribute 'load_binary_data'
There are no examples in the documentation of the best way to read a fastText file and get OOV vectors.
@tmylk if so, please update the docs with your team.
@arashsa The code has been updated since the comment above, please use load_fasttext_format now
@tmylk when I try the load_fasttext_format method I get this error:
AssertionError: mismatch between vocab sizes
@arashsa There is an actual mismatch between the vocab sizes in the .vec and .bin files, so it is possible it's there for Norwegian. Please report it to the FastText project.
Does it support online training?
@rajivgrover009 our fasttext implementation - yes.
@menshikh-iv I meant continuing to train pretrained fastText models. Is it possible to use the pretrained models published by fastText and continue training, to add use-case-specific vocabulary?
@anmolgulati show me a concrete link please; probably only the main matrices are saved, i.e. you can only use the model (but can't continue training).
Trying to load the Bengali fastText model, in both .vec and .bin format.
For both formats, while trying models.KeyedVectors.load_word2vec_format() I get:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try git clean -xdf (removes all
files not under version control). Otherwise reinstall numpy.
Same error when trying FastText.load_vectors() or FastText.load_binary_data().
@sauravm8 you have an issue with your numpy installation; resolve it first and reinstall gensim afterwards.
For .bin, use https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format (this typically contains the full model with parameters, ngrams, etc.); you can continue training after loading.
For .vec, use https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format (this contains ONLY word-vectors -> no ngrams, and you can't update the model).
@menshikh-iv Yes. It was a version conflict between numpy and python. Solved.
@menshikh-iv Hi Sir,
I tried to load a model (faq.model.bin which is trained using fasttext) using gensim wrapper, code I used for loading the model :
import os
from nltk.tokenize import word_tokenize, sent_tokenize
from pprint import pprint
import re
from textblob import TextBlob
import string
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import gensim
from gensim.models import word2vec, KeyedVectors
from threading import Semaphore
import logging
import numpy as np
import os
vector_dim = 300
root_path = os.getcwd()
from nltk.tokenize import word_tokenize
import multiprocessing
def readStr():
    return raw_input().strip()

if __name__ == "__main__":
    from gensim.models.wrappers import FastText
    file1 = open("fasttext_finance.txt", "w")
    print "Loading the model"
    model_path = "/home/akash/Documents/nlp-tests/models/finance_model/faq.model.bin"
    model = FastText.load_fasttext_format(model_path)
    # model = gensim.models.fasttext.load_fasttext_format(model_path)
    print(model.most_similar('banks'))
    lis = ["income", "maturity", "tax", "mutual", "fund", "banks", "cash", "pf", "epf", "bankrupt",
           "loans", "money", "benefit", "insurance", "debt", "advantage", "sbi", "kotak", "shares",
           "food", "hotel", "retirement", "travel", "food", "health", "salary", "account", "advantage", "disadvantage"]
    for word in lis:
        N = 50
        print("Most similar words to {} are :{}\n".format(word, model.most_similar(positive=[word], topn=N)))
        file1.write("Most similar words to {} are :{}\n\n".format(word, model.most_similar(positive=[word], topn=N)))
    file1.close()
and here is the error I receive:
/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py:410: RuntimeWarning: divide by zero encountered in remainder
ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
Traceback (most recent call last):
File "testing_merged_all.py", line 31, in <module>
model = FastText.load_fasttext_format(model_path)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 271, in load_fasttext_format
model.load_binary_data(encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 297, in load_binary_data
self.load_vectors(f)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 384, in load_vectors
self.init_ngrams()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 412, in init_ngrams
self.wv.syn0_ngrams = self.wv.syn0_ngrams.take(ngram_indices, axis=0)
IndexError: index 2534933 is out of bounds for size 2534933
Is it a memory error or something else?
Apart from the above, here is the script I used to continue training a pretrained model on my own corpus, producing the model (faq.model.bin) used in the testing script above:
./fasttext supervised \
-pretrainedVectors /home/akash/Downloads/wiki.en.vec \
-input output.txt \
-dim 300 \
-output faq.model
Have I trained on the pretrained model incorrectly, or where is the error? The same model (faq.model.bin) works fine with the pyfasttext library. Please look into this.
@harrypotter0 we don't support "supervised" models. Can you share your .bin please? I want to reproduce this issue.
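For background on the traceback above: the wrapper assigns each char-ngram a matrix row via hash(ngram) % bucket, and the "divide by zero encountered in remainder" warning suggests bucket was read as 0 for this model. A sketch of the FNV-1a-style hash fastText uses for that mapping (the signed-byte XOR mirrors the int8_t cast in the C++ source; this is an illustration based on the fastText code, not a gensim API):

```python
def ft_hash(ngram):
    """FNV-1a 32-bit hash over the UTF-8 bytes of the ngram, with
    fastText's quirk of sign-extending each byte before the XOR."""
    h = 2166136261
    for b in ngram.encode("utf-8"):
        signed = b if b < 128 else b | 0xFFFFFF00  # uint32_t(int8_t(b))
        h = ((h ^ signed) * 16777619) & 0xFFFFFFFF
    return h

bucket = 2000000  # fastText's default bucket count
row = ft_hash("<wh") % bucket  # row of this ngram in the ngram matrix
```

With bucket = 0 the modulo is undefined, which matches the numpy warning in the traceback.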
How could I save a fastText model to .bin and .vec files?
@Sherriiie If you download the original pre-trained files from the official fastText website (https://fasttext.cc/docs/en/english-vectors.html) and unzip them, they are .vec files.
Regarding the .bin file, I usually use gensim to transform .vec to .bin like this:
vec_file = gensim.models.KeyedVectors.load_word2vec_format("crawl_300d_2M.vec", binary=False)
vec_file.save_word2vec_format("crawl_300d_2M.bin", binary=True)
It is much faster by using binary files.
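A toy illustration of why the binary format loads faster: the same float32 vector is both smaller on disk and free of per-value float() parsing on load (stdlib struct; the sizes below are for this example only):

```python
import struct

dim = 300
vec = [0.123456] * dim

# Binary: 4 bytes per float32, read back with one unpack call.
packed = struct.pack("%df" % dim, *vec)

# Text: roughly 9 characters per value, each needing float() on load.
as_text = " ".join("%.6f" % v for v in vec)
```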
@kevin28520 note (it may be obvious, but to avoid confusion) that a .bin produced this way is different from the .bin distributed by FB: this .bin will contain ONLY word-vectors (no ngrams); it is still equivalent to the .vec file distributed by FB.
I am using the fastText pre-trained model for the Urdu language: https://fasttext.cc/docs/en/pretrained-vectors.html
Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?
import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText
pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)
I got a similarity score of 1.8220717906951904.
When I load the .bin file:
model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)
I got a similarity score of 0.376111775636673.
@ghazeefa please open a new ticket for your problem