Facebook recently open-sourced fastText (https://github.com/facebookresearch/fastText), which improves on the word2vec skip-gram model. It uses a similar output format for word-vector key/value pairs, and the similarity calculation is about the same too, but its binary output format differs from the C version's word2vec binary format. Do we want to support loading fastText model output in gensim? Thanks.
Definitely! Reading/writing the fastText word-vector format (the .vec, and perhaps consulting part of the .bin) would be an obvious first step.
(As a later step, the .bin might include enough extra info for models to continue training… though their classification modes might not map directly to the existing gensim output-layer models.)
It appears the .vec output of fastText is already compatible with the original word2vec.c text format, and readable in gensim by load_word2vec_format(filename, binary=False).
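To illustrate the compatibility, the .vec layout is the plain word2vec text format: a header line with vocab size and dimensionality, then one word plus its vector per line. A minimal stdlib-only sketch of parsing it (gensim's load_word2vec_format does the same, with more validation and a numpy-backed result):

```python
import io

def parse_vec(text):
    """Parse the word2vec/fastText text (.vec) format:
    header 'vocab_size dim', then 'word v1 v2 ... v_dim' per line."""
    lines = io.StringIO(text)
    vocab_size, dim = map(int, lines.readline().split())
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(v) for v in parts[1:]]
        assert len(values) == dim, "dimensionality mismatch"
        vectors[word] = values
    assert len(vectors) == vocab_size, "vocab size mismatch"
    return vectors

# Tiny hand-made example "file": 2 words, 3 dimensions.
sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n"
vecs = parse_vec(sample)
```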
The .bin output, written in parallel (rather than as an alternative format like in word2vec.c), seems to have extra info – such as the vectors for char-ngrams – that wouldn't map directly into gensim models unless/until they're extended with new features. So supporting the load of such info isn't a mere matter of format understanding/translation.
@gojomo thanks. I can confirm the .vec format is compatible with gensim.
See FastText comparison notebook in https://github.com/RaRe-Technologies/gensim/pull/815
@tmylk - we may want to keep this open for the larger issue of doing something with the .bin output. We might be able to map its weights (and word-frequency info) into gensim's objects, to support continued training, as a small translation-of-values patch.
Loading the buckets-of-subword-vectors, and making them usable for OOV prediction, would require a bit more actual functionality... but would still be practical. Maybe at first, the subword-buckets wind up in a different class – even perhaps a KeyedVectors variant/sibling - which would offer both subword vector lookup, by hashed key, and word-vector-reconstruction (incl. OOV words), by composition of subword vectors. (cc @droudy)
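To make the idea concrete, here is a toy sketch (not gensim's actual API; the `ngram_vectors` table is hypothetical, standing in for the buckets-of-subword-vectors) of reconstructing a vector for an OOV word by composing subword vectors:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """fastText-style char ngrams, with '<'/'>' boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the word's known char-ngrams.
    `ngram_vectors` is a plain dict here, purely for clarity."""
    hits = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not hits:
        return [0.0] * dim
    return [sum(col) / len(hits) for col in zip(*hits)]
```

fastText itself looks ngrams up by hashed bucket rather than by string key, but the composition step (averaging the subword rows) is the same idea.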
FastText just published pre-trained word vectors for 90 languages trained on Wikipedia. I am trying to load the Spanish, Basque or English models with gensim=1.0.0 and the method FastText.load_fasttext_format but I have the following error:
File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 255, in load_binary_data
self.load_dict(f)
File "/home/aritzbi/test_gensim_fastext/venv/local/lib/python2.7/site-packages/gensim/models/wrappers/fasttext.py", line 282, in load_dict
char = char.decode()
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
Should I use some other method?
cc @jayantj @tmylk We should investigate these new pre-trained models and why they fail, and also add these model files to the tests.
Thanks for reporting this issue. At first glance, it seems like the code makes an assumption that the characters constituting the vocab words can be decoded as ascii. That would be a dangerous assumption to make. Looking into it further.
And yes, adding these model files (or maybe simply models with non-ascii characters, and possibly even utf-16/utf-32 characters) to tests would be a good idea. Will do as soon as I get to the root of this issue.
Yes, confirming that this is the issue. I'll push a fix for this asap.
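For context on the failure mode: the loader read vocab words one byte at a time and decoded each byte individually, which cannot work for multi-byte UTF-8 characters (0xc3 and 0xe2 above are lead bytes of 2- and 3-byte sequences). A sketch of the general fix, accumulating raw bytes up to the NUL terminator and decoding once (illustrative, not the exact patch):

```python
import io

def read_vocab_word(f):
    """Read a NUL-terminated string from a binary stream, decoding the
    accumulated bytes in one go instead of per byte."""
    buf = bytearray()
    while True:
        ch = f.read(1)
        if ch in (b"\x00", b""):  # terminator or EOF
            break
        buf.extend(ch)
    return buf.decode("utf-8")

# 'é' is two bytes in UTF-8, so per-byte decoding would fail here.
word = read_vocab_word(io.BytesIO("café\x00next".encode("utf-8")))
```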
Fix pushed as part of #1176
I haven't been able to test loading the new pre-trained models yet, since they are rather large (~10 GB) and the download is taking forever.
I've just tested @jayantj fix with Spanish and Basque models and they are properly loaded. Thanks for the quick fix!!
I tried the pre-trained English model, with the following command:
fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
I get the following error:
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-3-e2cc3eaf9300> in <module>()
1 # We use the FastText wrapper from Gensim.
2 # Download the vectors from https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
----> 3 fasttext = FastText.load_fasttext_format('/Users/Emiel/Downloads/wiki.en/wiki.en')
4
5 # Alternatively:
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_fasttext_format(cls, model_file)
236 model = cls()
237 model.wv = cls.load_word2vec_format('%s.vec' % model_file)
--> 238 model.load_binary_data('%s.bin' % model_file)
239 return model
240
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_binary_data(self, model_binary_file)
253 with utils.smart_open(model_binary_file, 'rb') as f:
254 self.load_model_params(f)
--> 255 self.load_dict(f)
256 self.load_vectors(f)
257
/Users/Emiel/anaconda3/lib/python3.6/site-packages/gensim/models/wrappers/fasttext.py in load_dict(self, file_handle)
280 word = ''
281 char, = self.struct_unpack(file_handle, '@c')
--> 282 char = char.decode()
283 # Read vocab word
284 while char != '\x00':
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 0: unexpected end of data
This is with Gensim 1.0.0, freshly installed from PyPi, on OS X. The following code did work (but doesn't load the binary file):
fasttext = Word2Vec.load_word2vec_format('/Users/Emiel/Downloads/wiki.en/wiki.en.vec', binary=False)
Does @jayantj's fix also solve this issue? If so, should I install Gensim from GitHub, or will the patch soon also be on PyPi?
Please install from github for now
Fixed in gensim 1.0.1 available on PyPI
Fixed in #1176
Hi,
Not sure if this is the right place to ask, but here goes.
Does a trained fastText .bin file contain ngram vectors of sizes [3-6] only, or ngram vectors of all sizes? Upon loading the model with gensim and iterating through the ngrams, I found that ngrams of all sizes are present.
If ngrams of all sizes are present, my other doubt is which ngrams are used to build the vector of an out-of-vocabulary word: those of sizes [3-6], or all of them?
Hi @already-taken-m17
It depends on the hyperparameters the model was trained with - the default values of min_n and max_n are 3 and 6.
Which model are you loading, and how exactly are you iterating through ngrams?
For out-of-vocabulary words, again, ngrams of sizes [min_n, max_n] are used (and only those ngrams that were present in the ngram vocabulary of the training data).
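As a concrete illustration of that [min_n, max_n] range, a stdlib-only sketch of the char-ngrams extracted for one word (with the '<'/'>' boundary markers fastText adds):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character ngrams of the bracketed word, sizes min_n..max_n."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(min_n, max_n + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams("where")
# "<where>" yields "<wh", "whe", "her", ..., "<where", "where>";
# every ngram length stays within [3, 6].
```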
Hi @jayantj , thanks for the reply.
I trained the model with the original C++ fastText implementation (fastText GitHub repo).
I used following command to train:
./fasttext skipgram -input data.txt -output model
This should use the default parameters and store ngrams of sizes [3,6].
For iterating through ngrams, I am using model.wv.ngrams after loading the model using fasttext wrapper of gensim.
@jayantj I'm getting the same error when trying to load fasttext pre-trained "wiki.he.bin" using this command:
embedding_dict = gensim.models.KeyedVectors.load_word2vec_format(args.embedding_dictionary, binary=True, unicode_errors='ignore')
I'm getting this error:
return unicode(text, encoding, errors=errors)
File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 32: invalid start byte
@jayantj @tmylk we have a new 10TB disk on h2 -- feel free to download these "large fastText files" there, for testing.
The .bin file is not in word2vec binary format. Use the .vec file and load it with the flag binary=False. Or use the FastText wrapper:
from gensim.models.wrappers import FastText
model = FastText.load_fasttext_format(filename_without_extension)
According to the source, this is shorthand for:
from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('FILENAME.vec')
model.load_binary_data('FILENAME.bin')
The binary data is specific to the FastText algorithm.
Edit: in response to @dimeldo.
@evanmiltenburg will definitely try that, thanks.
Edit: It worked well, thanks.
Thanks for the update, good to hear it worked.
I tried the load_fasttext_format function, which takes forever to read the .vec file. fastText itself manages to load all vectors by reading just the .bin file, which is much faster. Would it be possible to skip reading the .vec file when a .bin file is present? I'm only asking because I'd prefer to use gensim for both word2vec and fastText instead of adopting another library. Thanks.
@tmylk @jayantj how does SaleStock load its data? We definitely don't want to be slower than other Python tools/wrappers.
SaleStock is using C++ code closer to original FastText. Created #1261 wishlist issue.
If it's really that annoyingly slow, we could read the code in C (Cython-compiled) -- seems easy enough.
@piskvorky @tmylk Cythonizing would be fine, but as of now, we first read info from the .vec file and then, while reading the .bin file, use an assert statement to confirm there is no mismatch with the info obtained from the .vec file. This seems unnecessary.
Yes, it should be possible to load the model only from the .bin file without having to read the .vec file.
As of now, the .vec file is used to initialize the KeyedVectors instance, including the vocabulary and the word vectors for in-vocab words.
The .bin file contains the in-vocab words too (loaded in FastText.load_dict); however, their word vectors would have to be initialized from the char-ngram vectors, as they are not directly present in the .bin file.
Changing this would require non-trivial changes to the FastText class. It could be useful to do some quick profiling to see whether loading the .vec file takes up a significant portion of time before expending effort on changing this behaviour (ideally, for models of different sizes - say 50 MB, 500 MB, 5 GB)
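A quick way to do that profiling with the stdlib alone (the commented-out loader call and path are hypothetical usage against a real model):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage against a real model:
# from gensim.models import KeyedVectors
# _, vec_seconds = timed(KeyedVectors.load_word2vec_format, "wiki.en.vec")
# print("loading .vec took %.1fs" % vec_seconds)
```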
I am having some issues reading fasttext files.
from gensim.models import KeyedVectors
no_model = KeyedVectors.load_word2vec_format('wiki.no/wiki.no.vec')
The above code works, but with it I'm not able to get OOV words.
from gensim.models.wrappers import FastText
model = FastText.load_word2vec_format('wiki.no/wiki.no.vec')
model.load_binary_data('wiki.no/wiki.no.bin')
With the above code I get the error:
AttributeError: 'FastTextKeyedVectors' object has no attribute 'load_binary_data'
There are no examples in the documentation of the best way to read a fastText file and get OOV vectors.
@tmylk if so, please update the docs with your team.
@arashsa The code has been updated since the comment above, please use load_fasttext_format now
@tmylk when I try the load_fasttext_format method I get this error:
AssertionError: mismatch between vocab sizes
@arashsa There is an actual mismatch between the vocab sizes in the .vec and .bin files, so it is possible it's there for Norwegian. Please report it to the FastText project.
Does it support online training?
@rajivgrover009 our fasttext implementation - yes.
@menshikh-iv I meant continuing to train pretrained fastText models. Is it possible to use the pretrained models published by fastText and continue training, to add use-case-specific vocabulary?
@anmolgulati show me a concrete link please; probably only the main matrices are saved, i.e. you can only use the model (but can't continue training).
Trying to load the Bengali fastText model, in both .vec and .bin format.
For both formats, while trying models.KeyedVectors.load_word2vec_format() I get:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try git clean -xdf (removes all
files not under version control). Otherwise reinstall numpy.
Same error when trying FastText.load_vectors() or FastText.load_binary_data().
@sauravm8 you have an issue with your numpy installation; resolve it first and reinstall gensim afterwards.
For .bin, use https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format (this typically contains the full model with parameters, ngrams, etc.); you can continue training after loading.
For .vec, use https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.Word2VecKeyedVectors.load_word2vec_format (this contains ONLY word-vectors -> no ngrams, and you can't update the model).
@menshikh-iv Yes. It was a version conflict between numpy and python. Solved.
@menshikh-iv Hi Sir,
I tried to load a model (faq.model.bin which is trained using fasttext) using gensim wrapper, code I used for loading the model :
import os
from nltk.tokenize import word_tokenize, sent_tokenize
from pprint import pprint
import re
from textblob import TextBlob
import string
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import gensim
from gensim.models import word2vec, KeyedVectors
from threading import Semaphore
import logging
import numpy as np
import os
vector_dim = 300
root_path = os.getcwd()
from nltk.tokenize import word_tokenize
import multiprocessing
def readStr():
    return raw_input().strip()

if __name__ == "__main__":
    from gensim.models.wrappers import FastText
    file1 = open("fasttext_finance.txt", "w")
    print "Loading the model"
    model_path = "/home/akash/Documents/nlp-tests/models/finance_model/faq.model.bin"
    model = FastText.load_fasttext_format(model_path)
    # model = gensim.models.fasttext.load_fasttext_format(model_path)
    print(model.most_similar('banks'))
    lis = ["income", "maturity", "tax", "mutual", "fund", "banks", "cash", "pf", "epf", "bankrupt",
           "loans", "money", "benefit", "insurance", "debt", "advantage", "sbi", "kotak", "shares",
           "food", "hotel", "retirement", "travel", "food", "health", "salary", "account", "advantage", "disadvantage"]
    for word in lis:
        N = 50
        print("Most similar words to {} are :{}\n".format(word, model.most_similar(positive=[word], topn=N)))
        file1.write("Most similar words to {} are :{}\n\n".format(word, model.most_similar(positive=[word], topn=N)))
    file1.close()
and here is the error I receive:
/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py:410: RuntimeWarning: divide by zero encountered in remainder
ngram_indices.append(len(self.wv.vocab) + ngram_hash % self.bucket)
Traceback (most recent call last):
File "testing_merged_all.py", line 31, in <module>
model = FastText.load_fasttext_format(model_path)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 271, in load_fasttext_format
model.load_binary_data(encoding=encoding)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 297, in load_binary_data
self.load_vectors(f)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 384, in load_vectors
self.init_ngrams()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/deprecated/fasttext_wrapper.py", line 412, in init_ngrams
self.wv.syn0_ngrams = self.wv.syn0_ngrams.take(ngram_indices, axis=0)
IndexError: index 2534933 is out of bounds for size 2534933
Is it a memory error or something else?
Apart from the above, here is the script I used to continue training a pretrained model on my own corpus, producing the model (faq.model.bin) used in the testing script above:
./fasttext supervised \
-pretrainedVectors /home/akash/Downloads/wiki.en.vec \
-input output.txt \
-dim 300 \
-output faq.model
Have I trained on the pretrained model incorrectly, or where is the error? The same model (faq.model.bin) works fine with the pyfasttext library. Please look into this.
@harrypotter0 we don't support "supervised" models. Can you share your .bin please? I want to reproduce this issue.
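For background on the traceback above: the wrapper assigns each char-ngram a matrix row via hash(ngram) % bucket, and the "divide by zero encountered in remainder" warning suggests bucket was read as 0 for this model. A sketch of the FNV-1a-style hash fastText uses for that mapping (the signed-byte XOR mirrors the int8_t cast in the C++ source; this is an illustration based on the fastText code, not a gensim API):

```python
def ft_hash(ngram):
    """FNV-1a 32-bit hash over the UTF-8 bytes of the ngram, with
    fastText's quirk of sign-extending each byte before the XOR."""
    h = 2166136261
    for b in ngram.encode("utf-8"):
        signed = b if b < 128 else b | 0xFFFFFF00  # uint32_t(int8_t(b))
        h = ((h ^ signed) * 16777619) & 0xFFFFFFFF
    return h

bucket = 2000000  # fastText's default bucket count
row = ft_hash("<wh") % bucket  # row of this ngram in the ngram matrix
```

With bucket = 0 the modulo is undefined, which matches the numpy warning in the traceback.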
How could I save a fastText model to .bin and .vec files?
@Sherriiie If you download the original pre-trained files from the official fastText website (https://fasttext.cc/docs/en/english-vectors.html) and unzip them, they are .vec files.
Regarding the .bin file, I usually use gensim to transform .vec to .bin like this:
vec_file = gensim.models.KeyedVectors.load_word2vec_format("crawl_300d_2M.vec", binary=False)
vec_file.save_word2vec_format("crawl_300d_2M.bin", binary=True)
It is much faster by using binary files.
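A toy illustration of why the binary format loads faster: the same float32 vector is both smaller on disk and free of per-value float() parsing on load (stdlib struct; the sizes below are for this example only):

```python
import struct

dim = 300
vec = [0.123456] * dim

# Binary: 4 bytes per float32, read back with one unpack call.
packed = struct.pack("%df" % dim, *vec)

# Text: roughly 9 characters per value, each needing float() on load.
as_text = " ".join("%.6f" % v for v in vec)
```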
@kevin28520 note (it may be obvious, but to avoid confusion) that a .bin produced this way is different from the .bin distributed by FB: this .bin will contain ONLY word-vectors (no ngrams); it is still equivalent to the .vec file distributed by FB.
I am using the fastText pre-trained model for the Urdu language: https://fasttext.cc/docs/en/pretrained-vectors.html
Why am I getting different vectors from the .bin and .vec files? Which one should I use to evaluate the model?
import gensim.models.keyedvectors as word2vec1
from scipy import spatial
from gensim.models import FastText
pathToBinVectors = 'C:/Users/admin/fasttextwiki/wiki.ur.vec'
embed_map = word2vec1.KeyedVectors.load_word2vec_format(pathToBinVectors)
gg = embed_map.wv.get_vector('سائیکل')
hh = embed_map.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)
I got a similarity score of 1.8220717906951904.
When I load the .bin file:
model = FastText.load_fasttext_format('C:/Users/admin/fasttextwiki/wiki.ur.bin')
gg = model.wv.get_vector('سائیکل')
hh = model.wv.get_vector('گاڑی')
a=1-spatial.distance.cosine(gg,hh)
print(a*4)
I got a similarity score of 0.376111775636673.
@ghazeefa please open a new ticket for your problem