Hi,
I have a .vec file and a .bin file trained with fastText.
How can I get the vector of an out-of-vocabulary word in Python?
from gensim.models import FastText as fText
'''
There are two files in directory "/home/jack/dev1.8t/models/vecs/":
zhwiki15-100_200dim.vec
zhwiki15-100_200dim.bin
'''
fastText_wv = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim")
fastText_wv.wv.most_similar("哈哈国")
from gensim.models.keyedvectors import KeyedVectors
zh_vec_model = KeyedVectors.load_word2vec_format('/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim.vec',binary=False)
zh_vec_model.most_similar("相似")
Hi @yudianer,
When using the Python bindings from the fastText repository, you can load a binary model (.bin) and then use the function get_word_vector (https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py#L47) to obtain a representation for out-of-vocabulary words.
I believe that when using gensim, you can also obtain a representation for OoV words with:
oov_vector = model[oov_word]
Best,
Edouard.
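For context on why the .bin file is needed for OoV words: as far as I understand, fastText builds the vector of an unseen word from the vectors of its character n-grams, and those n-gram vectors are stored only in the .bin model, not in the .vec text file. Below is a toy sketch of that idea in plain Python. The function names, the random table, and the plain FNV-1a hash are assumptions for illustration; fastText's own hashing and bucket layout differ in detail, so this will not reproduce real model vectors.

```python
import random

def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' for minn <= n <= maxn.
    minn=3 and maxn=6 are assumed here; a trained model may use other values."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def fnv1a_32(text):
    """Plain 32-bit FNV-1a over UTF-8 bytes (illustrative, not fastText's exact hash)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h

def oov_vector(word, bucket_table, minn=3, maxn=6):
    """Average the bucket vectors of all character n-grams of the word."""
    dim = len(bucket_table[0])
    grams = char_ngrams(word, minn, maxn)
    if not grams:
        return [0.0] * dim  # no n-grams available, fall back to a zero vector
    total = [0.0] * dim
    for gram in grams:
        row = bucket_table[fnv1a_32(gram) % len(bucket_table)]
        for k in range(dim):
            total[k] += row[k]
    return [x / len(grams) for x in total]

# Hypothetical stand-in for the trained n-gram matrix: 50 buckets, 4 dimensions
rng = random.Random(0)
toy_table = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(50)]

vec = oov_vector("呼和浩", toy_table)
print(char_ngrams("呼和浩"))
print(vec)
```

With a real model you would not do this by hand; get_word_vector on the loaded .bin model does the lookup for you.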
Oh, that is great. Thank you, @EdouardGrave.
Hi @EdouardGrave, after testing, I find that three ways of querying an OoV word give different results: calling fastText_wv.wv.most_similar directly, first getting the oov_vector and then querying neighbours with that vector, and querying on the command line with ./fasttext nn /home/jack/dev1.8t/models/vecs/zhwiki15-120.bin.
Here are the code and the results:
from gensim.models.keyedvectors import KeyedVectors
from gensim.models import FastText as fText
wv_model = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120")
fastvec = KeyedVectors.load_word2vec_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120.vec")
When we use fastText_wv.wv.most_similar, we get:
[('呼和诺日', 0.37319523096084595),
('transmembrane', 0.36256030201911926),
('POTEM', 0.3559618890285492),
('北代', 0.35170847177505493),
('APK', 0.3510173559188843),
('Cxcr', 0.3481011986732483),
('RPE', 0.34661561250686646),
('Subsonic', 0.34622472524642944),
('胡硕', 0.3440268635749817),
('大掌柜', 0.3420392870903015)]
But when we query by vector with zh_vec_model.similar_by_vector(fastText_wv.wv.word_vec("乌兰牧")), we get:
[('资处', 0.4318895936012268),
('Welfare', 0.4047064185142517),
('receptors', 0.4015738368034363),
('transmembrane', 0.39108312129974365),
('三道河乡', 0.3902326822280884),
('Kaifong', 0.3857700228691101),
('969P', 0.3836411237716675),
('969C', 0.38015681505203247),
('transduction', 0.3797898292541504),
('卡东', 0.37969958782196045)]
And when we use the command line, we get:
Query word? 呼和浩
呼和浩特 0.685481
呼和浩特站 0.662498
呼和浩特人 0.65472
呼和浩特市人 0.648876
呼和浩特市 0.623726
呼和浩特局 0.622628
内蒙古自治区 0.607049
呼和诺尔镇 0.566869
呼和诺尔 0.558815
内蒙古 0.547474
Why? Thank you very much!
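One way to narrow this down is to take the same word in every path, fetch its vector from each loader, and compare the two vectors directly before looking at neighbour lists. The sketch below does that with cosine similarity in plain Python. The helper names and the toy tables are hypothetical; in the setup above, the two lookups would be something like fastText_wv.wv.word_vec for gensim and get_word_vector from the fastText bindings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def compare_loaders(words, get_vec_a, get_vec_b):
    """Return {word: cosine similarity} between two vector lookup functions."""
    return {w: cosine(get_vec_a(w), get_vec_b(w)) for w in words}

# Toy stand-ins for two loaders; replace with the real lookup functions
table_a = {"x": [1.0, 0.0], "y": [0.0, 2.0]}
table_b = {"x": [2.0, 0.0], "y": [1.0, 0.0]}

report = compare_loaders(["x", "y"], table_a.__getitem__, table_b.__getitem__)
for word, sim in report.items():
    print(word, round(sim, 4))
```

A similarity close to 1.0 means both loaders agree on that word; a clearly lower value for an OoV word points at how one of the loaders computes OoV vectors rather than at the neighbour search.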
Hi, I have the same problem. Can anyone give me some hints?
from gensim.models import FastText
model = FastText.load_fasttext_format('the model bin file')
print(model.wv.get_vector('the word'))
With the code above, I find that the vector is different from the original vector.
Same problem as everyone above. Maybe we should open an issue with gensim.
The problem reported by @yudianer is also addressed here: https://github.com/RaRe-Technologies/gensim/issues/2059
It is also advisable to create a dictionary of the word vectors, where the keys are the words and the values are the vectors.
The following code uses gensim's FastText module:
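``` python
from gensim.models import FastText
import pickle

# Load a model that was saved with gensim's own save() method
ft_model = FastText.load('model_path.model')
# In-vocabulary words (wv.vocab is the gensim 3.x attribute name)
vocab = list(ft_model.wv.vocab)
# Map each in-vocabulary word to its vector
word_to_vec_dict = {word: ft_model.wv[word] for word in vocab}
with open('word2vec_dictionary.pickle', 'wb') as f:
    pickle.dump(word_to_vec_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Both lookups return the same vector for an in-vocabulary word
word_to_vec_dict["word"]
ft_model.wv["word"]
```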
@tuanle618 I was trying to get the vocab as you suggested.
However, I get this error:
Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.wv.vocab)
AttributeError: '_FastText' object has no attribute 'wv'
or
Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.vocab)
AttributeError: '_FastText' object has no attribute 'vocab'
@shubhamagarwal92 which version of fasttext are you using? It might have a different naming of the attributes in your version. Try out tab completion for the instance "ft_model" or print out dir(ft_model) to check attributes...
@shubhamagarwal92, did you figure out the problem? I am also seeing the same AttributeError.
You can try
model.get_word_vector("your_word")
and do read the fastText word representation tutorial: https://fasttext.cc/docs/en/unsupervised-tutorial.html
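The AttributeError above looks like a mix-up between two APIs: the '_FastText' object comes from the official fastText Python bindings, which expose get_word_vector and get_words, while the wv attribute belongs to gensim models. Here is a small sketch of a helper that works with either kind of object. The helper names and the stub classes are hypothetical and only there so the snippet runs without a model file; the key_to_index branch is, to my knowledge, the gensim 4.x replacement for wv.vocab.

```python
def get_vector(model, word):
    """Look up a word vector on either a fastText or a gensim model object."""
    if hasattr(model, "get_word_vector"):
        # Official fastText bindings ('_FastText' object)
        return model.get_word_vector(word)
    if hasattr(model, "wv"):
        # gensim model: vectors live under .wv
        return model.wv[word]
    raise TypeError("unsupported model type: %r" % type(model))

def get_vocab(model):
    """Return the in-vocabulary words as a list."""
    if hasattr(model, "get_words"):
        return list(model.get_words())
    wv = getattr(model, "wv", None)
    if wv is not None:
        if hasattr(wv, "key_to_index"):   # assumed gensim 4.x name
            return list(wv.key_to_index)
        if hasattr(wv, "vocab"):          # gensim 3.x name, as used in the thread
            return list(wv.vocab)
    raise TypeError("unsupported model type: %r" % type(model))

# Stub objects standing in for real models (illustration only)
class StubOfficial:
    def get_word_vector(self, word):
        return [0.1, 0.2]
    def get_words(self):
        return ["a", "b"]

class StubKeyedVectors:
    vocab = {"a": None, "b": None}
    def __getitem__(self, word):
        return [0.3, 0.4]

class StubGensim:
    wv = StubKeyedVectors()

print(get_vector(StubOfficial(), "a"), get_vocab(StubOfficial()))
print(get_vector(StubGensim(), "a"), get_vocab(StubGensim()))
```

With real models you would pass the object returned by the loader you are using instead of the stubs.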