fastText: How to get the vector of an out-of-vocabulary word in Python from a model trained by fastText?

Created on 13 Mar 2018  ·  13 Comments  ·  Source: facebookresearch/fastText

Hi,
I have a .vec file and a .bin file trained by fastText.
How can I get the vector of an out-of-vocabulary word with Python code?

Most helpful comment

Hi @yudianer,

When using the Python bindings from the fastText repository, you can load a binary model (.bin) and then use the function get_word_vector (https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py#L47) to obtain a representation for out-of-vocabulary words.

I believe that when using gensim, you can also obtain a representation for OoV words by using:

oov_vector = model[oov_word]

Best,
Edouard.
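
For readers wondering why a .bin model can return a vector for a word it never saw: fastText builds word vectors from character n-grams, so an unseen word can still be composed from the n-grams it shares with known words. The sketch below is a toy illustration of that idea only. The names `char_ngrams`, `oov_vector` and `toy_table` are hypothetical, and the plain dict stands in for fastText's real hashed n-gram matrix.

```python
# Toy illustration only: real fastText hashes n-grams into a fixed number of
# buckets of a trained matrix; a plain dict stands in for that table here.

def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>'."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def oov_vector(word, ngram_vectors, dim):
    """Average the vectors of the n-grams found in the (hypothetical) table."""
    total = [0.0] * dim
    found = 0
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            found += 1
    return [t / found for t in total] if found else total


# Hypothetical 2-dimensional n-gram table, made up for this example.
toy_table = {"<ca": [1.0, 0.0], "cat": [0.0, 1.0], "at>": [1.0, 1.0]}

print(char_ngrams("cat", minn=3, maxn=3))    # n-grams of "<cat>"
print(oov_vector("cats", toy_table, dim=2))  # "cats" is not a key, yet gets a vector
```

This is also why only the .bin file works for OoV words: the .vec file is a text file of whole-word vectors and carries no n-gram information.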

All 13 comments

1. Get similar words for a word that is out of vocabulary:

from gensim.models import FastText as fText
'''
 There are two files in directory "/home/jack/dev1.8t/models/vecs/":

      zhwiki15-100_200dim.vec
      zhwiki15-100_200dim.bin
'''
fastText_wv = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim") 
fastText_wv.wv.most_similar("哈哈国")

2. Get similar words for a word within the vocabulary (normal usage):

from gensim.models.keyedvectors import KeyedVectors
zh_vec_model = KeyedVectors.load_word2vec_format('/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim.vec',binary=False)
zh_vec_model.most_similar("相似")

Oh, that is great. Thank you, @EdouardGrave.

Hi @EdouardGrave, after testing, I find that three ways of looking up the neighbors of an OoV word give different results: (1) calling fastText_wv.wv.most_similar directly, (2) first getting the oov_vector and then querying by that vector, and (3) querying on the command line via ./fasttext nn /home/jack/dev1.8t/models/vecs/zhwiki15-120.bin.
Here are the code and the results:

from gensim.models.keyedvectors import KeyedVectors
from gensim.models import FastText as fText
wv_model = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120")
fastvec = KeyedVectors.load_word2vec_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120.vec")

when we use fastText_wv.wv.most_similar, we get:

[('呼和诺日', 0.37319523096084595),
 ('transmembrane', 0.36256030201911926),
 ('POTEM', 0.3559618890285492),
 ('北代', 0.35170847177505493),
 ('APK', 0.3510173559188843),
 ('Cxcr', 0.3481011986732483),
 ('RPE', 0.34661561250686646),
 ('Subsonic', 0.34622472524642944),
 ('胡硕', 0.3440268635749817),
 ('大掌柜', 0.3420392870903015)]

but when we do this via vector, zh_vec_model.similar_by_vector(fastText_wv.wv.word_vec("乌兰牧")), we get:

[('资处', 0.4318895936012268),
 ('Welfare', 0.4047064185142517),
 ('receptors', 0.4015738368034363),
 ('transmembrane', 0.39108312129974365),
 ('三道河乡', 0.3902326822280884),
 ('Kaifong', 0.3857700228691101),
 ('969P', 0.3836411237716675),
 ('969C', 0.38015681505203247),
 ('transduction', 0.3797898292541504),
 ('卡东', 0.37969958782196045)]

and when we use the command line we get:

Query word? 呼和浩
呼和浩特 0.685481
呼和浩特站 0.662498
呼和浩特人 0.65472
呼和浩特市人 0.648876
呼和浩特市 0.623726
呼和浩特局 0.622628
内蒙古自治区 0.607049
呼和诺尔镇 0.566869
呼和诺尔 0.558815
内蒙古 0.547474

Why? Thank you very much!
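
One way to narrow this down is to rank neighbors by hand from a single vector, so that every method is fed exactly the same query. Note that, as posted, the gensim call queries "乌兰牧" while the command-line session queries "呼和浩", so those two lists would differ in any case. Below is a minimal sketch of a cosine-similarity ranking. The helper names and the toy vectors are made up for illustration, and it is not a claim about how gensim or the fasttext CLI compute their scores internally.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_vec, table, topn=3):
    """Rank the words in `table` (word -> vector) by similarity to `query_vec`."""
    scored = [(word, cosine(query_vec, vec)) for word, vec in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical 2-dimensional vectors, only to make the sketch runnable.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(nearest([2.0, 0.1], toy_vectors, topn=2))
```

Feeding the vector returned by each library for the same word into one such ranking makes it easier to see whether the vectors themselves differ or only the neighbor search does.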

Hi, I have the same problem. Can anyone give me some hints?

from gensim.models import FastText
model = FastText.load_fasttext_format('the model bin file')
print(model.wv.get_vector('the word'))

I use the code above and find that the vector is different from the original vector.

Same problem as everyone above... Maybe we should open an issue with gensim...

The problem reported by @yudianer is also addressed here: https://github.com/RaRe-Technologies/gensim/issues/2059

It is also advisable to create a dictionary of the word vectors, where the keys are the words and the values are the vectors.
The following code uses gensim's FastText module:

``` python
from gensim.models import FastText
import pickle

# Load trained FastText model
ft_model = FastText.load('model_path.model')

# Get vocabulary of FastText model
# (gensim 3.x attribute; gensim 4.x exposes ft_model.wv.key_to_index instead)
vocab = list(ft_model.wv.vocab)

# Get word2vec dictionary
word_to_vec_dict = {word: ft_model.wv[word] for word in vocab}

# Save dictionary for later usage
with open('word2vec_dictionary.pickle', 'wb') as f:
    pickle.dump(word_to_vec_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Retrieve a word
word_to_vec_dict["word"]
ft_model.wv["word"]
# Should be the same
```

@tuanle618 I was trying to get the vocab as you suggested.

However, I get this error:

Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.wv.vocab)
AttributeError: '_FastText' object has no attribute 'wv'

or

Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.vocab)
AttributeError: '_FastText' object has no attribute 'vocab'

@shubhamagarwal92 which version of fasttext are you using? The attributes might be named differently in your version. Try tab completion on the instance "ft_model", or print dir(ft_model) to check the attributes...
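
The `_FastText` in the traceback suggests the model was loaded with the official fastText bindings rather than gensim, and that object has no `wv` attribute. A minimal sketch of probing for the right accessor follows. It assumes the official bindings expose `get_words()`, gensim 4.x exposes `wv.key_to_index`, and gensim 3.x exposes `wv.vocab`. `list_vocabulary` and `FakeOfficialModel` are hypothetical names used only here.

```python
def list_vocabulary(model):
    """Return the vocabulary as a list, trying known accessors in turn."""
    # Official fastText bindings: the model object exposes get_words().
    if hasattr(model, "get_words"):
        return list(model.get_words())
    wv = getattr(model, "wv", None)
    # gensim 4.x keeps the vocabulary in wv.key_to_index.
    if wv is not None and hasattr(wv, "key_to_index"):
        return list(wv.key_to_index)
    # gensim 3.x keeps it in wv.vocab.
    if wv is not None and hasattr(wv, "vocab"):
        return list(wv.vocab)
    # Nothing matched: report the public attributes, as dir(ft_model) would show.
    public = [name for name in dir(model) if not name.startswith("_")]
    raise AttributeError(f"No known vocabulary accessor; public attributes: {public}")


class FakeOfficialModel:
    """Hypothetical stand-in that mimics only get_words()."""

    def get_words(self):
        return ["hello", "world"]


print(list_vocabulary(FakeOfficialModel()))
```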

@shubhamagarwal92, did you figure out the problem? I am also seeing the same AttributeError.

You can try
model.get_word_vector("your_word")

Also, do read the fastText word representation tutorial: https://fasttext.cc/docs/en/unsupervised-tutorial.html
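
Putting this together for the official bindings, here is a hedged sketch of the word-to-vector dictionary idea from earlier in the thread, using `get_words()` and `get_word_vector()` instead of gensim's `wv`. The function names, the `StubModel` class and the placeholder path are assumptions for illustration. The stub exists only so the sketch runs without a trained model.

```python
def build_word_vector_dict(model):
    """Map every in-vocabulary word to its vector."""
    return {word: model.get_word_vector(word) for word in model.get_words()}


def load_and_lookup(model_path, word):
    """Load a .bin model with the official bindings and return one word's vector."""
    import fasttext  # imported lazily; requires the fasttext package

    model = fasttext.load_model(model_path)
    # Works for in-vocabulary and out-of-vocabulary words alike.
    return model.get_word_vector(word)


class StubModel:
    """Hypothetical stand-in exposing the two methods used above."""

    def get_words(self):
        return ["cat", "dog"]

    def get_word_vector(self, word):
        return [float(len(word)), 0.0]


print(build_word_vector_dict(StubModel()))
```

With a real model, `load_and_lookup("model.bin", "your_word")` would be the call to try, where "model.bin" is a placeholder path.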
