Fasttext: How to get vector of word out of vocabulary with python from model trained by fastText?

Created on 13 Mar 2018 · 13Comments · Source: facebookresearch/fastText

Hi,
I have a .vec file and a .bin file trained by fastText.
How can I get the vector of an out-of-vocabulary word with Python code?

yudianer

Most helpful comment

Hi @yudianer,

When using the Python bindings from the fastText repository, you can load a binary model (.bin) and then use the function get_word_vector (https://github.com/facebookresearch/fastText/blob/master/python/fastText/FastText.py#L47) to obtain a representation for out-of-vocabulary words.

I believe that when using gensim, you can also obtain a representation for OoV words by using:

oov_vector = model[oov_word]

Best,
Edouard.

EdouardGrave on 14 Mar 2018
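
As a minimal sketch of the first suggestion above, assuming the official `fasttext` Python package is installed (the thread does not say which package version is in use), the lookup could look like the following. The helper names `load_model` and `lookup` are hypothetical; `fasttext.load_model`, `get_word_id`, and `get_word_vector` are the package's own calls, and `get_word_id` returning -1 is used here to tell out-of-vocabulary words apart.

```python
def load_model(bin_path):
    """Load a fastText binary model; this needs the .bin file, not the .vec file."""
    import fasttext  # imported lazily so lookup() can be used without the package
    return fasttext.load_model(bin_path)


def lookup(model, word):
    """Return (in_vocabulary, vector) for `word`.

    `model` is expected to expose get_word_id and get_word_vector,
    as the object returned by fasttext.load_model does.
    """
    in_vocabulary = model.get_word_id(word) != -1  # -1 means "not in the dictionary"
    return in_vocabulary, model.get_word_vector(word)
```

Calling `lookup(load_model(path), word)` with the path of a trained .bin file should return a vector whether or not the word was seen during training.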


All 13 comments

1. Get similar words for the words out of vocabulary:

from gensim.models import FastText as fText
'''
 There are two files in directory "/home/jack/dev1.8t/models/vecs/":

      zhwiki15-100_200dim.vec
      zhwiki15-100_200dim.bin
'''
fastText_wv = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim") 
fastText_wv.wv.most_similar("哈哈国")

2. Get similar words for the words within vocabulary for normal usage:

from gensim.models.keyedvectors import KeyedVectors
zh_vec_model = KeyedVectors.load_word2vec_format('/home/jack/dev1.8t/models/vecs/zhwiki15-100_200dim.vec',binary=False)
zh_vec_model.most_similar("相似")

yudianer on 13 Mar 2018

Oh, that is great. Thank you, @EdouardGrave.

yudianer on 14 Mar 2018

Hi @EdouardGrave, after testing, I find that the nearest neighbors of an OOV word differ across three methods: (1) calling fastText_wv.wv.most_similar, (2) first getting the oov_vector and then querying with that vector, and (3) querying from the command line via ./fasttext nn /home/jack/dev1.8t/models/vecs/zhwiki15-120.bin.
Here are the code and the results:

from gensim.models.keyedvectors import KeyedVectors
from gensim.models import FastText as fText
fastText_wv = fText.load_fasttext_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120")
zh_vec_model = KeyedVectors.load_word2vec_format("/home/jack/dev1.8t/models/vecs/zhwiki15-120.vec")

when we use fastText_wv.wv.most_similar, we get:

[('呼和诺日', 0.37319523096084595),
 ('transmembrane', 0.36256030201911926),
 ('POTEM', 0.3559618890285492),
 ('北代', 0.35170847177505493),
 ('APK', 0.3510173559188843),
 ('Cxcr', 0.3481011986732483),
 ('RPE', 0.34661561250686646),
 ('Subsonic', 0.34622472524642944),
 ('胡硕', 0.3440268635749817),
 ('大掌柜', 0.3420392870903015)]

but when we do this via vector, zh_vec_model.similar_by_vector(fastText_wv.wv.word_vec("乌兰牧")), we get:

[('资处', 0.4318895936012268),
 ('Welfare', 0.4047064185142517),
 ('receptors', 0.4015738368034363),
 ('transmembrane', 0.39108312129974365),
 ('三道河乡', 0.3902326822280884),
 ('Kaifong', 0.3857700228691101),
 ('969P', 0.3836411237716675),
 ('969C', 0.38015681505203247),
 ('transduction', 0.3797898292541504),
 ('卡东', 0.37969958782196045)]

and when we use the command line we get:

Query word? 呼和浩
呼和浩特 0.685481
呼和浩特站 0.662498
呼和浩特人 0.65472
呼和浩特市人 0.648876
呼和浩特市 0.623726
呼和浩特局 0.622628
内蒙古自治区 0.607049
呼和诺尔镇 0.566869
呼和诺尔 0.558815
内蒙古 0.547474

Why? Thank you very much!

yudianer on 21 Mar 2018
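
Some background on why the .bin file matters for OOV queries: fastText represents a word through its character n-grams, and the vector of an OOV word is the average of its n-gram vectors. Those n-gram vectors are stored only in the .bin file; a .vec file holds one vector per vocabulary word and nothing else. The toy sketch below illustrates the idea and is not fastText's actual implementation: the hash function, the `bucket_vectors` table, and all function names are illustrative assumptions.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of '<word>' with lengths minn..maxn."""
    wrapped = "<" + word + ">"  # boundary markers around the word
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def fnv1a(text):
    """32-bit FNV-1a hash over UTF-8 bytes (a stand-in, not fastText's exact hash)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


def oov_vector(word, bucket_vectors, minn=3, maxn=6):
    """Average the bucket vectors selected by the n-grams of `word`."""
    grams = char_ngrams(word, minn, maxn)
    total = [0.0] * len(bucket_vectors[0])
    for gram in grams:
        row = bucket_vectors[fnv1a(gram) % len(bucket_vectors)]
        total = [t + r for t, r in zip(total, row)]
    return [t / len(grams) for t in total] if grams else total
```

Any two tools that read the same .bin file should agree on the vector of an OOV word only if they extract and hash the n-grams in exactly the same way, which is one thing worth checking when their results differ.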

Hi, I have the same problem. Can anyone give me some hints?

Jhangsy on 5 Jul 2018

from gensim.models import FastText
model = FastText.load_fasttext_format('the model bin file')
print(model.wv.get_vector('the word'))

I used the above code and found that the vector is different from the original vector.

qujinqiang on 22 Aug 2018
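
To quantify how different two vectors are, it can help to compare them numerically rather than by eye. The following is a small sketch with hypothetical helper names; it assumes both vectors have the same length and can be iterated as sequences of floats.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))
```

A cosine similarity close to 1 with a small maximum difference would point to rounding or normalization differences, while a low similarity would suggest the two tools compute the vector in genuinely different ways.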

Same problem as everyone above ... Maybe we should open an issue with Gensim ...

GeneZC on 20 Oct 2018

The problem reported by @yudianer is also addressed here: https://github.com/RaRe-Technologies/gensim/issues/2059

mpenkov on 8 Dec 2018

It is also advisable to create a dictionary of the word vectors, where the keys are the words and the values are the vectors.
The following code uses gensim's FastText module:

``` python
from gensim.models import FastText
import pickle

# Load trained FastText model
ft_model = FastText.load('model_path.model')

# Get vocabulary of FastText model
vocab = list(ft_model.wv.vocab)

# Get word2vec dictionary
word_to_vec_dict = {word: ft_model[word] for word in vocab}

# Save dictionary for later usage
with open('word2vec_dictionary.pickle', 'wb') as f:
    pickle.dump(word_to_vec_dict, f, protocol=pickle.HIGHEST_PROTOCOL)

# Retrieve a word
word_to_vec_dict["word"]
ft_model["word"]
# Should be the same
```

tuanle618 on 16 Jan 2019

@tuanle618 I was trying to get the vocab as you suggested.

However, I get this error:

Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.wv.vocab)
AttributeError: '_FastText' object has no attribute 'wv'

Traceback (most recent call last):
  File "fast_text_vocab.py", line 14, in <module>
    vocab = list(ft_model.vocab)
AttributeError: '_FastText' object has no attribute 'vocab'

shubhamagarwal92 on 22 Jun 2019

@shubhamagarwal92 which version of fasttext are you using? It might have a different naming of the attributes in your version. Try out tab completion for the instance "ft_model" or print out dir(ft_model) to check attributes...

tuanle618 on 23 Jun 2019
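
The class name `_FastText` in the traceback suggests that the model was loaded with the official `fasttext` Python package rather than with gensim; that object has no `wv` or `vocab` attribute. Assuming that is the case, a sketch of the same word-to-vector dictionary using that package's `get_words` and `get_word_vector` methods might look like this (`build_word_to_vec` is a hypothetical helper name):

```python
def build_word_to_vec(model):
    """Map every vocabulary word to its vector.

    `model` is expected to expose get_words() and get_word_vector(),
    as the object returned by fasttext.load_model does.
    """
    return {word: model.get_word_vector(word) for word in model.get_words()}
```

With the official package this would be called as `build_word_to_vec(fasttext.load_model(path))`, where `path` points to a .bin file.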


@shubhamagarwal92, did you figure out the problem? I am also seeing the same AttributeError.