python3
import fastText as fasttext
import os
import numpy as np
pwd = os.getcwd()
model_bin = "/home/bringtree/data/wiki.zh.bin"
model_vec = "/home/bringtree/data/wiki.zh.vec"
model = fasttext.load_model(model_bin)
word_1 = model.get_word_vector('asdhasjhdkajshd')
print(word_1[:20])
[-0.10704836 -0.5085796 -0.05533567 -0.45416433 0.36912176 -0.04111901
-0.3435909 -0.13083233 0.07110099 -0.23444724 0.26429185 0.31326798
0.20615076 -0.23127083 -0.11359369 0.21303149 -0.19785886 0.32893217
-0.14822693 0.02602408]
"asdhasjhdkajshd" is not in the training set, so I want to know: how does the model produce a vector for it?
FT (fastText) wraps each word in the boundary markers < and > and breaks it down into a bag of character n-grams, like
'awesome' => <aw, awe, wes, eso, som, ome, me>
if we set minn = maxn = 3.
Each subword n-gram is assigned a vector. When an OOV (out-of-vocabulary) word is encountered, FT builds a vector for it by combining the vectors of the subwords that make up the word (the paper sums them; the reference implementation averages them). So if you ask for the vector of awme, you get a combination of the vectors of its n-grams <aw, awm, wme and me>, two of which (<aw and me>) it shares with awesome.
This is what makes FT robust in dealing with misspelled words and internet slang.
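The decomposition and OOV lookup described above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not fastText's actual code: char_ngrams, oov_vector and the toy table of 2-dimensional vectors are hypothetical names and made-up values. As far as I know, real fastText hashes every n-gram into a fixed number of buckets, so every n-gram maps to some row; the toy version simply skips n-grams it has no entry for.

```python
def char_ngrams(word, minn=3, maxn=3):
    """Return the character n-grams of word after wrapping it in < and >."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def oov_vector(word, subword_vectors, dim, minn=3, maxn=3):
    """Average the vectors of the word's n-grams that exist in the toy table."""
    known = [subword_vectors[g]
             for g in char_ngrams(word, minn, maxn)
             if g in subword_vectors]
    if not known:
        # No usable subword: fall back to a zero vector.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical subword table; real vectors are learned during training.
toy_table = {"<aw": [1.0, 0.0], "me>": [0.0, 1.0], "awe": [0.5, 0.5]}

print(char_ngrams("awesome"))  # ['<aw', 'awe', 'wes', 'eso', 'som', 'ome', 'me>']
print(char_ngrams("awme"))     # ['<aw', 'awm', 'wme', 'me>']
print(oov_vector("awme", toy_table, dim=2))  # [0.5, 0.5]
```

The misspelling awme still gets a non-zero vector because it shares the n-grams <aw and me> with awesome.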
Also, a subword vector is not the same as a word vector: the n-gram me> taken from awesome is not the word me.
You can get the subwords of a word with model.get_subwords('asdhasjhdkajshd').
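To check this against a loaded model, something like the helper below may be useful. It is a sketch that assumes the Python binding's get_subwords returns a pair (subword strings, subword indices) and that get_input_vector(index) returns the row for one index; mean_subword_vector is a hypothetical helper name, and the exact behaviour may differ between fastText versions.

```python
def mean_subword_vector(model, word):
    """Return (subwords, mean of their input vectors) for a fastText-like model.

    Assumes model.get_subwords(word) -> (subwords, indices) and
    model.get_input_vector(index) -> sequence of floats.
    """
    subwords, indices = model.get_subwords(word)
    rows = [model.get_input_vector(int(i)) for i in indices]
    if not rows:
        return list(subwords), None
    dim = len(rows[0])
    mean = [sum(float(row[k]) for row in rows) / len(rows) for k in range(dim)]
    return list(subwords), mean


# Usage with the model loaded in the question (not run here):
# subwords, mean = mean_subword_vector(model, 'asdhasjhdkajshd')
# print(subwords)
# print(mean[:20])                                      # compare with the line below
# print(model.get_word_vector('asdhasjhdkajshd')[:20])
```

If the assumptions hold, the two printed vectors should agree up to floating-point rounding.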
The FT unsupervised model is based on the paper Enriching Word Vectors with Subword Information.