Fasttext: How does fastText predict words that are not in the training set?

Created on 5 Apr 2018 · 1 Comment · Source: facebookresearch/fastText

python3

import fastText as fasttext

# Path to the binary model; the .bin format keeps the subword vectors
model_bin = "/home/bringtree/data/wiki.zh.bin"
model = fasttext.load_model(model_bin)

# Ask for the vector of a string that is not in the vocabulary
word_1 = model.get_word_vector('asdhasjhdkajshd')
print(word_1[:20])  # first 20 dimensions
[-0.10704836 -0.5085796  -0.05533567 -0.45416433  0.36912176 -0.04111901
 -0.3435909  -0.13083233  0.07110099 -0.23444724  0.26429185  0.31326798
  0.20615076 -0.23127083 -0.11359369  0.21303149 -0.19785886  0.32893217
 -0.14822693  0.02602408]

"asdhasjhdkajshd" is not in the training set, so I want to know: how does the model produce a vector for it?

Most helpful comment

FT breaks each word down into a bag of character n-grams, after wrapping the word in the boundary markers < and >. For example,

'awesome' => <aw, awe, wes, eso, som, ome, me>
if we set minn = maxn = 3.
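The decomposition above can be sketched in a few lines of Python. This is only an illustration, not fastText's own code: the helper name char_ngrams is made up for this example, and the order in which it lists n-grams may differ from the library's when minn != maxn (the result is a bag, so order does not matter).

```python
def char_ngrams(word, minn=3, maxn=3):
    """Return the character n-grams of `word`, with boundary markers added.

    Hypothetical helper for illustration; not part of the fastText API.
    """
    wrapped = f"<{word}>"  # '<' and '>' mark the start and end of the word
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of width n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("awesome"))
# ['<aw', 'awe', 'wes', 'eso', 'som', 'ome', 'me>']
```

Because of the markers, the prefix n-gram <aw and the suffix n-gram me> are different symbols from the same letters appearing in the middle of a word.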

Each subword n-gram is assigned its own vector. When an OOV (out-of-vocabulary) word is encountered, FT builds a vector for it by combining the vectors of the n-grams that make up the word (the paper sums them; the library's get_word_vector averages them). So if you ask for a vector for awme, it is built from the n-grams <aw, awm, wme and me>, two of which (<aw and me>) are shared with awesome.
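As a rough sketch of that composition step, the toy below gives every n-gram a row in a small table of random vectors and averages the rows for a word. Everything here is an assumption made for illustration: the names (char_ngrams, fnv1a, oov_vector), the tiny DIM and BUCKETS values, and the random table standing in for trained vectors. The hash is an FNV-1a variant, used here to mimic the way fastText maps n-grams to a fixed number of buckets; it is not claimed to reproduce the library's indices exactly.

```python
import random

DIM = 4         # toy dimension; real models are much larger
BUCKETS = 1000  # toy bucket count standing in for fastText's `bucket` option


def char_ngrams(word, minn=3, maxn=3):
    """Character n-grams of '<word>' (hypothetical helper)."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def fnv1a(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h


# Random rows stand in for the n-gram vectors a trained model would hold
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]


def oov_vector(word):
    """Average the bucket rows of all n-grams of `word`."""
    rows = [TABLE[fnv1a(g) % BUCKETS] for g in char_ngrams(word)]
    if not rows:  # no n-grams at all: fall back to a zero vector
        return [0.0] * DIM
    return [sum(column) / len(rows) for column in zip(*rows)]


# Any string gets a vector, because every n-gram lands in some bucket
vector = oov_vector("asdhasjhdkajshd")
shared = sorted(set(char_ngrams("awme")) & set(char_ngrams("awesome")))
```

Hashing into buckets is why even an n-gram never seen in training still maps to some row, so a nonsense string like the one in the question always gets a vector.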

This is what makes FT robust in dealing with misspelled words and internet slang.

Also, a subword vector is not the same as a word vector: the n-gram me> is not the word me.

You can inspect the subwords of a word with model.get_subwords('asdhasjhdkajshd'), which returns the n-grams together with their indices in the input matrix.
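One way to check this against a loaded model is to rebuild the vector by hand from those subword indices. The sketch below assumes the Python binding exposes get_subwords, get_input_vector and get_dimension, and that get_word_vector averages the subword rows; the function name rebuild_word_vector is made up for this example.

```python
def rebuild_word_vector(model, word):
    """Average the input-matrix rows of the subwords reported for `word`.

    `model` is assumed to be a loaded fastText model object.
    """
    _subwords, ids = model.get_subwords(word)
    # One input-matrix row per subword index
    rows = [list(model.get_input_vector(int(i))) for i in ids]
    if not rows:  # no subwords: return a zero vector of the model's size
        return [0.0] * model.get_dimension()
    return [sum(column) / len(rows) for column in zip(*rows)]
```

With the model from the question, rebuild_word_vector(model, 'asdhasjhdkajshd') should come out close to model.get_word_vector('asdhasjhdkajshd') if the assumptions above hold; small floating-point differences are expected.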

FT's unsupervised model is based on the paper Enriching Word Vectors with Subword Information.
