In the source code, textVector() and sentenceVector() both generate a vector for a sequence of words. If the model is a supervised text classification model, textVector() is used; otherwise sentenceVector() is used. The only difference between them is that the sentence vector is normalized and the text vector is not. What is the reason for this?
I'm following up on this question because mine is related.
echo -e "__1 one two three\n__0 four five six" > train.txt
./fasttext/fasttext supervised -input train.txt -output model -dim 2 -label __
echo "one two" | ./fastText/fasttext print-word-vectors model.bin
one -0.46124 -0.36163
two 0.092583 -0.19925
echo "one two" | ./fastText/fasttext print-sentence-vectors model.bin
-0.27689 -0.083967
My result for the sentence vector: -0.18278494 -0.761943035
Thanks,
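For reference, the hand-computed figure above is consistent with dividing each of the two printed word vectors by its L2 norm and then averaging them, with no end-of-sentence token. A minimal sketch using only NumPy and the values printed above (the helper name normalized_average is purely illustrative and is not a fastText API):

```python
import numpy as np

# Word vectors as printed by print-word-vectors in the post above.
one = np.array([-0.46124, -0.36163])
two = np.array([0.092583, -0.19925])

def normalized_average(vectors):
    """Divide each vector by its L2 norm, then average (illustrative helper)."""
    unit = [v / np.linalg.norm(v) for v in vectors]
    return np.mean(unit, axis=0)

manual = normalized_average([one, two])
print(manual)  # close to the hand-computed -0.18278494 -0.761943035
```

This differs from the print-sentence-vectors output because, per the reply that follows, a supervised model averages the unnormalized vectors and also counts the end-of-sentence token.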
Hello @XuelinZeng and @ilaxes,
Thank you for your post. The difference between getWordVector and getSentenceVector is that the latter uses getWordVector to assemble a single vector for a sequence of tokens (words).
That means you use getWordVector if you want an embedding for a single word and getSentenceVector if you want an embedding for a sentence (a sequence of words).
For a supervised model, getSentenceVector will simply average the word vectors for each word in a line of text. For all other models (cbow and skipgram), getSentenceVector will divide each word vector by its norm and then average them. Now, it is important to keep in mind that any sentence will end with a newline. That means "one two" actually translates into the vectors for "one", "two" and EOS.
$ ./fasttext supervised -input dbpedia.train -output model -thread 10 -epoch 8 -verbose 2 -dim 2
Read 32M words
Number of words: 803537
Number of labels: 14
Progress: 100.0% words/sec/thread: 3130504 lr: 0.000000 loss: 0.507742 ETA: 0h 0m
$ ./fasttext print-sentence-vectors model.bin
one two
2.6772 -3.0886
$ ./fasttext print-word-vectors model.bin
one
one -0.0065439 0.1416
two
two 0.086216 0.10804
</s>
</s> 7.9518 -9.5155
We now have (-0.0065439 + 0.086216 + 7.9518) / 3 = 2.6772 and (0.1416 + 0.10804 + -9.5155) / 3 = -3.0886 as expected for a supervised model.
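The arithmetic above can be checked with a few lines of NumPy. This is only a sketch that re-averages the three vectors printed in the session above; it does not call fastText itself, and the dictionary name word_vectors is an assumption made for illustration.

```python
import numpy as np

# Vectors printed by print-word-vectors in the session above (dim 2).
word_vectors = {
    "one": np.array([-0.0065439, 0.1416]),
    "two": np.array([0.086216, 0.10804]),
    "</s>": np.array([7.9518, -9.5155]),
}

# Supervised case as described above: a plain mean over the tokens plus EOS.
tokens = ("one", "two", "</s>")
sentence_vector = np.mean([word_vectors[t] for t in tokens], axis=0)
print(np.round(sentence_vector, 4))
```

Rounded to four decimals, the result matches the 2.6772 -3.0886 line printed by print-sentence-vectors.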
I'm closing this issue now as I consider it resolved, but please feel encouraged to reopen it at any time if you don't.
Thanks,
Christian
Hi,
I used minn=2, maxn=2 and trained my supervised model.
When I used print-word-vectors, I got
a 1.2892 0.35762
</s> -4.0258 4.9202
but print-sentence-vectors showed
-0.063226 1.2798
Why not (1.2892 + -4.0258) / 2 = -1.3683?
Hi @cpuhrsch, thanks for the helpful comment. The case you described probably works when wordNgrams=1. How is the sentence vector calculated when, for example, wordNgrams=2? I suppose the n-gram vectors are also included in the average. However, I am trying to understand this with just one word, and when I calculate it I find that (word_vec('x') + word_vec('</s>'))/2 != sent_vec('x').
Thanks in advance.
Hi @cpuhrsch, what kind of norm is the unsupervised model using? Also, is this </s> necessary when training on a corpus?
Solved
After looking at the code at https://github.com/facebookresearch/fastText/blob/master/src/vector.cc#L35
I found that it is the L2 norm.
Also, there is no </s> for unsupervised models.
After hours of searching, I think I need to clarify this.
From @cpuhrsch's comment:
For all other models (cbow and skipgram) getSentenceVector will divide each word vector by its norm and then average them
Note that the averaging process involves dividing each word vector by its norm, which is why your results do not match, @seanappler @sipan17. You can see the source code here
The code above also shows that getSentenceVector only averages the vectors that have a positive L2 norm (see the variable count). For example, if you use cc.en.300, a newline "\n" has an L2 norm of 0. So if your sentence is only "x", the sum of the vectors is divided by 1 (not 2).
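The rule described above (divide each vector by its L2 norm, skip vectors whose norm is zero, then divide the sum by the number of vectors kept) can be sketched in plain NumPy. This is an illustration of the description in this comment, not the library's code; the function name average_unit_vectors and the toy vectors are assumptions.

```python
import numpy as np

def average_unit_vectors(vectors):
    """Average L2-normalized vectors, skipping those with zero norm."""
    vectors = [np.asarray(v, dtype=float) for v in vectors]
    total = np.zeros_like(vectors[0])
    count = 0  # number of vectors that actually contribute
    for v in vectors:
        norm = np.linalg.norm(v)
        if norm > 0:
            total += v / norm
            count += 1
    return total / count if count > 0 else total

# Toy data: one word vector and an all-zero EOS vector (hypothetical values).
x = np.array([3.0, 4.0])
eos = np.zeros(2)
result = average_unit_vectors([x, eos])
print(result)  # [0.6 0.8]
```

Because the zero-norm EOS vector is skipped, the sum is divided by 1 rather than 2, which is the situation described for a one-word sentence.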
The approach @rianrajagede is suggesting does not necessarily hold, as it doesn't work in my case, where minn and maxn are 4 in a fastText supervised classification model. The caveat here is that my model is in fact a supervised model, so simple averaging should work, but it doesn't.
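import numpy as np
from numpy import linalg as LA

# ft_model_2 is the supervised fastText model loaded earlier (not shown in this post)
sentence_vector_2 = ft_model_2.get_sentence_vector('cordless drills')
# get_subwords returns (subwords, indices); [0] keeps only the subword strings
n_grams = ft_model_2.get_subwords('cordless')[0] + ft_model_2.get_subwords('drills')[0] + ft_model_2.get_subwords('</s>')[0]

def div_norm(x):
    """Return x scaled to unit L2 norm, or None if its norm is zero."""
    norm_value = LA.norm(x)
    if norm_value > 0:
        return x * (1.0 / norm_value)
    return None

print('N-Grams:', n_grams)
start = np.zeros(100)
count = 0
for word in n_grams:
    add = div_norm(ft_model_2[word])
    # Only accumulate vectors that had a positive norm
    if add is not None:
        start += add
        count += 1
print(count)
recreated_2 = start / count
print(np.round(recreated_2, 3) == np.round(sentence_vector_2, 3))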
This yields an array of False values.
N-Grams: ['cordless', '<cor', 'cord', 'ordl', 'rdle', 'dles', 'less', 'ess>', 'drills', '<dri', 'dril', 'rill', 'ills', 'lls>', '</s>']
15
[False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False False False False False False False False False
False False False False]
I have tried everything I can think of, including simple averaging and averaging vectors divided by their L2 norms, and am still pretty unclear on how exactly fastText creates the word vector in subword cases. If anyone has any insight on this, please share. Much appreciated.
@cpuhrsch you are right for the simple case of minn and maxn being 0. However, this does not hold when minn and maxn are greater than 0 in a supervised model, where the sentence vector does not seem to be a simple average of the n-gram vectors. This issue should not be closed. For evidence, see https://github.com/facebookresearch/fastText/issues/966.
Has there been any clarification provided on this?