Hi, quick question.
I see that word n-grams (i.e., -wordNgrams) are only used in supervised mode, and not in cbow or skipgram.
Is there a reason for this?
The documentation is not clear on this point, but the code calls addWordNgrams() only inside the Dictionary::getLine() used by supervised training, and not in the equivalent Dictionary::getLine() used by the unsupervised methods.
Thanks.
Hi @mino98,
Yes, you are correct: word n-grams are only used in supervised mode.
The reason is that when training unsupervised models on large amounts of data, the number of n-grams is extremely large, and most of them are not informative (e.g. all the bigrams of the type the *). Thus, using word n-grams does not significantly improve the quality of the learned models.
One way to address this issue is to only consider "informative" n-grams, such as New York City. This can be done by only keeping n-grams made of words with high mutual information. This technique is described in Section 4 of the paper Distributed Representations of Words and Phrases and their Compositionality, and we used it to train our latest English word representations (see Section 2.3 of the paper Advances in Pre-Training Distributed Word Representations for more information). In particular, we replaced high mutual information phrases such as New York City by New_York_City with probability 0.5 in the training data. We are thinking about making this part of the fastText tool.
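To make the phrase-merging idea concrete, here is a minimal Python sketch. It is not the fastText implementation and not the exact pipeline used for the released vectors. It assumes the bigram score from Section 4 of the Distributed Representations paper, (count(a b) - delta) / (count(a) * count(b)), as the "mutual information" criterion. The function names, the `delta` and `threshold` values, and the toy corpus are all hypothetical.

```python
import random
from collections import Counter


def score_bigrams(sentences, delta=1.0):
    """Score each adjacent word pair as (count(a b) - delta) / (count(a) * count(b)).

    `delta` discounts rare pairs; its value here is an illustrative assumption.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold, prob=0.5, rng=random):
    """Join high-scoring adjacent pairs with '_', each with probability `prob`.

    Merging only part of the time keeps the individual words in the
    training data as well as the merged phrase.
    """
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        is_phrase = len(pair) == 2 and scores.get(pair, 0.0) > threshold
        if is_phrase and rng.random() < prob:
            out.append("_".join(pair))
            i += 2  # skip both words of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    return out


# Toy corpus (hypothetical), already tokenized and lowercased.
corpus = [
    ["new", "york", "is", "big"],
    ["the", "new", "york", "times"],
    ["the", "cat"],
    ["the", "dog"],
]
scores = score_bigrams(corpus)
rewritten = [merge_phrases(s, scores, threshold=0.1) for s in corpus]
```

A single pass only produces two-word phrases. To get longer ones such as New_York_City, one would rerun both steps on the rewritten corpus, and the rewritten text could then be passed to skipgram or cbow training as usual.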
Please re-open this issue if you have additional questions!
Best,
Edouard
It's good thinking!
Hello @EdouardGrave. I am a bit confused about the usage of embeddings for phrases/collocations. The question is this: the paper https://arxiv.org/abs/1712.09405 mentions:
We plan to release the model containing all the phrases in the near future
So, do the latest English models on the fasttext.cc website contain embeddings for phrases such as New_York or United_States, or not? Thank you.