fastText: wordNgrams in unsupervised mode (cbow and skipgram)

Created on 29 Apr 2018 · 3 Comments · Source: facebookresearch/fastText

Hi, quick question.

I see that word n-grams (i.e., -wordNgrams) are only used in supervised mode, and not in cbow or skipgram.

Is there a reason for this?

The documentation is not clear on this point, but the code calls addWordNgrams() only inside the version of Dictionary::getLine() used by supervised training, and not in the equivalent Dictionary::getLine() used by the unsupervised methods.

Thanks.

mino98

Most helpful comment

Hi @mino98,

Yes, you are correct: word n-grams are only used in supervised mode.

The reason is that when training unsupervised models on large amounts of data, the number of n-grams is extremely large, and most of them are not informative (e.g. all the bigrams of the type the *). Thus, using word n-grams does not significantly improve the quality of the learned models.

One way to address this issue is to only consider "informative" n-grams, such as New York City. This can be done by keeping only n-grams made of words with high mutual information. This technique is described in Section 4 of the paper Distributed Representations of Words and Phrases and their Compositionality, and we used it to train our latest English word representations (see Section 2.3 of the paper Advances in Pre-Training Distributed Word Representations for more information). In particular, we replaced high mutual information phrases such as New York City by New_York_City with probability 0.5 in the training data. We are thinking about making this part of the fastText tool.

Please re-open this issue if you have additional questions!

Best,
Edouard

EdouardGrave on 24 May 2018

👍5
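
For readers who want to try the phrase-merging idea from the comment above before it lands in fastText, here is a minimal sketch in Python. It is not the fastText implementation. It assumes the bigram score from Section 4 of Distributed Representations of Words and Phrases and their Compositionality, score(a, b) = (count(a b) - delta) / (count(a) * count(b)), and merges a pair above a threshold with probability 0.5, as the comment describes. The function names, the toy corpus, and the delta and threshold values are all made up for illustration.

```python
import random
from collections import Counter


def phrase_scores(sentences, delta=1.0):
    """Score adjacent word pairs: (count(a b) - delta) / (count(a) * count(b))."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        (a, b): (n - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold, p_merge=0.5, rng=random):
    """Join adjacent pairs scoring above threshold with '_', each with probability p_merge."""
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        is_phrase = len(pair) == 2 and scores.get(pair, 0.0) > threshold
        if is_phrase and rng.random() < p_merge:
            out.append(pair[0] + "_" + pair[1])
            i += 2  # both words were consumed by the merged token
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    # Hypothetical toy corpus; real training data would be far larger.
    corpus = [
        "the new york subway is the oldest".split(),
        "the mayor of new york spoke".split(),
        "the subway and the mayor".split(),
    ]
    scores = phrase_scores(corpus, delta=1.0)
    for sentence in corpus:
        # Output varies between runs because merging is random.
        print(" ".join(merge_phrases(sentence, scores, threshold=0.2)))
```

The rewritten text can then be written back to a file and passed to skipgram or cbow training as usual. Merging only half of the time keeps the individual words (new, york) in the training data alongside the merged token. This sketch only handles bigrams; a longer phrase such as New_York_City would need another pass over the already-merged text.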

All 3 comments

That's good thinking!

eduamf on 12 Dec 2019

Hello @EdouardGrave. I am a bit confused about the use of embeddings for phrases/collocations. My question is this: the paper https://arxiv.org/abs/1712.09405 says:

We plan to release the model containing all the phrases in the near future

So, do the latest English models on the fasttext.cc website contain embeddings for phrases such as New_York or United_States, or not? Thank you.