Fasttext: confusion regarding paper and implementation

Created on 2 Jun 2018 · 13 Comments · Source: facebookresearch/fastText

Can anyone clear up my confusion? As stated in the paper, fastText takes a window and replaces the middle word with the label, which means that the words in the window predict the label, and so on. The words of the sentence do not all participate together in predicting the label; instead, they participate as the window slides.

  • e.g. "a man is driving very fast"
    a man driving fast : ['is' is replaced by label]
    man is very fast : ['driving' is replaced by label]
  • Then at which stage does fastText compute the whole-sentence vector? When does it take the average of all the words of the sentence and pass it to the linear classifier?

All 13 comments

That's in its classifier part. Might be a different paper, don't recall right now. Just search all the fasttext cited articles to find that.

There is only one paper for text classification, i.e. "Bag of Tricks for Efficient Text Classification".

It's written there, although in a somewhat off-hand manner. Read it more thoroughly...

From the paper: "The word representations are then averaged into a text representation, which is in turn fed to a linear classifier."

So is it that they first learn word embeddings by replacing the middle word with the label, and then those word embeddings are averaged and passed to the linear classifier? If yes, then there are two weight matrices to learn:
1) embeddings
2) classifier

Am I right?

Yes. The averaged embedding per sentence is injected as the input layer to the classifier MLP. An embeddings file needs to be created or obtained prior to training a classification model (or else the classification will underperform).

@matanster can you please give a reference to the paper and the section?

https://github.com/facebookresearch/fastText#bag-of-tricks-for-efficient-text-classification, section 2. Note that you might find the wording of this section a bit opaque until you are well immersed in it.

@matanster so you mean that they first learn word embeddings, and once they have the embeddings, they learn a classifier on top of them. So why is it a two-step process? Can't we learn a multi-objective loss function that reduces the loss of the word embeddings + MLP? This is my actual confusion. And if they are learned separately, then they have not reported any experiments for the embeddings model.

@matanster thanks for giving the reference!

Hi @omerarshad,

The fastText supervised model works as follows: each word (and word ngram) is associated with a vector representation (a.k.a. embedding, of dimension 100 by default). A representation for the input text is obtained by averaging the embeddings corresponding to the words and ngrams that appear in the input. Then, a linear classifier is applied to this representation to obtain the score corresponding to each label. When training the model, both the word/ngram embeddings and the linear classifier are learned, in one step (you are correct that there are two weight matrices, which are learned jointly).
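The forward pass described above can be sketched in a few lines of plain Python. This is a toy illustration, not the fastText implementation: the names `average_embeddings` and `predict_proba`, the tiny vocabulary, the dimension of 4, and the random initialization are all assumptions made for the example, and softmax is used here as one common way to turn label scores into probabilities. Word ngrams would simply contribute extra rows to the same embedding table.

```python
import math
import random

def average_embeddings(token_ids, embeddings):
    """Text representation: mean of the embedding rows of the input tokens."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for idx in token_ids:
        for d in range(dim):
            hidden[d] += embeddings[idx][d]
    return [h / len(token_ids) for h in hidden]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(token_ids, embeddings, classifier):
    """Linear classifier on the averaged embedding: one score per label."""
    hidden = average_embeddings(token_ids, embeddings)
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in classifier]
    return softmax(scores)

# Hypothetical toy setup: 6-word vocabulary, 4 dimensions, 2 labels.
vocab = {"a": 0, "man": 1, "is": 2, "driving": 3, "very": 4, "fast": 5}
rng = random.Random(0)
dim, n_labels = 4, 2
embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
classifier = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_labels)]

# The whole sentence is used at once; no word is removed from the input.
token_ids = [vocab[w] for w in "a man is driving very fast".split()]
probs = predict_proba(token_ids, embeddings, classifier)
print(probs)
```

Because the representation is an average, word order does not affect the result in this sketch unless ngram rows are added.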

It is also possible to learn word representations on a large amount of unsupervised data, using either skipgram or cbow, and then use these representations to initialize the supervised model. When only a small amount of supervised data is available, this can improve performance significantly. Please note that we distribute such high-quality pre-trained models on our website (https://fasttext.cc).
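As a rough sketch of what "initialize the supervised model" could mean: rows of the embedding matrix are copied from pretrained vectors where a word is available, and randomly initialized otherwise. The helper names `parse_vec` and `init_embeddings`, the toy vectors, and the uniform range for unseen words are assumptions for illustration; the text format parsed here (a header line with count and dimension, then one word per line) follows the usual `.vec` layout. In the fastText command-line tool, the corresponding option is `-pretrainedVectors`.

```python
import random

def parse_vec(text):
    """Parse word vectors in text format: 'count dim' header, then 'word v1 v2 ...'."""
    lines = text.strip().splitlines()
    n_words, dim = (int(x) for x in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim

def init_embeddings(vocab, pretrained, dim, seed=0):
    """Copy pretrained rows where available; small random values otherwise (assumed scheme)."""
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-1.0 / dim, 1.0 / dim) for _ in range(dim)])
    return matrix

# Hypothetical pretrained vectors (dimension 3) and supervised vocabulary.
vec_text = "2 3\nman 0.1 0.2 0.3\nfast -0.1 0.0 0.4\n"
pretrained, vec_dim = parse_vec(vec_text)
sup_vocab = ["a", "man", "fast"]
init_matrix = init_embeddings(sup_vocab, pretrained, vec_dim)
print(init_matrix)
```

The initialized rows are only a starting point: they are still updated during supervised training.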

Regarding the sentence "This architecture is similar to the cbow model of Mikolov et al. (2013), where the middle word is replaced by a label" from our paper, it means the following. In the supervised model, we predict a label given the full input (without removing any word from the input), while in cbow, we predict the middle word, given the rest of the window. Both models use the same architecture (but the data is used differently).
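The difference in how the data is used can be made concrete by building the training examples for each case from the sentence in the original question. This is a simplified sketch: `cbow_examples` and `supervised_example` are hypothetical helpers, the window size is fixed at 2 for readability, and the label `__label__traffic` is made up for the example.

```python
def cbow_examples(tokens, window=2):
    """cbow: one (context, target word) pair per position in the sentence."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

def supervised_example(tokens, label):
    """Supervised: a single (all tokens, label) pair; nothing is removed."""
    return (list(tokens), label)

tokens = "a man is driving very fast".split()

cbow_pairs = cbow_examples(tokens, window=2)
for context, target in cbow_pairs:
    print(context, "->", target)

# Hypothetical label, written with the __label__ prefix used in fastText training files.
sup_pair = supervised_example(tokens, "__label__traffic")
print(sup_pair[0], "->", sup_pair[1])
```

In both cases the input side is a bag of words that gets averaged; only the prediction target changes, from a word in the window to a label for the whole text.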

I hope this answers your questions regarding the model.

Best,
Edouard.

Great, so one LAST question:

In the supervised (classification) model, do we have a separate error for the word embeddings, or does the same classification error backpropagate all the way to the embedding layer?

As far as I recall, the training error propagates all the way down, thus further tuning the embeddings during classification training. Does that address your question?

Hi @omerarshad,

In the supervised model, the classification error backpropagates to the embedding layer, thus fine-tuning the embeddings (as pointed out by @matanster).

Best,
Edouard
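To illustrate the point about a single classification error updating both weight matrices, here is a minimal sketch of one SGD step for an averaged-embedding model with a softmax cross-entropy loss. The function `sgd_step`, the toy weights, and the learning rate are assumptions for the example; this is not fastText's actual training code.

```python
import math

def sgd_step(token_ids, label, embeddings, classifier, lr=0.1):
    """One SGD step; updates both the classifier and the embeddings in place."""
    dim = len(embeddings[0])
    n = len(token_ids)
    # Forward pass: average embeddings, then linear scores, then softmax.
    hidden = [sum(embeddings[i][d] for i in token_ids) / n for d in range(dim)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in classifier]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    loss = -math.log(probs[label])
    # Backward pass: dL/dscore_k = p_k - 1[k == label].
    dscores = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    # Gradient w.r.t. the averaged vector, computed before the classifier changes.
    dhidden = [sum(dscores[k] * classifier[k][d] for k in range(len(classifier)))
               for d in range(dim)]
    # Update the classifier matrix.
    for k, row in enumerate(classifier):
        for d in range(dim):
            row[d] -= lr * dscores[k] * hidden[d]
    # The same error flows into every embedding that took part in the average.
    for i in token_ids:
        for d in range(dim):
            embeddings[i][d] -= lr * dhidden[d] / n
    return loss

# Hypothetical toy weights: 3 words, 2 dimensions, 2 labels.
embeddings = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.2]]
classifier = [[0.2, -0.1], [-0.3, 0.5]]
before = [row[:] for row in embeddings]

# Train on a single example that uses words 0 and 1 and has label 0.
losses = [sgd_step([0, 1], 0, embeddings, classifier) for _ in range(20)]
print(losses[0], losses[-1])
```

Only the rows for words that appear in the input are changed; the embedding of word 2, which is absent from the example, stays as it was.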
