fastText: ns, hs, softmax in practice

Created on 2 May 2018 · 6 Comments · Source: facebookresearch/fastText

The loss functions currently supported for training are ns, hs, and softmax, where

  • ns, Skipgram negative sampling or SGNS
  • hs, Skipgram Hierarchical softmax
  • softmax

Among the papers, an interesting and recent explanation of these methods is provided in Embeddings Learned by Gradient Descent.

The current fastText docs and tutorial do not clearly explain when to use negative sampling, softmax, or hierarchical softmax. Such guidance would be very valuable for ML practitioners, considering different problem domains for the output data, such as

  1. Binary Classification (e.g. ham or spam)
  2. MultiClass Classification (e.g. sports, nature, politics, etc.) - _we have multiple labels and we want the model to give us the most likely single label_
  3. MultiLabel Classification (e.g. car+person) - _we have multiple labels and we want multiple labels per output_

Typically a sigmoid would be used for 1) and 3), giving a probability for every class independently, while a softmax layer would be used for 2) to get a probability distribution over the N classes, followed by an argmax to pick the best-scoring class. For a large number of labels we would use hs instead to speed up training, as explained in Text Classification - Scaling things up.
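The difference between the two output layers can be illustrated with a minimal pure-Python sketch. This is not fastText code; the label names, the logits, and the 0.5 cut-off are assumptions chosen only for illustration.

```python
import math

def softmax(logits):
    """Normalise logits into a distribution that sums to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical labels and scores, not taken from any real model.
labels = ["sports", "nature", "politics"]
logits = [2.0, 1.0, -1.0]

# Case 2 (multi-class): one distribution over all labels, then argmax.
probs = softmax(logits)
best = labels[probs.index(max(probs))]

# Cases 1 and 3: each label is scored on its own and kept if it clears a cut-off.
independent = [sigmoid(z) for z in logits]
selected = [label for label, p in zip(labels, independent) if p >= 0.5]

print(best)      # single most likely label
print(selected)  # every label that clears the cut-off
```

With the softmax, raising one label's probability necessarily lowers the others; with independent sigmoids, several labels can score highly at once.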

This is the theory (a good explanation is provided in the recent book "_Deep Learning: A Practitioner's Approach_" by Adam Gibson & Josh Patterson), but when working in fastText, which are the best options, and what is the right way to handle the output probabilities for these 3 cases?

A partial recommendation was provided in https://github.com/facebookresearch/fastText/issues/478#issuecomment-380524396, where the suggested loss function for case 3) is currently the softmax. It would be possible to train with ns, but not to test with it (the test will use a softmax anyway).
Could you please provide clearer docs on these 3 important cases in text classification?

Most helpful comment

Hi @loretoparisi,

Thank you for raising this issue.

For binary classification, the softmax and the sigmoid are equivalent.
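One way to see the equivalence: a softmax over two logits gives the positive class the probability e^a / (e^a + e^b), which simplifies to 1 / (1 + e^-(a - b)), that is, a sigmoid of the logit difference. A small numerical check, with arbitrary example logits and hypothetical helper names:

```python
import math

def softmax2(z_pos, z_neg):
    """Probability of the positive class under a two-way softmax."""
    m = max(z_pos, z_neg)  # shift by the max for numerical stability
    e_pos, e_neg = math.exp(z_pos - m), math.exp(z_neg - m)
    return e_pos / (e_pos + e_neg)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary (positive, negative) logit pairs used only for the check.
pairs = [(0.3, -1.2), (2.0, 2.0), (-4.0, 1.5)]

# Largest disagreement between the two formulations over all pairs.
max_gap = max(abs(softmax2(a, b) - sigmoid(a - b)) for a, b in pairs)
print(max_gap)
```

The gap is zero up to floating-point rounding, so for two classes the choice between the two output layers does not change the predicted probabilities.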

For multi-label classification, using the softmax is equivalent to predicting the distribution of labels, which is a valid approach. The main difference between 2) and 3) in this case is at prediction time: for 2) the argmax is used, while for 3) either a fixed number of labels are predicted or all labels above a given threshold are predicted. Both options are supported by fastText, with the following:

./fasttext predict <model> <test_data> <k> <threshold>

In that case, fastText will predict at most k labels, each with a score higher than threshold. Thus, using threshold = 0.0 is equivalent to predicting a fixed number of labels, and using k = number of classes is equivalent to predicting all labels with a score higher than threshold. In practice, threshold should be chosen on a validation set.
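The interaction between k and threshold can be sketched in plain Python. This illustrates the selection rule described above and is not fastText's implementation; `select_labels` and the example scores are hypothetical, and keeping a score exactly equal to the threshold is an assumption.

```python
def select_labels(scores, k, threshold):
    """Return at most k (label, score) pairs whose score clears threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Take the k best candidates, then drop any that fall below the threshold.
    return [(label, p) for label, p in ranked[:k] if p >= threshold]

# Hypothetical softmax output for one document.
scores = {"car": 0.55, "person": 0.30, "tree": 0.10, "dog": 0.05}

# k = 1, threshold = 0.0: plain argmax, as in the multi-class case.
top_one = select_labels(scores, k=1, threshold=0.0)

# threshold = 0.0: always a fixed number of labels.
fixed_two = select_labels(scores, k=2, threshold=0.0)

# k = number of classes: every label above the threshold.
above_cut = select_labels(scores, k=len(scores), threshold=0.2)

print(top_one, fixed_two, above_cut)
```

Because softmax scores sum to 1, a document with many true labels spreads its probability mass thinly, which is one reason the threshold is best tuned on a validation set rather than fixed in advance.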

An alternative solution is to reframe the problem as one-vs-all, in which a binary classifier is trained for each label (indicating whether the label is present or not). This solution is currently not supported by fastText, and we might add it in the future. Based on experience, this approach obtains similar results to the previous one in practice.

I hope this answers your questions.

Best,
Edouard

All 6 comments

@EdouardGrave thanks a lot Edouard, this is a very important point, and also a source of discussion. The recent book "Deep Learning" by Adam Gibson (2017) also has some comments about these two approaches, and it's pretty interesting to compare them to yours. For our reference it's here.

Closing then, thanks again.

Dear @wwfwwf

Please edit and re-frame as a single question instead of writing each line as a new comment.

What is the error which you are getting?

Sorry!
I set the parameters as follows and it returns this error:
Traceback (most recent call last):
  File "3_train_and_predict.py", line 17, in <module>
    classifier = fasttext.supervised("fasttext_train-1120-v4.txt", "fasttext-1122-v4.model",dim=230,min_count=2,neg=3,word_ngrams=2,lr=0.99,bucket=5000000,thread=25,ws=20,epoch=20,loss=hs,label_prefix="__label__")
NameError: name 'hs' is not defined

Can you understand what I said? @a11apurva I mean, is it because I used the wrong parameter (loss=hs)? If not, how should I set it?

Use loss='hs' (a quoted string). The problem is that Python is reading the bare hs as a variable name in your case.
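The difference between the bare name and the quoted string can be reproduced without fastText installed. In this sketch, `train_stub` is a hypothetical stand-in for the training call and only checks the loss name; it is not part of any fastText API.

```python
def train_stub(path, loss="softmax"):
    """Hypothetical stand-in for a training call; it only validates the loss name."""
    if loss not in {"ns", "hs", "softmax"}:
        raise ValueError(f"unknown loss: {loss!r}")
    return {"input": path, "loss": loss}

try:
    # Bare name: Python looks up a variable called hs before the call is even made.
    train_stub("train.txt", loss=hs)
except NameError as err:
    error_message = str(err)

# Quoted: the string "hs" is passed through as the option value.
config = train_stub("train.txt", loss="hs")

print(error_message)
print(config)
```

The NameError is raised while the arguments are being evaluated, so the function body never runs in the failing case.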
