The loss functions currently supported for training are ns (negative sampling), hs (hierarchical softmax), and softmax.
Among the papers, an interesting and recent explanation of these methods is provided in Embeddings Learned by Gradient Descent.
By the way, the current fastText docs and tutorial give no clear explanation of when to use negative sampling, softmax, or hierarchical softmax. For ML practitioners this would be very valuable, considering the different forms the output data can take:
1) binary classification (ham or spam)
2) multi-class classification (sports, nature, politics, etc.) - _we have multiple labels and we want the model to give us the most likely single label_
3) multi-label classification (car+person) - _we have multiple labels and we want multiple labels per output_
Typically for 1) and 3) the sigmoid would be used, giving us a probability for every class independently, while a softmax layer would be used for 2) to get the probability distribution over the N classes, plus an argmax to pick the best-scoring class. For a large number of labels we would use hs instead to speed up training, as explained in Text Classification - Scaling things up.
That is the theory (a good explanation is provided in the recent book "_Deep Learning: A Practitioner's Approach_" by Adam Gibson and Josh Patterson), but when working with fastText, which are the best options, and what is the right way to handle the output probabilities in these 3 cases?
A partial recommendation was provided in https://github.com/facebookresearch/fastText/issues/478#issuecomment-380524396, where the suggested loss function for case 3) is softmax at this time. It would be possible to train with ns, but not to test with it (the test will use a softmax anyway).
Could you please provide clearer docs about these 3 important cases in text classification?
Hi @loretoparisi,
Thank you for raising this issue.
For binary classification, the softmax and the sigmoid are equivalent.
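The equivalence can be checked numerically. The following is a minimal sketch in plain Python, not fastText code: the helper names `softmax` and `sigmoid` and the two example logits are assumptions made for illustration. With two classes, the softmax probability of the first class equals the sigmoid of the difference between the two logits.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for the two classes (for example ham and spam).
z_pos, z_neg = 1.7, -0.4

p_softmax = softmax([z_pos, z_neg])[0]  # two-class softmax, first class
p_sigmoid = sigmoid(z_pos - z_neg)      # sigmoid of the logit difference

# The two probabilities agree up to floating-point rounding.
print(abs(p_softmax - p_sigmoid) < 1e-12)
```

This follows from dividing the numerator and denominator of the two-class softmax by exp(z_pos).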
For multi-label classification, using the softmax is equivalent to predicting the distribution of labels, which is a valid approach. The main difference between 2) and 3) in this case is at prediction time: for 2) the argmax is used, while for 3) either a fixed number of labels is predicted or all labels above a given threshold are predicted. Both options are supported by fastText, with the following command:
./fasttext predict <model> <test_data> <k> <threshold>
In that case, fastText will predict at most k labels, which have a score higher than threshold. Thus using threshold = 0.0 is equivalent to predicting a fixed number of labels and using k = number of classes is equivalent to predicting all labels with a score higher than threshold. In practice, threshold should be chosen on a validation set.
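The selection rule described above can be sketched in plain Python. This is an illustration of the k/threshold logic only, not fastText's implementation: the function name `select_labels`, the example scores, and the use of `>=` for the threshold comparison are assumptions of this sketch.

```python
def select_labels(scores, k, threshold):
    """Return at most k (label, score) pairs whose score passes the threshold.

    `scores` maps each label to its predicted probability. The `>=`
    comparison is an assumption of this sketch, chosen so that
    threshold = 0.0 never filters anything out.
    """
    kept = [(label, p) for label, p in scores.items() if p >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best scores first
    return kept[:k]

# Hypothetical softmax output over four labels.
scores = {"car": 0.55, "person": 0.30, "tree": 0.10, "dog": 0.05}

# threshold = 0.0: always the top k labels (a fixed number of labels).
fixed_k = select_labels(scores, k=2, threshold=0.0)

# k = number of classes: every label that passes the threshold.
above = select_labels(scores, k=len(scores), threshold=0.08)
```

In practice the threshold passed here would be the value tuned on a validation set, as the comment above recommends.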
An alternative solution is to reframe the problem as one-vs-all, in which a binary classifier is trained for each label (indicating whether the label is present or not). This solution is currently not supported by fastText, and we might add it in the future. Based on experience, this approach obtains similar results to the previous one in practice.
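As a rough sketch of what the one-vs-all reframing could look like on the data side, the snippet below turns a multi-label dataset into one binary dataset per label. Since fastText does not support this approach directly, everything here is hypothetical: the function name `one_vs_all_datasets`, the `present`/`absent` label names, and the example sentences are assumptions; only the `__label__` prefix comes from this thread.

```python
def one_vs_all_datasets(examples, prefix="__label__"):
    """Build one binary training set per label.

    `examples` is a list of (set_of_labels, text) pairs. For each label,
    every text is tagged as `present` or `absent` (hypothetical names).
    """
    all_labels = sorted({label for labels, _ in examples for label in labels})
    datasets = {}
    for target in all_labels:
        lines = []
        for labels, text in examples:
            tag = "present" if target in labels else "absent"
            lines.append(f"{prefix}{tag} {text}")
        datasets[target] = lines
    return datasets

# Hypothetical multi-label examples.
examples = [
    ({"car", "person"}, "a man driving a red car"),
    ({"person"}, "a woman reading a book"),
    ({"car"}, "an empty parking lot with one car"),
]

datasets = one_vs_all_datasets(examples)
for line in datasets["car"]:
    print(line)
```

Each per-label list could then be written to its own file and used to train a separate binary classifier, at the cost of training one model per label.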
I hope this answers your questions.
Best,
Edouard
@EdouardGrave thanks a lot, Edouard. This is a very important point and also a source of discussion. The recent book by Adam Gibson, "Deep Learning" (2017), also has some comments about these two approaches, and it's pretty interesting to compare them to yours. For our reference it's here.
Closing then, thanks again.
Dear @wwfwwf
Please edit and re-frame as a single question instead of writing each line as a new comment.
What error are you getting?
Sorry!
I set the parameters as follows and got this error:
Traceback (most recent call last):
  File "3_train_and_predict.py", line 17, in <module>
    classifier = fasttext.supervised("fasttext_train-1120-v4.txt", "fasttext-1122-v4.model",dim=230,min_count=2,neg=3,word_ngrams=2,lr=0.99,bucket=5000000,thread=25,ws=20,epoch=20,loss=hs,label_prefix="__label__")
NameError: name 'hs' is not defined
@a11apurva Does that make sense? I mean, is it because I used the wrong parameter (loss=hs)? If so, how should I set it?
Use loss='hs'. The problem is that Python is reading hs as a variable name in your case, and no variable with that name is defined.
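The difference can be reproduced without fastText. In this minimal sketch, `train` is a hypothetical stand-in for the training call (it is not the fastText API, and it only validates the loss name): passing the bare name `hs` fails before the function is even called, while the string `"hs"` works.

```python
def train(loss="softmax"):
    """Hypothetical stand-in for a training call; it only checks the loss name."""
    if loss not in {"ns", "hs", "softmax"}:
        raise ValueError(f"unsupported loss: {loss!r}")
    return loss

try:
    # hs without quotes is looked up as a variable, which does not exist.
    train(loss=hs)
    outcome = "no error"
except NameError as err:
    outcome = str(err)

# With quotes, hs is an ordinary string argument.
fixed = train(loss="hs")

print(outcome)
print(fixed)
```

The same reasoning applies to the call in the traceback above: writing `loss="hs"` there passes the option as a string.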