fastText: Interpreting Multilabel Output

Created on 30 Jun 2017 · 17 Comments · Source: facebookresearch/fastText

So I loaded multilabel values for my targets. But when I use the predict_prob function, the output looks more like a conditional probability distribution than multilabel output.

I was assuming that each label would independently get a value between 0 and 1, but instead I am seeing that the values across all labels add up to 1.

Can someone help me understand this output?

All 17 comments

Output like this is the result of the softmax function. My guess is the only way to get the output you want is to build a classifier for each label. So "Hotdog/NotHotdog" over and over.
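To illustrate the point about softmax, here is a minimal sketch using only the Python standard library. The three raw scores are made up for illustration and do not come from any real fastText model.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution that sums to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three labels; the first two are both strong.
scores = [2.0, 1.9, -3.0]
probs = softmax(scores)

print([round(p, 3) for p in probs])
print(round(sum(probs), 6))  # the probabilities always add up to 1
```

Even though the first two labels both have high scores, they have to share the probability mass, so neither can get close to 1.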

It seems like fastText accepts multilabel input but does not produce multilabel output. I am reading this from the fastText README. If I have 15 classes and their probabilities all add up to 1, how can you tell whether the multilabel prediction you made for one target is correct? I know the labels are ordered from most to least likely, but how do you tell when everything adds up to one?

Hello @iymitchell,

fastText's predict subcommand (and more specifically predict-prob) allows you to print the k most likely labels plus their probabilities. Indeed, we're predicting a distribution over all labels, so you can expect them to add up to one. The most likely label has the highest probability.

To check whether a prediction is correct, you'll need ground-truth labels. You can then take the most likely prediction and compare it against those.

I think I might not have understood your question correctly. Please help me understand, if this is the case.

Thanks,
Christian

@iymitchell Yeah, it does not produce the output you would expect given that it accepts multilabel input. Given multilabel input, you might expect to see a predicted probability for each label, independent of all other labels. This is a limitation (again, a result of the softmax), which is why I mentioned above that you may want to create one model for each category you have. Then you would be able to accept only predictions greater than some threshold (say 75%). Right now, if you had a document that truly belonged to 50 different categories, your best prediction might be as low as 2%, which is not very useful for multilabel prediction.
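The 2% figure follows from simple arithmetic. Here is a small sketch, assuming the model spreads the probability mass equally over all 50 true labels; the function name is hypothetical.

```python
def equal_share(num_true_labels, total_mass=1.0):
    """Probability each label receives if the mass is shared equally."""
    return total_mass / num_true_labels

THRESHOLD = 0.75  # the example acceptance threshold from the comment above

share = equal_share(50)
print(share)               # each of the 50 correct labels gets 2% of the mass
print(share >= THRESHOLD)  # so none of them would pass a 75% threshold
```

With probabilities that sum to 1, at most one label can ever exceed 75%, no matter how many labels are actually correct.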

How does fastText process multilabel input? Does it just split the labels across many lines, so that each line has a single label? @adam2326

@cpuhrsch
Would it be difficult to add a loss other than softmax so we can have independent probabilities per label?

fasttext's predict subcommand (and more specifically predict-prob) allows you to print the k most likely labels plus their probabilities.

Aha, predict-prob, that was what I was looking for!

I think it's also worth noting that a multilabel model has no way of telling you that none of the labels are appropriate for a certain document.
Say we have three labels: mathematics, biology, computing. Some of our ground truth data has one of the labels, some has two, some has all three, and some has none of them. At prediction time, the probability mass of 1.0 will always be distributed across the three labels, but a document about art should probably not have any of the three labels. Prediction @k does not help here, because we would have to know the correct k for each document.
Training separate binary models (mathematics vs. non-mathematics, biology vs. non-biology and computing vs. non-computing) is a way to deal with this, as suggested by @adam2326.

@dietmar What you may want is a sigmoid + cross entropy loss per label.
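As a rough illustration of what "sigmoid + cross entropy loss per label" means, here is a standard-library sketch. It is not fastText's implementation; the scores and targets are invented for the example.

```python
import math

def sigmoid(x):
    """Squash one raw score into (0, 1), independently of other labels."""
    return 1.0 / (1.0 + math.exp(-x))

def per_label_bce(scores, targets, eps=1e-12):
    """Binary cross-entropy computed separately for every label."""
    losses = []
    for s, t in zip(scores, targets):
        p = min(max(sigmoid(s), eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return losses

scores = [2.0, 1.9, -3.0]  # hypothetical raw scores for three labels
targets = [1, 1, 0]        # the first two labels apply, the third does not

probs = [sigmoid(s) for s in scores]
losses = per_label_bce(scores, targets)
print([round(p, 3) for p in probs])  # each value lies in (0, 1) on its own
print(round(sum(probs), 3))          # the sum is not constrained to 1
```

Because each label gets its own sigmoid, two labels can both score close to 1, and all labels can score close to 0.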

@dietmar Following your answer and @adam2326's suggestion, let's assume we have a multilabel problem such as classifying tweets into 10 moods: Happy, Aggressive, Energetic, Rowdy, Chillout, Sprightly, Gloomy, Effervescent, Bright, Atmospheric.
So I have to build a binary model for each of them, k vs. ^k, like happy vs. non_happy. Assuming we have a labeled dataset of 1M tweets uniformly distributed across these 10 labels, do I have to split the tweets dataset by label?
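One possible way to prepare data for per-label binary models is to relabel every line once per label rather than splitting the dataset. The sketch below assumes the usual fastText training format with the default `__label__` prefix; the helper names and the `non_` naming are hypothetical.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix

def parse_line(line):
    """Split a training line into its set of labels and its text."""
    tokens = line.split()
    labels = {t[len(LABEL_PREFIX):] for t in tokens if t.startswith(LABEL_PREFIX)}
    text = " ".join(t for t in tokens if not t.startswith(LABEL_PREFIX))
    return labels, text

def binary_lines(lines, target):
    """Relabel every line as `target` or `non_<target>` for one binary model."""
    out = []
    for line in lines:
        labels, text = parse_line(line)
        tag = target if target in labels else "non_" + target
        out.append(f"{LABEL_PREFIX}{tag} {text}")
    return out

# Made-up sample lines in multilabel format.
lines = [
    "__label__happy __label__bright what a day",
    "__label__gloomy rain again",
]
happy_lines = binary_lines(lines, "happy")
print(happy_lines)
```

Each binary dataset keeps all the tweets, so with 10 moods this produces 10 training files of the same length as the original.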

@pommedeterresautee Yes! I was looking for the same thing. It looks like the multilabel prediction (or top-k prediction) output by fastText is equivalent to training with a softmax + cross entropy loss. Is there any way to get the sigmoid + cross entropy loss output instead?

Hello @iymitchell,

Thank you again for your post. The following issue appears to be related to this one (and is newer). It also appears to be closer to the topic of the recent comments on this thread. Please consider commenting on it instead and closing this one to deduplicate issues.

https://github.com/facebookresearch/fastText/issues/363

Thanks,
Christian

@cpuhrsch this is issue #262, probably you meant to reference something else.

You're right @dietmar! I meant #363.

For completeness, there were actually four; the other two were https://github.com/facebookresearch/fastText/issues/185 and https://github.com/facebookresearch/fastText/issues/72 :) so I think some of them could be closed to reduce entropy :D

#363 has nothing really in it, so I figured I'd comment here after seeing that comment.

@dietmar I'm not certain that this line of reasoning is completely correct. You'd really want to optimize your confidence threshold anyway, and, to make a long story short, you _are able_ to derive that a data item belongs in no class, even with multi-label classification. All you need to do is add a category "none" to all the data items that belong in no category. I'm happy to hear your comments on what I might be missing...
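The "none" category idea could be applied as a preprocessing step over the training file. A minimal sketch, assuming the default `__label__` prefix; the function name and the label name `none` are illustrative choices.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix

def add_none_label(line, none_label="none"):
    """Prefix an explicit 'none' label to lines that carry no label at all."""
    has_label = any(tok.startswith(LABEL_PREFIX) for tok in line.split())
    if has_label:
        return line
    return f"{LABEL_PREFIX}{none_label} {line}"

print(add_none_label("a painting exhibition opens downtown"))
print(add_none_label("__label__biology cells divide by mitosis"))
```

Documents that match no real category then have somewhere for their probability mass to go, so "none" can win the prediction.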

Hi,
You can now have an independent sigmoid for each label, by using -loss ova or -loss one-vs-all.

More information here.

Thank you very much for your feedback!

Best regards,
Onur
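For readers using the Python bindings, the one-vs-all loss might be used roughly as follows. This is a sketch that assumes the official `fasttext` Python package is installed; the helper names and the training file path are hypothetical, and the import is kept inside the function so the rest of the snippet can be read and run without the package.

```python
def train_ova(train_path):
    """Train a supervised model with an independent sigmoid per label."""
    import fasttext  # assumption: the official fasttext Python package
    return fasttext.train_supervised(input=train_path, loss="ova")

def predict_independent(model, text, threshold=0.5):
    """Return every label whose own probability is at least `threshold`."""
    # k=-1 asks for all labels; threshold filters out the low-scoring ones.
    labels, probs = model.predict(text, k=-1, threshold=threshold)
    return {
        label.replace("__label__", "", 1): float(p)
        for label, p in zip(labels, probs)
    }

# Example usage (needs a real training file, so it is left commented out):
# model = train_ova("tweets.train")  # hypothetical path
# print(predict_independent(model, "what a bright and happy day"))
```

An empty result means no label cleared the threshold, which is one way to express "none of the labels apply" without a separate "none" category.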
