Flair: TextClassifier label predictions always have score 1.0

Created on 9 Mar 2019 · 11 Comments · Source: flairNLP/flair

Hi guys,

I have trained my own TextClassifier following this tutorial: https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_7_TRAINING_A_MODEL.md

I am using the option multi_label=False, as each sentence should be assigned only one label. In terms of embeddings, I use FlairEmbeddings mix-forward and mix-backward.

The issue is that every time I predict the label of a new unseen sentence, I get a label score of 1.0. The score never takes any other value between 0.0 and 1.0.

Is this the expected behavior? What am I doing wrong?

Thanks!

bug

algomaks

All 11 comments

Hello @algomax that should not happen. Which dataset are you using? Does the model train OK, i.e. does it reach good F-scores?

alanakbik on 12 Mar 2019

Hi @alanakbik,

I have a relatively small dataset, around 3000 examples in total. The examples fall into around 30 different classes, so about 100 examples per class. The examples are short sentences, the kind of thing you would typically say in a chat conversation (e.g., "how are you today" or "cancel this task please"). Something specific about the dataset is that many of the examples within one category are quite similar, usually differing by only a few words.

When I train the classification model, I get pretty good micro and macro F1 scores - above 90%. However, I face two main problems when I apply the model to new unseen examples:

1. Everything is classified with score 1.0. Always.
2. When I try to predict an example which is clearly not in any of the 30 classes, the model predicts one of the 30 classes with confidence 1.0.

It seems that the model does not learn a good representation of the data. Could it be due to overfitting? How could I train the model so that examples which do not fall in any class are not assigned a label?

Thanks!

algomaks on 18 Mar 2019

Hi @algomax, a score of 1.0 every time definitely sounds like a bug, so this is something we need to check out.

Is your training data always labeled, i.e. are there any examples that do not belong to any of the 30 classes? If you do not have such examples, the model will believe that each item must belong to one of the 30 classes - simply because in the training data it has never seen otherwise. So this explains at least why a class is always predicted, but not why confidence is always 1.0.

I'll add a bug label to this issue - we'll take a closer look.

alanakbik on 18 Mar 2019

Hi @alanakbik,

Do we have an update on this? I am facing a similar issue.

Thanks!

ashahzada on 5 Apr 2019

Hi @ashahzada,

As far as I know, there is no update so far on this issue. It seems to be a bug or we are somehow training the model incorrectly. Until this issue is resolved, I decided to use a different text classification tool. I hope this helps. Regards!

algomaks on 7 Apr 2019

Thank you @algomax !

ashahzada on 8 Apr 2019

Hello @ashahzada @algomax - we haven't looked into the error yet, but I'll be back in the office next week (currently travelling) and will take a look.

alanakbik on 9 Apr 2019

👍2

I found certain clues in the code that might be causing the issue:

The first clue is that this line in the predict method of the TextClassifier class:

scores = self.forward(batch)

in my case generates a tensor like this:

tensor([[ 2.2129,  0.2944,  5.3489, -0.7817,  1.1176, -0.6716, -0.5714, -2.0608,
          3.6055,  3.1275,  1.4636,  4.2829,  1.8713, -2.6787, -0.4784,  0.6990,
          0.5859, -1.5066, -3.6451, -1.7429, -4.3391,  2.3591, -1.4598, -0.2570,
         -1.7566,  0.0564, -5.3699, -5.1569,  2.4890, -2.1881,  4.4994,  0.4503,
          3.9420, -4.4185,  1.2770,  3.5883,  5.4932, -0.7858, -4.2145,  2.3449,
         -2.8780]])

Please notice that the scores in the tensor range from roughly -6.0 to +6.0, so they are not probabilities.

The second clue is that the score setter of the Label class checks whether the score is within the range [0, 1] and, if not, sets it to 1.0. This is exactly why we always get predictions with score 1.0. For example, when the max score of 5.4932 is selected, it will be reduced to 1.0 by the setter.

@score.setter
def score(self, score):
    if 0.0 <= score <= 1.0:
        self._score = score
    else:
        self._score = 1.0
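
The clamping described above can be reproduced in isolation. The following is only a minimal sketch: the `Label` class here is a hypothetical stand-in written for illustration, not Flair's actual class, and only the setter logic is copied from the snippet quoted above.

```python
class Label:
    """Stand-in for illustration only; not Flair's actual Label class."""

    def __init__(self, value, score=1.0):
        self.value = value
        self.score = score  # routed through the setter below

    @property
    def score(self):
        return self._score

    @score.setter
    def score(self, score):
        # Same logic as the quoted setter: any value outside [0, 1]
        # is silently replaced by 1.0.
        if 0.0 <= score <= 1.0:
            self._score = score
        else:
            self._score = 1.0


# 5.3489 is one of the raw scores from the tensor above (not a probability).
clamped = Label("some_class", 5.3489)
negative = Label("some_class", -2.0608)
in_range = Label("some_class", 0.73)
print(clamped.score, negative.score, in_range.score)
```

Any raw score outside [0, 1], positive or negative, ends up as 1.0, while a value already inside the range is stored unchanged.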

The next step probably is to figure out why certain scores are estimated outside the [0, 1] range. Any thoughts?

algomaks on 14 Apr 2019

Thanks! Looks like there's a softmax missing - I'll put in a PR!
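
For readers unfamiliar with the term, here is a minimal sketch of what a softmax does to raw scores like the ones posted above. It is written in plain Python rather than PyTorch, the helper name `softmax` is illustrative, and the actual change made in Flair may look different.

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First eight raw scores from the tensor posted above.
logits = [2.2129, 0.2944, 5.3489, -0.7817, 1.1176, -0.6716, -0.5714, -2.0608]
probs = softmax(logits)

# Index of the most probable class and its probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(best, round(probs[best], 4))
```

After the softmax every value lies strictly between 0 and 1, so the setter no longer has anything to clamp. Only eight of the posted scores are used here, so the probabilities differ from what the full tensor would give. In PyTorch the same normalization is available as `torch.nn.functional.softmax(scores, dim=1)`.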

alanakbik on 15 Apr 2019

👍1

Added the fix to master - should work now but let me know if it doesn't!

alanakbik on 16 Apr 2019

I did some tests and it seems to work very well. Thanks once again! :)

algomaks on 21 Apr 2019

🎉2
