Flair: TextClassifier : reported metrics after training always report precision=1.0

Created on 9 Jul 2020 · 7Comments · Source: flairNLP/flair

Describe the bug
The reported metrics after training always report precision=1.0.

To Reproduce

Training code:

from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('/tmp/taggers/trec',
              learning_rate=3e-5, # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
              max_epochs=5, # terminate after 5 epochs
              )
 ```

Example of produced report:
```text
2020-07-09 09:50:21,395 Testing using best model ...
2020-07-09 09:50:21,395 loading file /tmp/taggers/trec/best-model.pt
2020-07-09 09:50:27,486         0.964
2020-07-09 09:50:27,487 
Results:
- F-score (micro) 0.9823
- F-score (macro) 0.9745
- Accuracy 0.964

By class:
              precision    recall  f1-score   support

        DESC     1.0000    0.9931    0.9965       145
        ENTY     1.0000    0.8750    0.9333        96
        ABBR     1.0000    0.8889    0.9412         9
         HUM     1.0000    0.9851    0.9925        67
         NUM     1.0000    0.9915    0.9957       117
         LOC     1.0000    0.9762    0.9880        84

   micro avg     1.0000    0.9653    0.9823       518
   macro avg     1.0000    0.9516    0.9745       518
weighted avg     1.0000    0.9653    0.9818       518
 samples avg     1.0000    0.9820    0.9880       518

2020-07-09 09:50:27,487 ----------------------------------------------------------------------------------------------------

Expected behavior
Reports correct metrics.

Screenshots
N/A

Environment (please complete the following information):

OS [e.g. iOS, Linux]: CentOS
Version: flair 0.5.1, scikit-learn==0.23.1

Additional context
Same problem with other datasets.

bug

Source

GuillaumeDD

👍2

Most helpful comment

From the master it looks like better:

2020-07-09 12:46:46,615 ----------------------------------------------------------------------------------------------------
2020-07-09 12:46:46,615 Testing using best model ...
2020-07-09 12:46:46,616 loading file /tmp/taggers/trec/best-model.pt
2020-07-09 12:46:51,939         0.97
2020-07-09 12:46:51,939 
Results:
- F-score (micro) 0.97
- F-score (macro) 0.9665
- Accuracy 0.97

By class:
              precision    recall  f1-score   support

        DESC     0.9384    0.9928    0.9648       138
        ENTY     0.9882    0.8936    0.9385        94
        ABBR     1.0000    0.8889    0.9412         9
         HUM     0.9846    0.9846    0.9846        65
         NUM     0.9739    0.9912    0.9825       113
         LOC     0.9877    0.9877    0.9877        81

   micro avg     0.9700    0.9700    0.9700       500
   macro avg     0.9788    0.9564    0.9665       500
weighted avg     0.9709    0.9700    0.9697       500
 samples avg     0.9700    0.9700    0.9700       500
2020-07-09 12:46:51,939 ----------------------------------------------------------------------------------------------------

GuillaumeDD on 9 Jul 2020

👍2

All 7 comments

Thanks for reporting this! This seems to be an error in the evaluation routine that occurs if no label_type is passed to the model. Can you run the above code with

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type='question_type')

I will put in a PR shortly that fixes this if no label type is passed.

(Edit: label_type instead of label_name)

alanakbik on 9 Jul 2020

This does not seem to solve the problem.

Here is what I have tested following the suggested code:

from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict, label_type='question_type')

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('/tmp/taggers/trec',
              learning_rate=3e-5, # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
              max_epochs=5, # terminate after 5 epochs
              )

Results:

2020-07-09 11:45:16,849 ----------------------------------------------------------------------------------------------------
2020-07-09 11:45:16,850 Testing using best model ...
2020-07-09 11:45:16,850 loading file /tmp/taggers/trec/best-model.pt
2020-07-09 11:45:22,217         0.964
2020-07-09 11:45:22,218 
Results:
- F-score (micro) 0.9823
- F-score (macro) 0.9845
- Accuracy 0.964

By class:
              precision    recall  f1-score   support

        ENTY     1.0000    0.8947    0.9444        95
        DESC     1.0000    0.9653    0.9823       144
        ABBR     1.0000    1.0000    1.0000        10
         HUM     1.0000    0.9851    0.9925        67
         NUM     1.0000    1.0000    1.0000       120
         LOC     1.0000    0.9756    0.9877        82

   micro avg     1.0000    0.9653    0.9823       518
   macro avg     1.0000    0.9701    0.9845       518
weighted avg     1.0000    0.9653    0.9820       518
 samples avg     1.0000    0.9820    0.9880       518

2020-07-09 11:45:22,218 ----------------------------------------------------------------------------------------------------

GuillaumeDD on 9 Jul 2020

Argh, you're right. I just pushed a PR that I believe fixes this. Could you try installing from master?

pip install --upgrade git+https://github.com/flairNLP/flair.git

alanakbik on 9 Jul 2020

🚀1 👍1

From the master it looks like better:

2020-07-09 12:46:46,615 ----------------------------------------------------------------------------------------------------
2020-07-09 12:46:46,615 Testing using best model ...
2020-07-09 12:46:46,616 loading file /tmp/taggers/trec/best-model.pt
2020-07-09 12:46:51,939         0.97
2020-07-09 12:46:51,939 
Results:
- F-score (micro) 0.97
- F-score (macro) 0.9665
- Accuracy 0.97

By class:
              precision    recall  f1-score   support

        DESC     0.9384    0.9928    0.9648       138
        ENTY     0.9882    0.8936    0.9385        94
        ABBR     1.0000    0.8889    0.9412         9
         HUM     0.9846    0.9846    0.9846        65
         NUM     0.9739    0.9912    0.9825       113
         LOC     0.9877    0.9877    0.9877        81

   micro avg     0.9700    0.9700    0.9700       500
   macro avg     0.9788    0.9564    0.9665       500
weighted avg     0.9709    0.9700    0.9697       500
 samples avg     0.9700    0.9700    0.9700       500
2020-07-09 12:46:51,939 ----------------------------------------------------------------------------------------------------

GuillaumeDD on 9 Jul 2020

👍2

I have also encountered this problem
By class:
precision recall f1-score support

       4     1.0000    0.4470    0.6179      1322
       1     1.0000    0.5561    0.7148      3064
       3     1.0000    0.5726    0.7282      2513
       2     1.0000    0.3269    0.4927      1661
       5     1.0000    0.7080    0.8290      2325
       0     1.0000    0.8719    0.9316      4677

micro avg 1.0000 0.6427 0.7825 15562
macro avg 1.0000 0.5804 0.7190 15562
weighted avg 1.0000 0.6427 0.7672 15562
samples avg 1.0000 0.7220 0.8147 15562

and I have tried to input:
pip install --upgrade git+https://github.com/flairNLP/flair.git

but here the error comes out:
ERROR: Command errored out with exit status 128: git clone -q https://github.com/flairNLP/flair.git /tmp/pip-req-build-f1vp5jsw Check the logs for full command output.

can you help me @alanakbik