Hi,
I am aware that accuracy is computed without taking into account true negatives (tn) as per issue #483 . However, the following is the output produced for a 3 class sentiment classification task (N|P|NEU, all examples are predicted and assigned exactly one class):
Testing using best model ...
loading file flair-exps/eu/models100/best-model.pt
MICRO_AVG: acc 0.5221 - f1-score 0.6861
MACRO_AVG: acc 0.5197 - f1-score 0.6837
N tp: 203 - fp: 108 - fn: 101 - tn: 757 - precision: 0.6527 - recall: 0.6678 - accuracy: 0.4927 - f1-score: 0.6602
NEU tp: 300 - fp: 129 - fn: 147 - tn: 593 - precision: 0.6993 - recall: 0.6711 - accuracy: 0.5208 - f1-score: 0.6849
P tp: 299 - fp: 130 - fn: 119 - tn: 621 - precision: 0.6970 - recall: 0.7153 - accuracy: 0.5456 - f1-score: 0.7060
It seems to me that the micro-averaged accuracy is not correctly computed in this case. I would expect that MICRO AVG acc to be equal to MICRO AVG f1-score. In fact, if we compute accuracy (correct predictions/total predictions) with the above numbers (203+300+299)/ 1169 = 0.6861.
Shouldn't it be MICRO_AVG: acc 0.6861 - f1-score 0.6861 instead of MICRO_AVG: acc 0.5221 - f1-score 0.6861 ?
I think the problem is in training_utils.py#L120, because when calling self.accuracy(None), the total number predictions are computed as the sum of tps,fps and fns, which is not the actual number of samples in the test set.
Again I'm in a multiclass single-label text classification scenario. I haven't tested other tasks.
In any case, thanks for the great work!
Hello @isanvicente thanks for reporting this - we'll take a closer look!
I found a similar problem, the way of calculating accuracy in the metrics.py should be (tp+tn)/(tp+tn+fp+fn)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Most helpful comment
Hello @isanvicente thanks for reporting this - we'll take a closer look!