Mask_rcnn: How to calculate F1 score in Mask RCNN?

Created on 7 May 2020 · 13 comments · Source: matterport/Mask_RCNN

I customized the "https://github.com/matterport/Mask_RCNN.git" repository to train with my own dataset. Now I am evaluating my results: I can calculate the mAP, but I cannot calculate the F1-score.
I have the function compute_ap, from "https://github.com/matterport/Mask_RCNN/blob/master/mrcnn/utils.py", which returns "mAP, precisions, recalls, overlaps" for each image. The point is that I can't apply the F1-score formula directly, because the variables "precisions" and "recalls" are lists.

import numpy as np
from mrcnn.utils import compute_matches  # defined alongside compute_ap in mrcnn/utils.py

def compute_ap(gt_boxes, gt_class_ids, gt_masks,
               pred_boxes, pred_class_ids, pred_scores, pred_masks,
               iou_threshold=0.5):

    # Get matches and overlaps
    gt_match, pred_match, overlaps = compute_matches(
        gt_boxes, gt_class_ids, gt_masks,
        pred_boxes, pred_class_ids, pred_scores, pred_masks,
        iou_threshold)

    # Compute precision and recall at each prediction box step
    precisions = np.cumsum(pred_match > -1) / (np.arange(len(pred_match)) + 1)
    recalls = np.cumsum(pred_match > -1).astype(np.float32) / len(gt_match)

    # Pad with start and end values to simplify the math
    precisions = np.concatenate([[0], precisions, [0]])
    recalls = np.concatenate([[0], recalls, [1]])

    # Ensure precision values decrease but don't increase. This way, the
    # precision value at each recall threshold is the maximum it can be
    # for all following recall thresholds, as specified by the VOC paper.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = np.maximum(precisions[i], precisions[i + 1])

    # Compute mean AP over recall range
    indices = np.where(recalls[:-1] != recalls[1:])[0] + 1
    mAP = np.sum((recalls[indices] - recalls[indices - 1]) *
                 precisions[indices])

    return mAP, precisions, recalls, overlaps

All 13 comments

Hi there. As you said, precisions and recalls are lists holding the precision and recall values at each detection. So if you take the mean of those lists, you get the precision and the recall of the image. And there you go, you can apply the formula.
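
A minimal sketch of that suggestion (a sketch only; gt_* and r are assumed to come from load_image_gt and model.detect, as in the code later in this thread):

import numpy as np
from mrcnn.utils import compute_ap

# Per-image precision/recall from the arrays compute_ap returns.
# Note: these arrays include the padding values that compute_ap adds
# at the start and end (see the function body above), so the plain
# mean is only a rough per-image summary.
AP, precisions, recalls, overlaps = compute_ap(
    gt_bbox, gt_class_id, gt_mask,
    r["rois"], r["class_ids"], r["scores"], r["masks"])

precision = np.mean(precisions)
recall = np.mean(recalls)
f1 = 2 * precision * recall / (precision + recall)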

@suchiz Thank you so much for your answer, it helped me a lot. I still have another question; could you help me? For each image I call the compute_ap function and average the precisions and recalls it returns, as you said. Then I apply the F1-score formula, F1 = (2 * precision * recall) / (precision + recall), append the result to a list of F1-scores, and return it, as in the following code:

def evaluate_model(dataset, model, cfg):
    # mean/expand_dims come from numpy; load_image_gt/mold_image from mrcnn.model
    APs = list()
    F1_scores = list()
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'])
        F1_scores.append((2 * mean(precisions) * mean(recalls)) / (mean(precisions) + mean(recalls)))
        APs.append(AP)
    # average AP over all images, after the loop
    mAP = mean(APs)
    return mAP, F1_scores

This way it is as if I were calculating the F1-score for each image; is this correct? If it is, should I average the list to get a single F1-score? The list of F1-scores contains some values like 0.0, so when I average it, the result is lower than what I get from the mAP calculation. Do you know what that means?

mAP: 0.745
list returns of F1-Score: [0.0, 0.6357388316151203, 0.6818181818181818, 0.5555555555555556, 0.6666666666666666, 0.6818181818181818, 0.6634146341463415, 0.6818181818181818, 0.2401656314699793, 0.6666666666666666, 0.6818181818181818, 0.7162790697674419, 0.7211538461538461, 0.35294117786273826, 0.6666666666666666, 0.46896552133786973, 0.44573643908229665, 0.5555555555555556, 0.5555555555555556, 0.360623784466973, 0.5939086386397096, 0.587196475462119, 0.6666666666666666, 0.0, 0.6172839577714765, 0.7128666035950805, 0.0, 0.30600339339801114, 0.6153846153846154, 0.57794677211439, 0.6250000058207661, 0.6666666666666666, 0.579710149857032, 0.5555555555555556, 0.6990504081238134, 0.6818181818181818, 0.6818181818181818, 0.6818181818181818, 0.6666666666666666, 0.6666666666666666, 0.6818181818181818, 0.6666666666666666, 0.6666666666666666, 0.6634146341463415, 0.5714285714285715, 0.5714285714285715, 0.6666666666666666, 0.5555555555555556, 0.6666666666666666, 0.4861878488571127, 0.6, 0.6818181818181818, 0.6172839577714765, 0.5526315789473685, 0.587196475462119, 0.6862745098039216, 0.6363636439065795, 0.5333333357175192, 0.7218309896422613, 0.6237623762376238, 0.5459770136263651, 0.5459770136263651, 0.5649350741823174, 0.6163697189477441, 0.5677419409419445, 0.6575689730129668, 0.6734693927821493, 0.7162790697674419, 0.6666666666666666, 0.6862745098039216, 0.7218309896422613, 0.5166473028065978, 0.5348210949559574, 0.6301969424548594, 0.7130970724191062, 0.5279192605168836, 0.6726799842059843, 0.6470588297683062, 0.6665870684399651, 0.6031079438636168, 0.6857142896068339, 0.6363636439065795, 0.684782610075911]

mean F1-score: 0.5915668385893322

I would be grateful if you could help me; this is an important piece of work in my academic life, so please help me!!


Hi there. Yes, it is correct: averaging the per-image scores gives the score for the whole dataset. Those are two different metrics. mAP is, as the name says, an average of precision, whereas the F1-score is a tradeoff between precision and recall. So having a mAP larger than the F1-score means (to me at least; I'm not an expert either, and please someone correct me if I'm wrong) that your model is better at correctly classifying the objects it detects than at detecting all objects (basically, low recall).
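
To make that tradeoff concrete, here is a toy calculation with invented numbers (purely illustrative):

precision, recall = 0.90, 0.55  # strong classification, weak coverage
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.68, well below the 0.90 precision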

@suchiz Thank you so much for your answer!

Hi @suchiz, it's me again. Could you help me just one more time? I swear it's the last time (laughs).

I thought of another way of calculating the F1-score that gives me different results from before. Could you check whether I did it correctly?

Here is my code:

def evaluate_model(dataset, model, cfg):
    APs = list()
    ARs = list()
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, _, _, _ = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'])
        AR, _ = compute_recall(r["rois"], gt_bbox, iou=0.5)
        ARs.append(AR)
        APs.append(AP)
    # calculate the mean AP and mean AR across all images
    mAP = mean(APs)
    mAR = mean(ARs)
    return mAP, mAR


test_mAP, mARs_test = evaluate_model(test_set, model, cfg)

f_score_test = (2 * test_mAP * mARs_test)/(test_mAP + mARs_test)

print('f1-score-test', f_score_test)

With:

mAP: 0.745
mAR: 0.7763338972233372

I get this output, and I don't notice anything strange in the mAR values:

f1-score-test: 0.7603093578003618

To calculate this I am assuming that the compute_recall function returns the Average Recall (AR) of the image; is this correct?

I would be very grateful if you could help me again; in any case, thank you! It is very difficult to fully understand code that does not have good documentation, so it is very good to know that there are people willing to contribute to spreading knowledge.

Hi, can anyone help me out with the mAP and F1 calculation? Is mAP calculated the same way @WillianaLeite said F1 is, by taking an average over all images? When I take an average of compute_ap's output over all the image_ids, I get a NaN value. Thanks in advance.

Hi @tanmayj000, to calculate the mAP I am following this tutorial: https://machinelearningmastery.com/how-to-train-an-object-detection-model-with-keras/. I did exactly the same because I am training with only one class, so I take the average of all the Average Precision values found. About the NaN value, sorry, but I have no idea why you are getting it; I am new to this area of object detection.

Hey @WillianaLeite, how and where did you call the compute_ap function in order to calculate mAP?

Hi @Nikk-27, after training I create a dataset and a class to configure the network, as shown in the following code. As I mentioned in the question, I have doubts about whether I am calculating the F1-score correctly. I presented two ways to calculate it, and they produce very different results: the first is the one @suchiz suggested (who is also not sure), and the second uses the compute_recall function from utils.py. One of the two is correct; I just don't know which, so I came to ask this community for help.

This is my complete code, with two ways to calculate the f1-score:

from os import listdir
from xml.etree import ElementTree
from numpy import zeros
from numpy import asarray
from numpy import expand_dims
from numpy import mean
from mrcnn.config import Config
from mrcnn.model import MaskRCNN
from mrcnn.utils import Dataset
from mrcnn.utils import compute_ap, compute_recall
from mrcnn.model import load_image_gt
from mrcnn.model import mold_image



class PredictionConfig(Config):
    NAME = "acerola_cfg"
    NUM_CLASSES = 1 + 1
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1


def evaluate_model(dataset, model, cfg):
    APs = list()
    ARs = list()
    F1_scores = list()
    for image_id in dataset.image_ids:
        image, image_meta, gt_class_id, gt_bbox, gt_mask = load_image_gt(dataset, cfg, image_id, use_mini_mask=False)
        scaled_image = mold_image(image, cfg)
        sample = expand_dims(scaled_image, 0)
        yhat = model.detect(sample, verbose=0)
        r = yhat[0]
        AP, precisions, recalls, overlaps = compute_ap(gt_bbox, gt_class_id, gt_mask, r["rois"], r["class_ids"], r["scores"], r['masks'])
        AR, positive_ids = compute_recall(r["rois"], gt_bbox, iou=0.2)
        ARs.append(AR)
        F1_scores.append((2 * mean(precisions) * mean(recalls)) / (mean(precisions) + mean(recalls)))
        APs.append(AP)
    # average over all images, after the loop
    mAP = mean(APs)
    mAR = mean(ARs)
    return mAP, mAR, F1_scores


# AcerolaDataset is my custom mrcnn.utils.Dataset subclass (defined elsewhere)
test_set = AcerolaDataset()
test_set.load_dataset('path', is_train=False)
test_set.prepare()

cfg = PredictionConfig()
model = MaskRCNN(mode='inference', model_dir='/content/drive/My Drive/TCC-Mask-RCNN/', config=cfg)
model.load_weights('/content/drive/My Drive/TCC-Mask-RCNN/config-sem-flores-modificado/acerola_cfg20200523T2056/mask_rcnn_acerola_cfg_0056.h5', by_name=True)
mAP, mAR, F1_score = evaluate_model(test_set, model, cfg)
print("mAP: %.3f" % mAP)
print("mAR: %.3f" % mAR)
print("first way calculate f1-score: ", F1_score)

F1_score_2 = (2 * mAP * mAR)/(mAP + mAR)
print('second way calculate f1-score_2: ', F1_score_2)

Returns:

mAP: 0.934
mAR: 0.942
first way calculate f1-score: 0.66
second way calculate f1-score_2: 0.938

The first way is the one @suchiz suggested: apply the F1-score formula, (2 * precision * recall) / (precision + recall), to the results of the compute_ap function, which returns, in addition to the Average Precision (AP), a list of precisions and a list of recalls. For each image I average those lists, and at the end I average all the APs to get the mean Average Precision (mAP). I use a simple average because my problem has only one class; if your problem has multiple classes, you should calculate it per class.

The second way to calculate the F1-score assumes that the compute_recall function in utils.py returns the Average Recall (AR). I do the same thing as with the AP and at the end compute the mean Average Recall (mAR), then calculate the F1-score from mAP and mAR. Of course, this approach can be totally wrong if compute_recall does not actually return Average Recall (AR).

Hope this helps!

@WillianaLeite Sorry, it was my week off. Unfortunately for you, I think compute_recall calculates plain recall only, not AR :/. And as you may already know, recall and average recall are two different things. But, as I said, I'm not an expert either, so you should double-check this answer.

https://stats.stackexchange.com/questions/353116/average-precision-vs-precision

Here is a well-explained difference between AP and P; the same goes for AR and R.
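
For what it's worth, one common definition of Average Recall (e.g. in the COCO metrics) is recall averaged over a range of IoU thresholds. A hedged sketch built on the repo's compute_recall, which returns plain recall at a single IoU; the helper name compute_average_recall is mine, not part of the repo:

import numpy as np
from mrcnn.utils import compute_recall

def compute_average_recall(pred_boxes, gt_boxes,
                           iou_thresholds=np.arange(0.5, 1.0, 0.05)):
    # Average the single-IoU recall over several thresholds (COCO-style AR).
    recalls = [compute_recall(pred_boxes, gt_boxes, iou=t)[0]
               for t in iou_thresholds]
    return np.mean(recalls)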


Hi @WillianaLeite, I have a similar issue: my model also predicts a single object class, and mAP is 0.933, while the F1-score is 0.641 via your first method but 0.936 via the second.

Double-checking the test results, the predicted boxes overlap well with the ground-truth boxes for most of the images. Therefore, I believe the mAP of 0.933 matches that observation, and I would expect high precisions, recalls, and F1-scores. However, method 1 gives only around 0.6, which confuses me a lot.

Please share your experience if you have more thoughts on it. Thanks.
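
One possible explanation, judging only from the compute_ap source quoted at the top of this thread (an observation, not a verified fix): the precisions and recalls arrays it returns are padded with sentinel values (a 0 at each end of precisions; a 0 and a 1 around recalls), and the precisions are made monotonically decreasing for the AP integration, so taking a plain mean over them is biased. Stripping the pads before averaging may give more sensible per-image numbers:

import numpy as np

# precisions/recalls as returned by compute_ap; drop the padding entries
precision = np.mean(precisions[1:-1])
recall = np.mean(recalls[1:-1])
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0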


Hello, can you let me know how you called the compute_ap function for each image? I need to calculate the precision and recall values, so where should I define this and what parameters should I pass? Can you please help me?

Hi everyone, as far as I have checked the code of the compute_ap and compute_recall functions, there is an obvious difference between them: to compute IoU, compute_ap uses masks whereas compute_recall uses bounding boxes. So I think it is a little strange to combine recalls calculated by compute_recall with precisions calculated by compute_ap when calculating the F1-score. Hence, it is better to use only the compute_ap function. However, I am also a beginner in machine learning; please someone correct me if I am wrong.
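
Building on that observation, here is a hedged sketch of a per-image precision/recall/F1 computed directly from the match arrays that compute_ap itself uses, so the IoU is mask-based throughout (the function name compute_f1 is mine, not part of the repo):

import numpy as np
from mrcnn.utils import compute_matches

def compute_f1(gt_boxes, gt_class_ids, gt_masks,
               pred_boxes, pred_class_ids, pred_scores, pred_masks,
               iou_threshold=0.5):
    # Match predictions to ground truth with the same mask-IoU logic
    # that compute_ap uses internally.
    gt_match, pred_match, _ = compute_matches(
        gt_boxes, gt_class_ids, gt_masks,
        pred_boxes, pred_class_ids, pred_scores, pred_masks,
        iou_threshold)
    tp = np.sum(pred_match > -1)  # predictions matched to a GT instance
    precision = tp / len(pred_match) if len(pred_match) > 0 else 0.0
    recall = tp / len(gt_match) if len(gt_match) > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)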
