yolov3 🚀 - mAP Computation vs Pycocotools

Thanks for the feedback. I opened Issue #5 about this earlier. Currently only one precision-recall curve is generated per image in test.py, whereas like you say I believe we want one for each class in each image, and then the average of those APs is the mAP for that image.

I can try and make this correction myself, or we could try and use an off-the-shelf solution, though that would require more imports. I was studying this link to learn more: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

glenn-jocher on 7 Sep 2018

After reviewing more examples, I think I can copy the methods in this repo:
https://github.com/rafaelpadilla/Object-Detection-Metrics

There is another main difference: the mAP should be calculated across all images at once, rather than once per image the way it is now. So I'm going to try to fully replace the mAP code with a new one that calculates accumulated TP and FP vectors for each class, then produces 80 precision and recall curves for all the objects in the 5000 validation images at once.

glenn-jocher on 8 Sep 2018

Yeah, the evaluation code in that repo is correct. Looking forward for your updates!

xyutao on 10 Sep 2018

It looks like the original code for AP from recall-precision is fine:
https://github.com/ultralytics/yolov3/blob/c43be7b350cfff8f2423547c6aa0e8d6db07061b/utils/utils.py#L129

So I left it alone and created a new function to call it once per class in commit c43be7b350cfff8f2423547c6aa0e8d6db07061b:
https://github.com/ultralytics/yolov3/blob/c43be7b350cfff8f2423547c6aa0e8d6db07061b/utils/utils.py#L82

    # Find unique classes
    unique_classes = np.unique(np.concatenate((pred_cls, target_cls), 0))

    # Create Precision-Recall curve and compute AP for each class
    ap = []
    for c in unique_classes:
        i = pred_cls == c
        n_gt = sum(target_cls == c)  # Number of ground truth objects

        if sum(i) == 0:
            ap.append(0)
        else:
            # Accumulate FPs and TPs
            fpa = np.cumsum(1 - tp[i])
            tpa = np.cumsum(tp[i])

            # Recall
            recall = tpa / (n_gt + 1e-16)

            # Precision
            precision = tpa / (tpa + fpa)

            # AP from recall-precision curve
            ap.append(compute_ap(recall, precision))

When I re-evaluate mAP it drops from 58.1 to 56.7 with this method however. Darknet reports 57.9. I currently combine true and predicted classes into the list of classes evaluated per image, perhaps I should only be using one or the other. I will have to experiment some more.

glenn-jocher on 10 Sep 2018

Maybe you should perform _per-class rank ordering_ instead of _per-image rank ordering_.

Taking VOC for example, the evaluation code will first produce a per-class prediction list over the whole test-set in the format (image_id, score, x0, y0, x1, y1), like:

0000.jpg 0.98 100 100 200 200 # the 1st instance of image 0000.jpg
0000.jpg 0.51 10 10 1000 1000 # the last instance of image 0000.jpg
0001.jpg 0.78 100 100 200 200 # the 1st instance of image 0001.jpg
0001.jpg 0.05 10 10 1000 1000 # the last instance of image 0001.jpg
...

then perform rank ordering for all instances:
_image-id score_
0000.jpg 0.98
0001.jpg 0.78
0000.jpg 0.51
...
0001.jpg 0.05

then compute and accumulate TPs and FPs ... In this way, the mAP should be higher than per-image rank ordering (and no doubt the authors of yolo said _mAP is screwed up_ xD)

xyutao on 11 Sep 2018

@xyutao I'm updating the mAP code, to both add corrections to the repo mAP calculation, and also to output the COCO JSON file to pass to allow the cocoapi to compute the official mAP.

Before you recommended switching to per class rank ordering from per image rank ordering. Is this still your recommendation? Do you know if this is how COCO computes mAP?

glenn-jocher on 25 Feb 2019

I think the ordering is performed for each class in each image independently. It is performed in the following lines
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L155-L158
which calls
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L260

okanlv on 25 Feb 2019

@xyutao @okanlv I just noticed, the pycocotools demo notebook is selecting a subset of the entire validation set, just as we want for yolov3, since darknet only validates on the 5000 images in 5k.txt. You can see in the notebook that pycocotools states it is running a _per image_ evaluation.

https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb
screenshot 2019-02-25 at 22 55 15

glenn-jocher on 25 Feb 2019

@xyutao ，你搞明白这个问题了吗？按照传统的map计算方法的话，的确应该是按照你说的就是把所有图片中的可能物体都检测出来然后根据confidence进行排序，然后依次选取他们作为正例样本并计算召回率和准确率，最后计算每一个类别的ap。不过我看了这个代码作者的相关回答并结合coco api的详细代码后发现或许计算每一张图片的ap再平均到一起或许才是coco的map计算方法。但我目前没看到相关的正式文档去准确描述coco的map计算方法，最起码找遍中文搜索的目标检测map计算描述后发现都是传统的计算方法

guagen on 14 Mar 2019

👍1

@guagen COCO的API也是用传统方法算的。它先调用evaluateImg函数，对单张图的每个类，计算各检测框和真值的match情况，然后调用accumulate函数，对同一个类的所有图片的match结果进行合并，并按照检测框的得分进行降序，最后再统一计算precision-recall。

evaluateImg函数的返回结果详见：https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L302

对单个类合并所有图片检测框得分的代码：
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363

对检测框得分降序排列的代码：
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L367

xyutao on 14 Mar 2019

@glenn-jocher The per-image evaluation just matches the detections and gt for each category, as shown in the evaluateImg function:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L236

The recall and precision are computed per category, by accumulating the matching and ranking the detection scores for all images of the category. See:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363

xyutao on 14 Mar 2019

@xyutao 好吧，头疼，看来这个代码的作者写错代码了，而且按照他的这种写法，得出的map不等于每一类的单独ap加到一起再除以类别总数

guagen on 14 Mar 2019

@glenn-jocher Here I paste the key code from COCO API for accumulating dets as follows:
(ref: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L354)

In this fraction, k_list stores the category ids, a_list stores the area ranges, m_list stores max detections, i_list stores the image ids. The outside for-loop is per-category iteration, while the inside for-loop is to accumulate the matching results of each image:
E = [self.evalImgs[Nk + Na + i] for i in i_list]
then merge the scores of all detections:
dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])
and rank them in decline order:
inds = np.argsort(-dtScores, kind='mergesort')
The matching results pre-computed at _evaluateImg_ are merged:
dtm = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]
and tps and fps are initialized with the merged matching:
tps = np.logical_and( dtm, np.logical_not(dtIg) )
fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )
...

xyutao on 14 Mar 2019

I think it should be a fairly straightforward change to test.py to calculate the mAP averaged per class rather than averaged per image. I can try and implement this in the next few days. Luckily the new --save-json argument in test.py outputs the official COCO mAP now, so this will make comparison of the updates easier.

We probably want to lower the default --conf-thresh from 0.3 to something like 0.001 as well.

glenn-jocher on 14 Mar 2019

@glenn-jocher
I can't wait to see these changes. Those will make this code look even better. :)
Anyway, thank for your excellent work.

guagen on 15 Mar 2019

All, I created a new https://github.com/ultralytics/yolov3/tree/map_update branch to test mAP updates. I converted from image-averaging to class-averaging. The result is 0.519 mAP now vs 0.550 pycocotools mAP using official yolov3.weights. I'm not sure where the discrepancy lies.

This mainly affects custom data, as for COCO data we can simply use --save-json flag in test.py, which produces an essentially identical 0.550 mAP to the reported 0.553 mAP . One option for custom data would be to attempt to create both predictions and annotations json files, and then to use pycocotools with both of the json files. I tried this, but ran into problem producing the annotations file, it seems more complicated than the targets json produced by --save-json.

Another item is that the previous mAP calculation could operate at a reasonable conf_thres of 0.1-0.5. The new mAP calculation requires an extremely low conf_thres for best results (as is the case with pycocotools mAP as well), around conf_thres=0.001, which takes longer to compute (from 1 min per epoch to 6 minutes on a V100). This produces an extreme number of False Positives (FPs), all requiring NMS (which is currently on CPU). I estimate that for each true detection there are about 15 false positives. It's beyond me how this garbage dump of FPs has been prioritized by the COCO organizers as their metric of choice, but it seems to be the only number anyone cares about.

rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')

Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)

      Image      Total          P          R        mAP
100%|████████████████████████████████████████████████████████████████████████████| 157/157 [06:34<00:00,  1.93s/it]
       5000       5000     0.0865      0.727      0.519

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.550
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.336
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.408
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

glenn-jocher on 26 Mar 2019

UPDATE: difference narrowed down to 0.531 (repo calculation) vs 0.551 (pycocotools). The obj_conf used affects the mAP: whether it is multiplied by class_conf, and if so whether that class_conf is produced by sigmoid or softmax.

rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')

Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 157/157 [07:00<00:00,  2.09s/it]
       5000       5000     0.0865      0.727      0.531

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.551
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.308
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.267
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.407
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

glenn-jocher on 26 Mar 2019

We've made great efforts to align the repo mAP with the COCO mAP. It's not perfect, but the current result seems to steadily track about 2% lower than the COCO mAP as the epochs trend higher. We are going to stop updating the repo mAP code now, merge the development branch with the master and release v4.0.

This plot shows v4.0 training with the repo mAP (blue) overlaid with the pycocotools mAP (orange), using --conf_thres 0.1 and default training settings. We decided to use --conf_thres 0.1 to increase test speed during training, while leaving the test.py default at --conf_thres 0.001 for the highest mAP when run by hand later on.
Screenshot 2019-03-29 at 12 13 14

glenn-jocher on 29 Mar 2019

UPDATE: difference narrowed down to 0.550 (repo calculation) vs 0.549 (pycocotools).

python3 test.py
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=False, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_process
or_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [06:05<00:00,  1.82s/it]
       5000       5000       0.11      0.746       0.55
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.549
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.309
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.454
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.266
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.406
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.236
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.466
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586

Results on first part of Single Class Tutorial (in Wiki). Blue is repo mAP, orange is pycocotools mAP.
Screenshot 2019-03-29 at 12 13 14

glenn-jocher on 30 Mar 2019

Final results are in, and PR https://github.com/ultralytics/yolov3/pull/176 complete. Repo mAP now aligns with COCO mAP under most circumstances to within 1%. Also mAP output now exceeds yolov3 darknet published results. I will close the issue finally unless there are any other questions.

| ultralytics/yolov3 with pycocotools | darknet/yolov3
--- | --- | ---
YOLOv3-320 | 51.8 | 51.5
YOLOv3-416 | 55.4 | 55.3
YOLOv3-608 | 58.2 | 57.9

sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
# bash yolov3/data/get_coco_dataset.sh
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3

python3 test.py --save-json --conf-thres 0.001 --img-size 416
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 157/157 [08:34<00:00,  2.53s/it]
       5000       5000     0.0896      0.756      0.555
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.312
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.554
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.317
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.452
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.411
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.435
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.244
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.587

python3 test.py --save-json --conf-thres 0.001 --img-size 608 --batch-size 16
Namespace(batch_size=16, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
      Image      Total          P          R        mAP
Calculating mAP: 100%|█████████████████████████████████| 313/313 [08:54<00:00,  1.55s/it]
       5000       5000     0.0966      0.786      0.579
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.331
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.582
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.344
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.198
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.362
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.427
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.281
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.437
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.463
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.494
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.577

glenn-jocher on 30 Mar 2019

👍1

Yolov3: mAP Computation vs Pycocotools

All 20 comments

Related issues