The mAP computation code is similar to https://github.com/eriklindernoren/PyTorch-YOLOv3/blob/959e0ff43f5b82bdacef87f4240bae8415eac45b/test.py#L69
It is incorrect to average the AP for each sample, because AP is computed per-class. The right way is to rank all detected instances across the whole test set for each object class, compute AP for each class, and then average the AP.
Thanks for the feedback. I opened Issue #5 about this earlier. Currently test.py generates only one precision-recall curve per image, whereas, as you say, I believe we want one for each class in each image, with the average of those APs being the mAP for that image.
I can try and make this correction myself, or we could try and use an off-the-shelf solution, though that would require more imports. I was studying this link to learn more: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
After reviewing more examples, I think I can copy the methods in this repo:
https://github.com/rafaelpadilla/Object-Detection-Metrics
There is another main difference: the mAP should be calculated across all images at once, rather than once per image the way it is now. So I'm going to try to fully replace the mAP code with a new one that calculates accumulated TP and FP vectors for each class, then produces 80 precision and recall curves for all the objects in the 5000 validation images at once.
Yeah, the evaluation code in that repo is correct. Looking forward to your updates!
It looks like the original code for AP from recall-precision is fine:
https://github.com/ultralytics/yolov3/blob/c43be7b350cfff8f2423547c6aa0e8d6db07061b/utils/utils.py#L129
So I left it alone and created a new function to call it once per class in commit c43be7b350cfff8f2423547c6aa0e8d6db07061b:
https://github.com/ultralytics/yolov3/blob/c43be7b350cfff8f2423547c6aa0e8d6db07061b/utils/utils.py#L82

    # Find unique classes
    unique_classes = np.unique(np.concatenate((pred_cls, target_cls), 0))

    # Create Precision-Recall curve and compute AP for each class
    ap = []
    for c in unique_classes:
        i = pred_cls == c
        n_gt = sum(target_cls == c)  # Number of ground truth objects

        if sum(i) == 0:
            ap.append(0)
        else:
            # Accumulate FPs and TPs
            fpa = np.cumsum(1 - tp[i])
            tpa = np.cumsum(tp[i])

            # Recall
            recall = tpa / (n_gt + 1e-16)

            # Precision
            precision = tpa / (tpa + fpa)

            # AP from recall-precision curve
            ap.append(compute_ap(recall, precision))

However, when I re-evaluate mAP with this method, it drops from 58.1 to 56.7, while Darknet reports 57.9. I currently combine the true and predicted classes into the list of classes evaluated per image; perhaps I should only be using one or the other. I will have to experiment some more.
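For reference, the step that turns a recall-precision curve into an AP value can be sketched as below. This is a minimal all-point-interpolation version (precision envelope, then area under the curve), not necessarily identical to the repo's `compute_ap`; the function name `average_precision` and the toy inputs are made up for illustration.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve using a monotone precision envelope."""
    # Pad so the curve starts at recall 0 and ends at recall 1.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    # Envelope: each precision becomes the max precision at equal or higher recall.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum rectangle areas at the points where recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# Toy case: three detections ranked by score (TP, FP, TP), two ground-truth objects.
tp = np.array([1, 0, 1])
tpa = np.cumsum(tp)
fpa = np.cumsum(1 - tp)
recall = tpa / 2
precision = tpa / (tpa + fpa)
ap = average_precision(recall, precision)
print(round(ap, 4))
```

The envelope step is what keeps a late true positive from being penalized by an earlier dip in precision.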
Maybe you should perform _per-class rank ordering_ instead of _per-image rank ordering_.
Taking VOC for example, the evaluation code will first produce a per-class prediction list over the whole test set in the format (image_id, score, x0, y0, x1, y1), like:

    0000.jpg 0.98 100 100 200 200 # the 1st instance of image 0000.jpg
    0000.jpg 0.51 10 10 1000 1000 # the last instance of image 0000.jpg
    0001.jpg 0.78 100 100 200 200 # the 1st instance of image 0001.jpg
    0001.jpg 0.05 10 10 1000 1000 # the last instance of image 0001.jpg
    ...

then perform rank ordering for all instances:

_image-id score_

    0000.jpg 0.98
    0001.jpg 0.78
    0000.jpg 0.51
    ...
    0001.jpg 0.05

then compute and accumulate TPs and FPs over that ranked list. Done this way, the mAP should be higher than with per-image rank ordering (no wonder the authors of yolo said _mAP is screwed up_ xD)
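The ranking step described above can be sketched for a single class as follows. The scores reuse the example listing; the TP flags and the ground-truth count `n_gt` are invented for illustration, since the thread does not give them.

```python
import numpy as np

# Detections of ONE class over the whole test set: (image_id, score, is_tp).
# Scores follow the example listing; TP flags are hypothetical.
dets = [
    ("0000.jpg", 0.98, 1),
    ("0000.jpg", 0.51, 0),
    ("0001.jpg", 0.78, 1),
    ("0001.jpg", 0.05, 0),
]
n_gt = 2  # assumed number of ground-truth objects of this class

# Rank across images by score, not image by image.
ranked = sorted(dets, key=lambda d: d[1], reverse=True)

# Accumulate TPs and FPs down the ranked list.
tp = np.array([d[2] for d in ranked])
tpa = np.cumsum(tp)
fpa = np.cumsum(1 - tp)
recall = tpa / n_gt
precision = tpa / (tpa + fpa)

for (img, score, _), r, p in zip(ranked, recall, precision):
    print(img, score, round(float(r), 3), round(float(p), 3))
```

One precision-recall curve per class comes out of this, and the per-class APs are then averaged to get mAP.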
@xyutao I'm updating the mAP code, both to correct the repo's mAP calculation and to output a COCO JSON file that can be passed to cocoapi to compute the official mAP.
Earlier you recommended switching from per-image rank ordering to per-class rank ordering. Is this still your recommendation? Do you know if this is how COCO computes mAP?
I think the ordering is performed for each class in each image independently. It is performed in the following lines
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L155-L158
which calls
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L260
@xyutao @okanlv I just noticed, the pycocotools demo notebook is selecting a subset of the entire validation set, just as we want for yolov3, since darknet only validates on the 5000 images in 5k.txt. You can see in the notebook that pycocotools states it is running a _per image_ evaluation.
https://github.com/cocodataset/cocoapi/blob/master/PythonAPI/pycocoEvalDemo.ipynb

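@xyutao, have you figured this question out? Under the traditional mAP calculation, it should indeed work the way you describe: detect all candidate objects in all images, sort them by confidence, take them one by one as positive predictions while computing recall and precision, and finally compute the AP for each class. However, after reading this repo author's replies and going through the COCO API code in detail, I suspect that computing the AP of each image and then averaging might actually be how COCO computes mAP. So far I have not found any official documentation that precisely describes COCO's mAP calculation; at least, every description of object-detection mAP I could find through Chinese-language searches describes the traditional method.
@guagen The COCO API also uses the traditional method. It first calls the evaluateImg function, which, for each class in a single image, computes how the detection boxes match the ground truth. It then calls the accumulate function, which merges the match results of all images for the same class, sorts them by detection score in descending order, and finally computes precision-recall over the merged results.
For the return value of the evaluateImg function, see: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L302
Code that merges the detection scores of all images for a single class:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363
Code that sorts the detection scores in descending order:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L367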
@glenn-jocher The per-image evaluation just matches the detections and gt for each category, as shown in the evaluateImg function:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L236
The recall and precision are computed per category, by accumulating the matching and ranking the detection scores for all images of the category. See:
https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L363
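@xyutao OK, what a headache. It looks like the author of this code got it wrong; moreover, with the way he wrote it, the resulting mAP is not equal to the sum of the individual per-class APs divided by the total number of classes.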
@glenn-jocher Here I paste the key code from COCO API for accumulating dets as follows:
(ref: https://github.com/cocodataset/cocoapi/blob/ed842bffd41f6ff38707c4f0968d2cfd91088688/PythonAPI/pycocotools/cocoeval.py#L354)

In this fragment, k_list stores the category ids, a_list stores the area ranges, m_list stores the max detections, and i_list stores the image ids. The outer for-loop iterates per category, while the inner for-loop accumulates the matching results of each image:

    E = [self.evalImgs[Nk + Na + i] for i in i_list]

then merge the scores of all detections:

    dtScores = np.concatenate([e['dtScores'][0:maxDet] for e in E])

and rank them in descending order:

    inds = np.argsort(-dtScores, kind='mergesort')

The matching results pre-computed in _evaluateImg_ are merged:

    dtm = np.concatenate([e['dtMatches'][:,0:maxDet] for e in E], axis=1)[:,inds]

and tps and fps are initialized from the merged matches:

    tps = np.logical_and(               dtm,  np.logical_not(dtIg) )
    fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg) )
    ...

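To make the quoted lines concrete, here is a small runnable sketch that applies them to two toy per-image results for one category. The dict values, `maxDet`, and `npig` (the count of non-ignored ground-truth boxes) are invented for illustration, and the cumulative-sum continuation after `tps`/`fps` is my reading of the elided part rather than a verbatim copy of pycocotools.

```python
import numpy as np

# Toy per-image results for ONE category at ONE IoU threshold (T = 1).
# 'dtMatches' holds the matched ground-truth id, with 0 meaning "unmatched".
E = [
    {"dtScores": np.array([0.9, 0.3]),
     "dtMatches": np.array([[11, 0]]),
     "dtIgnore": np.array([[False, False]])},
    {"dtScores": np.array([0.6]),
     "dtMatches": np.array([[12]]),
     "dtIgnore": np.array([[False]])},
]
maxDet = 100
npig = 2  # assumed number of non-ignored ground-truth boxes

# Merge scores from all images and rank them in descending order.
dtScores = np.concatenate([e["dtScores"][0:maxDet] for e in E])
inds = np.argsort(-dtScores, kind="mergesort")

# Merge the per-image matches and ignore flags in the same ranked order.
dtm = np.concatenate([e["dtMatches"][:, 0:maxDet] for e in E], axis=1)[:, inds]
dtIg = np.concatenate([e["dtIgnore"][:, 0:maxDet] for e in E], axis=1)[:, inds]

tps = np.logical_and(dtm, np.logical_not(dtIg))
fps = np.logical_and(np.logical_not(dtm), np.logical_not(dtIg))

# Cumulative counts along the ranked detections give recall and precision.
tp_sum = np.cumsum(tps, axis=1).astype(float)
fp_sum = np.cumsum(fps, axis=1).astype(float)
recall = tp_sum / npig
precision = tp_sum / (tp_sum + fp_sum + np.spacing(1))
print(inds, recall, precision)
```

The detection from the second image (score 0.6) lands between the two detections of the first image, which is exactly the cross-image ranking that per-image averaging misses.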
I think it should be a fairly straightforward change to test.py to calculate the mAP averaged per class rather than averaged per image. I can try and implement this in the next few days. Luckily the new --save-json argument in test.py outputs the official COCO mAP now, so this will make comparison of the updates easier.
We probably want to lower the default --conf-thresh from 0.3 to something like 0.001 as well.
@glenn-jocher
I can't wait to see these changes. Those will make this code look even better. :)
Anyway, thanks for your excellent work.
All, I created a new https://github.com/ultralytics/yolov3/tree/map_update branch to test mAP updates. I converted from image-averaging to class-averaging. The result is 0.519 mAP now vs 0.550 pycocotools mAP using official yolov3.weights. I'm not sure where the discrepancy lies.
This mainly affects custom data, since for COCO data we can simply use the --save-json flag in test.py, which produces a 0.550 mAP that is essentially identical to the reported 0.553 mAP. One option for custom data would be to create both a predictions json file and an annotations json file, and then run pycocotools on the pair. I tried this, but ran into problems producing the annotations file; it seems more complicated than the targets json produced by --save-json.
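As a rough sketch of what a minimal annotations file for custom data might look like, the helper below builds a COCO-style ground-truth dict. The function name `make_coco_ground_truth` and the label tuple layout are hypothetical, and the field list reflects my understanding of what pycocotools reads during bbox evaluation (`id`, `image_id`, `category_id`, `bbox`, `area`, `iscrowd`), so treat it as an assumption to verify against your pycocotools version.

```python
import json

def make_coco_ground_truth(labels, class_names):
    """Build a minimal COCO-style ground-truth dict.

    labels: iterable of (image_id, class_index, x, y, w, h) in pixels,
            with (x, y) the top-left corner of the box.
    """
    images, annotations = {}, []
    # Annotation ids start at 1: a matched id of 0 would read as "unmatched"
    # in the tps line quoted from cocoeval.py earlier in this thread.
    for ann_id, (image_id, cls, x, y, w, h) in enumerate(labels, start=1):
        images[image_id] = {"id": image_id}
        annotations.append({
            "id": ann_id,
            "image_id": image_id,
            "category_id": cls,      # assumption: class index used as category id
            "bbox": [x, y, w, h],    # COCO boxes are [x, y, width, height]
            "area": w * h,
            "iscrowd": 0,
        })
    categories = [{"id": i, "name": n} for i, n in enumerate(class_names)]
    return {"images": list(images.values()),
            "annotations": annotations,
            "categories": categories}

gt = make_coco_ground_truth(
    [(1, 0, 10, 20, 30, 40), (1, 1, 5, 5, 10, 10)],
    ["cat", "dog"],
)
text = json.dumps(gt)
print(text)
```

The predictions file is simpler: a flat list of entries with `image_id`, `category_id`, `bbox`, and `score`.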
Another item is that the previous mAP calculation could operate at a reasonable conf_thres of 0.1-0.5. The new mAP calculation requires an extremely low conf_thres for best results (as is the case with pycocotools mAP as well), around conf_thres=0.001, which takes longer to compute (from 1 min per epoch to 6 minutes on a V100). This produces an extreme number of False Positives (FPs), all requiring NMS (which is currently on CPU). I estimate that for each true detection there are about 15 false positives. It's beyond me how this garbage dump of FPs has been prioritized by the COCO organizers as their metric of choice, but it seems to be the only number anyone cares about.
rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
100%|██████████| 157/157 [06:34<00:00, 1.93s/it]
5000 5000 0.0865 0.727 0.519
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.550
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.336
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.267
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.408
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.432
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
UPDATE: difference narrowed down to 0.531 (repo calculation) vs 0.551 (pycocotools). The obj_conf used affects the mAP: whether it is multiplied by class_conf, and if so whether that class_conf is produced by sigmoid or softmax.
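The three scoring variants mentioned here can be compared on toy numbers as sketched below. The logits are invented, and the snippet makes no claim about which variant the repo settled on; it only shows that the choice changes the score used for ranking, and therefore the mAP.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

# Hypothetical raw network outputs for one predicted box.
obj_logit = 2.0
cls_logits = np.array([3.0, 1.0, -1.0])

obj_conf = sigmoid(obj_logit)

# Variant 1: rank detections by objectness alone.
score_obj_only = obj_conf
# Variant 2: objectness times an independent (sigmoid) class confidence.
score_sigmoid = obj_conf * sigmoid(cls_logits).max()
# Variant 3: objectness times a softmax class confidence.
score_softmax = obj_conf * softmax(cls_logits).max()

print(round(float(score_obj_only), 4),
      round(float(score_sigmoid), 4),
      round(float(score_softmax), 4))
```

Because AP depends only on the ranking of detections, any variant that reorders detections relative to one another can shift the result.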
rm -rf yolov3 && git clone -b map_update --depth 1 https://github.com/ultralytics/yolov3 yolov3
python3 test.py --conf-thres 0.001 --save-json
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
100%|██████████| 157/157 [07:00<00:00, 2.09s/it]
5000 5000 0.0865 0.727 0.531
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.308
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.551
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.308
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.143
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.334
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.455
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.267
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.407
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.432
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.240
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.470
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590
We've made great efforts to align the repo mAP with the COCO mAP. It's not perfect, but the current result seems to steadily track about 2% lower than the COCO mAP as the epochs trend higher. We are going to stop updating the repo mAP code now, merge the development branch with the master and release v4.0.
This plot shows v4.0 training with the repo mAP (blue) overlaid with the pycocotools mAP (orange), using --conf_thres 0.1 and default training settings. We decided to use --conf_thres 0.1 to increase test speed during training, while leaving the test.py default at --conf_thres 0.001 for the highest mAP when run by hand later on.

UPDATE: difference narrowed down to 0.550 (repo calculation) vs 0.549 (pycocotools).
python3 test.py
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=False, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
Calculating mAP: 100%|██████████| 157/157 [06:05<00:00, 1.82s/it]
5000 5000 0.11 0.746 0.55
...
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.549
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.309
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.142
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.335
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.454
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.266
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.406
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.429
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.236
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.466
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.586
Results on first part of Single Class Tutorial (in Wiki). Blue is repo mAP, orange is pycocotools mAP.

Final results are in, and PR https://github.com/ultralytics/yolov3/pull/176 complete. Repo mAP now aligns with COCO mAP under most circumstances to within 1%. Also mAP output now exceeds yolov3 darknet published results. I will close the issue finally unless there are any other questions.
Model | ultralytics/yolov3 with pycocotools (mAP@0.5) | darknet/yolov3 (mAP@0.5)
--- | --- | ---
YOLOv3-320 | 51.8 | 51.5
YOLOv3-416 | 55.4 | 55.3
YOLOv3-608 | 58.2 | 57.9
sudo rm -rf yolov3 && git clone https://github.com/ultralytics/yolov3
# bash yolov3/data/get_coco_dataset.sh
sudo rm -rf cocoapi && git clone https://github.com/cocodataset/cocoapi && cd cocoapi/PythonAPI && make && cd ../.. && cp -r cocoapi/PythonAPI/pycocotools yolov3
cd yolov3
python3 test.py --save-json --conf-thres 0.001 --img-size 416
Namespace(batch_size=32, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=416, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
Calculating mAP: 100%|██████████| 157/157 [08:34<00:00, 2.53s/it]
5000 5000 0.0896 0.756 0.555
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.312
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.554
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.317
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.145
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.343
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.452
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.411
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.435
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.244
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.477
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.587
python3 test.py --save-json --conf-thres 0.001 --img-size 608 --batch-size 16
Namespace(batch_size=16, cfg='cfg/yolov3.cfg', conf_thres=0.001, data_cfg='cfg/coco.data', img_size=608, iou_thres=0.5, nms_thres=0.5, save_json=True, weights='weights/yolov3.weights')
Using cuda _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', major=7, minor=0, total_memory=16130MB, multi_processor_count=80)
Image Total P R mAP
Calculating mAP: 100%|██████████| 313/313 [08:54<00:00, 1.55s/it]
5000 5000 0.0966 0.786 0.579
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.331
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.582
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.344
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.198
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.362
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.427
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.281
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.437
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.463
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.309
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.494
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.577