When I use the provided VOC evaluation code to evaluate results on Pascal VOC (trained on 07+12 trainval, tested on 07 test), I get 82.6 mAP (Faster-RCNN-FPN-ResNet101). However, if I use the official evaluation code provided in https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/datasets/voc_eval.py, I get different results. Could you check this, and would it be possible to support the official evaluation code? Thank you so much.
Hi,
One thing to verify: there are two evaluation modes, the VOC 2007 metric and the VOC 2012 metric.
From experience, the two metrics give different results, as the 07 metric is much coarser.
Can you check whether this is the reason for the difference? If not, then I'd be very interested in getting the discrepancy resolved.
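For reference, the gap comes from how AP itself is computed. Roughly, this is the voc_ap helper from py-faster-rcnn's voc_eval.py (reproduced from memory, so double-check against the file): the 07 metric samples precision at 11 fixed recall points, while the newer metric integrates the whole precision/recall curve.

import numpy as np

def voc_ap(rec, prec, use_07_metric=False):
    """Compute VOC AP given precision and recall arrays."""
    if use_07_metric:
        # VOC 2007: average precision sampled at recall = 0.0, 0.1, ..., 1.0
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                p = np.max(prec[rec >= t])
            ap = ap + p / 11.0
    else:
        # VOC 2010+ style: exact area under the monotonically decreasing PR curve
        mrec = np.concatenate(([0.0], rec, [1.0]))
        mpre = np.concatenate(([0.0], prec, [0.0]))
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
        i = np.where(mrec[1:] != mrec[:-1])[0]
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap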
Hi, I double-checked it and I am using the 07 metric. The evaluation method in your code always gives higher results. Faster-RCNN-ResNet50 gets 79.5 mAP with your Pascal evaluation method, which is much higher than any published result, e.g. 79.8 for ResNet101 and ~76.8 for ResNet50. After testing with the official code, the results dropped a lot. I also tested with the COCO evaluation method: 43.3 mAP, 72.7 [email protected], 46.6 [email protected]. Would it be possible to simply use the official evaluation code provided in https://github.com/rbgirshick/py-faster-rcnn/blob/master/lib/datasets/voc_eval.py? Thank you again for your kind reply and very nice implementation.
Hi,
Thanks for pointing this out.
I haven't implemented those evaluation metrics myself; it was a contributor who added this feature.
We should definitely find the reason for such a large discrepancy; those functions were originally taken from py-faster-rcnn, so there shouldn't be much difference.
cc @lufficc who originally sent the PR with the implementation
When I tested on my side, I got 0.7306 (official) vs 0.7332. Details below. Is this within an acceptable error range? I didn't use the official code directly, since it writes a lot of tmp files and doesn't really fit this project. @xuw080 @fmassa
Official:
AP for aeroplane = 0.7395
AP for bicycle = 0.8198
AP for bird = 0.7462
AP for boat = 0.5540
AP for bottle = 0.5976
AP for bus = 0.7942
AP for car = 0.8425
AP for cat = 0.8490
AP for chair = 0.5393
AP for cow = 0.8049
AP for diningtable = 0.6299
AP for dog = 0.8432
AP for horse = 0.8380
AP for motorbike = 0.7992
AP for person = 0.8206
AP for pottedplant = 0.4631
AP for sheep = 0.7499
AP for sofa = 0.6840
AP for train = 0.7690
AP for tvmonitor = 0.7280
Mean AP = 0.7306
~~~~~~~~
Results:
0.740
0.820
0.746
0.554
0.598
0.794
0.843
0.849
0.539
0.805
0.630
0.843
0.838
0.799
0.821
0.463
0.750
0.684
0.769
0.728
0.731
~~~~~~~~
--------------------------------------------------------------
Results computed with the **unofficial** Python eval code.
Results should be very close to the official MATLAB eval code.
Recompute with `./tools/reval.py --matlab ...` for your paper.
-- Thanks, The Management
--------------------------------------------------------------
Mine:
mAP: 0.7332
aeroplane : 0.7401
bicycle : 0.8159
bird : 0.7521
boat : 0.5592
bottle : 0.6034
bus : 0.7943
car : 0.8438
cat : 0.8495
chair : 0.5447
cow : 0.8090
diningtable : 0.6411
dog : 0.8389
horse : 0.8412
motorbike : 0.8009
person : 0.8240
pottedplant : 0.4681
sheep : 0.7510
sofa : 0.6863
train : 0.7733
tvmonitor : 0.7279
Thank you for your reply; in fact, I got different results. I train with 07 trainval + 12 trainval and test on 07 test:
Faster-RCNN-R-50-C4: 79.47 mAP (voc evaluation), 73.3 mAP (official evaluation); 43.3 mAP, 72.7 [email protected], 46.6 [email protected] (coco evaluation).
Faster-RCNN-R-50-FPN: 81.34 mAP (voc evaluation), 74.3 mAP (official evaluation); 45 mAP, 74.1 [email protected], 48.6 [email protected] (coco evaluation).
Faster-RCNN-R-101-FPN: 82.56 mAP (voc evaluation), 75.8 mAP (official evaluation); 47.7 mAP, 75.5 [email protected], 52.5 [email protected] (coco evaluation).
Faster-RCNN-ResNeXt-101-FPN: 83.19 mAP (voc evaluation), 75.9 mAP (official evaluation); 48.1 mAP, 75.1 [email protected], 53.2 [email protected] (coco evaluation).
All models are trained on 8 GPUs, each holding two images, with lr=0.02, the lr decay step at 10k iterations, and 14.5k max iterations (the decay falls at ~5 epochs and training runs ~7 epochs, the same number of epochs as in the provided config file). I didn't change any code.
This is the evaluation results with RCNN-R-50-C4:
2018-11-26 02:01:42,876 maskrcnn_benchmark.utils.checkpoint INFO: Saving checkpoint to /mnt/HeadNode-1/user/xudong/models/detectron_models/R-50-C4/VOC/0/model_0014500.pth
2018-11-26 02:01:43,753 maskrcnn_benchmark.trainer INFO: Total training time: 4:48:23.790570 (1.1934 s / it)
2018-11-26 02:02:27,432 maskrcnn_benchmark.inference INFO: Start evaluation on voc_2007_test dataset(4952 images).
2018-11-26 02:03:41,848 maskrcnn_benchmark.inference INFO: Total inference time: 0:01:14.415771 (0.12021933926904337 s / img per device, on 8 devices)
2018-11-26 02:03:42,943 maskrcnn_benchmark.inference INFO: performing voc evaluation, ignored iou_types.
2018-11-26 02:03:57,257 maskrcnn_benchmark.inference INFO: mAP: 0.7947
aeroplane : 0.8436
bicycle : 0.8632
bird : 0.7829
boat : 0.6946
bottle : 0.6666
bus : 0.8577
car : 0.8816
cat : 0.8831
chair : 0.6417
cow : 0.8653
diningtable : 0.7266
dog : 0.8674
horse : 0.8789
motorbike : 0.8503
person : 0.8591
pottedplant : 0.5115
sheep : 0.8295
sofa : 0.7705
train : 0.8445
tvmonitor : 0.7750
@lufficc @fmassa
Are you using the official code correctly? Could I have a look at your code that evaluates with the official code? Then we may find the problem, because I get similar results when using the official code.
I save all the detection results of each class in the format:
image_id + " " + score + " " + xmin + " " + ymin + " " + xmax + " " + ymax + "\n".
All predictions of the same class are saved in a single txt file named 'ClassName'_test.txt, and the official voc_eval.py then simply reads these files to produce the official evaluation results.
Would you be able to send me your evaluation code? I can test it with your implementation and provide new test results for all of my experiments.
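For concreteness, a minimal sketch of that dumping step, assuming a prediction list like the one this project produces (the get_img_name helper is hypothetical; it just has to return the original VOC image id such as "000001"):

import os
from collections import defaultdict

def dump_voc_detections(dataset, predictions, output_dir):
    # group detections by class name; one line per box: image_id score xmin ymin xmax ymax
    per_class = defaultdict(list)
    for image_id, prediction in enumerate(predictions):
        img_info = dataset.get_img_info(image_id)
        prediction = prediction.resize((img_info["width"], img_info["height"]))
        boxes = prediction.bbox.numpy()
        labels = prediction.get_field("labels").numpy()
        scores = prediction.get_field("scores").numpy()
        name = dataset.get_img_name(image_id)  # hypothetical helper returning e.g. "000001"
        for box, label, score in zip(boxes, labels, scores):
            cls = dataset.map_class_id_to_class_name(label)
            per_class[cls].append("{} {:.3f} {:.1f} {:.1f} {:.1f} {:.1f}".format(
                name, score, box[0], box[1], box[2], box[3]))
    # one file per class, readable by the official voc_eval.py
    for cls, lines in per_class.items():
        with open(os.path.join(output_dir, "{}_test.txt".format(cls)), "w") as f:
            f.write("\n".join(lines) + "\n")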
With the current voc evaluation method, Faster-RCNN-ResNet50 reaches 0.7947 mAP, which is much higher than any published result. The highest published result for Faster-RCNN-ResNet101 is 79.8, and ResNet50 should reach about 76~77 mAP, so I don't think 79.47 is reasonable. Also, I tested with the COCO evaluation method (using the COCO-format Pascal VOC 07 test set provided in https://github.com/facebookresearch/multipathnet/blob/master/README.md); its [email protected] of 72.7 is also very different from the voc evaluation result.
Replace this https://github.com/facebookresearch/maskrcnn-benchmark/blob/55d3ab44aa9ff5261afb8244ab5e265266d8436f/maskrcnn_benchmark/data/datasets/evaluation/voc/voc_eval.py#L12 with:
# assumes the module-level imports of voc_eval.py are available: os, numpy as np, tqdm, and the voc_eval() helper
def do_voc_evaluation(dataset, predictions, output_folder, logger):
    # collect detections per class: (image_name, xmin, ymin, xmax, ymax, score)
    class_boxes = {dataset.map_class_id_to_class_name(i + 1): [] for i in range(20)}
    for image_id, prediction in tqdm(enumerate(predictions)):
        img_info = dataset.get_img_info(image_id)
        if len(prediction) == 0:
            continue
        image_width = img_info["width"]
        image_height = img_info["height"]
        prediction = prediction.resize((image_width, image_height))
        pred_bbox = prediction.bbox.numpy()
        pred_label = prediction.get_field("labels").numpy()
        pred_score = prediction.get_field("scores").numpy()
        for i, class_id in enumerate(pred_label):
            image_name = dataset.get_origin_id(image_id)
            box = pred_bbox[i]
            score = pred_score[i]
            class_name = dataset.map_class_id_to_class_name(class_id)
            class_boxes[class_name].append((image_name, box[0], box[1], box[2], box[3], score))

    aps = []
    tmp = os.path.join(output_folder, 'tmp')
    if not os.path.exists(tmp):
        os.makedirs(tmp)
    for key in dataset.CLASSES[1:]:
        # write one detection file per class, then run the official voc_eval on it
        filename = os.path.join(output_folder, '{}.txt'.format(key))
        if os.path.exists(filename):
            os.remove(filename)
        with open(filename, 'wt') as txt:
            boxes = class_boxes[key]
            for k in range(len(boxes)):
                box = boxes[k]
                txt.write('{:s} {:.3f} {:.1f} {:.1f} {:.1f} {:.1f}\n'.format(box[0], box[-1], box[1], box[2], box[3], box[4]))
        devkit_path = '/data7/lufficc/voc/VOCdevkit/VOC2007'
        annopath = os.path.join(devkit_path, 'Annotations', '{:s}.xml')
        imagesetfile = os.path.join(devkit_path, 'ImageSets', 'Main', 'test.txt')
        rec, prec, ap = voc_eval(filename, annopath, imagesetfile, key, tmp, ovthresh=0.5, use_07_metric=True)
        aps += [ap]
        print(('AP for {} = {:.4f}'.format(key, ap)))
    print(('Mean AP = {:.4f}'.format(np.mean(aps))))
    print('~~~~~~~~')
    print('Results:')
    for ap in aps:
        print(('{:.3f}'.format(ap)))
    print(('{:.3f}'.format(np.mean(aps))))
    print('~~~~~~~~')
    print('')
    print('--------------------------------------------------------------')
    print('Results computed with the **unofficial** Python eval code.')
    print('Results should be very close to the official MATLAB eval code.')
    print('Recompute with `./tools/reval.py --matlab ...` for your paper.')
    print('-- Thanks, The Management')
    print('--------------------------------------------------------------')
Also add
def get_origin_id(self, index):
    img_id = self.ids[index]
    return img_id
in PascalVOCDataset
Thanks, I will test it now.
Hi, I compared the official evaluation method and your method on two different datasets, Pascal VOC0712 and KITTI; I use VOC-style format for training and testing on both. For Pascal VOC0712, the difference between your method and the official evaluation method is in the range of 0.3 to 0.5 mAP. For KITTI, the difference between the official evaluation method and your method is over 10 to 20 mAP. In fact, I originally suspected the evaluation method differed from the official one because I got much lower results on the official KITTI benchmark, even though your voc evaluation method reported a higher mAP than my previous results from another pytorch-faster-rcnn implementation (which uses the official voc evaluation code). I double-checked your method and the official evaluation code you provided; I didn't find the reason, but the gap exists, smaller for Pascal and larger for KITTI. Below are my test results using your original evaluation code and the official evaluation code:
Original:
2018-12-11 15:05:36,741 maskrcnn_benchmark.inference INFO: mAP: 0.5585
pedestrian : 0.5833
cyclist : 0.4626
car : 0.8606
dontcare : 0.3273
Official:
2018-12-11 14:57:52,995 maskrcnn_benchmark.inference INFO: Total inference time: 0:01:39.622598 (0.10572841433765212 s / img per device, on 4 devices)
2018-12-11 14:57:54,024 maskrcnn_benchmark.inference INFO: performing voc evaluation, ignored iou_types.
AP for pedestrian = 0.3188
AP for cyclist = 0.2444
AP for car = 0.4326
AP for dontcare = 0.1824
Mean AP = 0.2945
The official evaluation results match the results I got on the official KITTI benchmark.
I didn't find the reason, but I think we should stick with the official evaluation method, as you suggested in previous answers. Anyway, thank you so much for providing this code and for the kind replies; it really helped me a lot. @lufficc @fmassa
One question @xuw080 : are you using python 2?
It might be a matter of integer / float division differences between different python versions?
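Just to illustrate the kind of difference I mean (a toy example, not taken from the eval code):

# Python 2: 72 / 100 evaluates to 0 because both operands are ints
# Python 3: 72 / 100 evaluates to 0.72
recall = 72 / 100
recall_safe = 72 / float(100)  # 0.72 on both versions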
I am using python3.7.
@lufficc were you using python 2 when evaluating the code?
Also, could you try once more to see if you manage to reproduce the same results as before, or if you get results similar to @xuw080 ?
Thanks!
I am using python3.6. I tried again and got a very close result. I also use the same code in another project of mine and it gives results very close to the official code. I cannot figure out why @xuw080 got such a big difference on the KITTI dataset.
Maybe the two lines below cause the huge difference? Because if no object is detected on an image, that image is not evaluated at all.
https://github.com/facebookresearch/maskrcnn-benchmark/blob/55d3ab44aa9ff5261afb8244ab5e265266d8436f/maskrcnn_benchmark/data/datasets/evaluation/voc/voc_eval.py#L19-L20
@fmassa @xuw080
That might be the reason? I'd need to compare against the original implementation, but good spot!
Could you see if removing it brings the results closer?
Now I am sure it is these two lines that cause this.
After I change it to:
if len(prediction) == 0 or image_id % 2 == 0:
    continue
the official code gets 0.3840 while ours gets 0.7252. Without this change, the two are still close.
So if the detector finds no objects on many images, the official code gets a low mAP, while ours still gets a higher one because those images are not taken into account; this is why @xuw080 got a higher mAP on KITTI.
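To put made-up numbers on it (recall is TP divided by the number of ground-truth boxes the evaluator knows about):

# toy numbers, not from the actual runs
tp = 50          # true positives found by the detector
npos_all = 200   # ground-truth boxes over all test images
npos_kept = 80   # ground-truth boxes only on images with at least one detection

print(tp / npos_all)   # 0.25  -> recall the official eval sees
print(tp / npos_kept)  # 0.625 -> recall when empty-prediction images are skipped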
@fmassa
We should remove the if len(prediction) == 0 skipping, this is probably not right...
Yes, we should remove the `if len(prediction) == 0` skipping.
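If it helps, the fix might look roughly like this, keeping the ground truth of every image and dropping only the skip (the surrounding names such as get_groundtruth and the downstream matching are assumptions about the current file, not a tested patch):

pred_boxlists = []
gt_boxlists = []
for image_id, prediction in enumerate(predictions):
    img_info = dataset.get_img_info(image_id)
    # no `if len(prediction) == 0: continue` here: keep the ground truth of every
    # image so npos (and therefore recall/AP) covers the whole test set
    gt_boxlists.append(dataset.get_groundtruth(image_id))
    prediction = prediction.resize((img_info["width"], img_info["height"]))
    pred_boxlists.append(prediction)
# the AP computation downstream stays unchanged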
does this break the rest of the evaluation pipeline?
No, it doesn't.
No progress?
@drcege I was on holidays and didn't have the chance to look into it.
@lufficc do you think that the changes that we have mentioned could be the fix? If yes, could you send a PR with the fix?
Thanks!
Is the evaluation pipeline correct now?
@dby2017 should have been fixed in https://github.com/facebookresearch/maskrcnn-benchmark/pull/648