Hi @fmassa, thanks for the great code.
I am confused about the COCO AP of Faster R-CNN ResNet-50 FPN.
From the documentation, #925, and the source code,
I guess that the model was trained with the following hyperparameters and reached box AP 37.0. Am I right?
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 37.0 | 2x | 26 | 16, 22 | 16 | 0.02 |
batch_size = 2 (images per GPU) * 8 (GPUs) = 16
However, I noticed that the box AP in maskrcnn-benchmark and Detectron seems to be better, as shown below:
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:-----------------:|:--------------:|:--------:|
| maskrcnn-benchmark | R-50 FPN | 36.8 | 1x | 12.28 | 8.19, 10.92 | 16 | 0.02 |
| Detectron | R-50 FPN | 36.7 | 1x | 12.28 | 8.19, 10.92 | 16 | 0.02 |
| Detectron | R-50 FPN | 37.9 | 2x | 24.56 | 16.37, 21.83 | 16 | 0.02 |
From the maskrcnn-benchmark 1x config:
epochs = 90000 (steps) * 16 (batch size) / 117266 (training images) = 12.28
By the way, COCO 2017 has 118287 training images, but only 117266 of them contain at least one annotated object.
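For reference, here is a minimal sketch of this iterations-to-epochs conversion (just the arithmetic above, assuming a global batch size of 16 and the 117266 usable training images):

```python
# Convert a Detectron-style iteration schedule into torchvision-style epochs.
NUM_IMAGES = 117266  # COCO 2017 train images with at least one object
BATCH_SIZE = 16      # global batch size

def iters_to_epochs(iters, batch_size=BATCH_SIZE, num_images=NUM_IMAGES):
    # One epoch = num_images / batch_size iterations.
    return iters * batch_size / num_images

# 1x schedule: 90k iterations, lr drops at 60k and 80k iterations.
print(round(iters_to_epochs(90000), 2))  # 12.28 epochs
print(round(iters_to_epochs(60000), 2))  # 8.19  (first lr step)
print(round(iters_to_epochs(80000), 2))  # 10.92 (second lr step)
```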
I would like to know what causes this gap.
Also, could you share the result of training with the 1x schedule?
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | ?? | 1x | 13 | 8, 11 | 16 | 0.02 |
Thank you!
Hi,
There are a few differences between the two implementations that lead to this difference in mAP.
They all accumulate into the discrepancy that you see.
Given the complexity of Faster R-CNN as a model, every tiny detail can slightly change the training dynamics while still producing comparable models in the end (after more epochs), so for the sake of uniformity and simplicity we decided to make this compromise.
IIRC, training on the 1x schedule gives ~36.3 mAP, but I can't find the logs anymore and would need to re-train the model to be sure.
Let me know if you have more questions!
@fmassa Thanks for your explanation! Could you give us some hints on how to improve Mask R-CNN performance with small modifications to the vision repo code? From my understanding: 1) putting the stride in the 3x3 convolution should improve performance (see the sketch below); 2) L1 loss may be better for the COCO dataset; 3) MaskRCNNPredictor and similar heads in the vision repo use kaiming_normal_ initialization, which seems OK.
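Regarding point 1, a minimal sketch of the two bottleneck variants (my illustration, not code from either repo; BN, ReLU, and the residual shortcut are omitted). torchvision's ResNet puts the downsampling stride on the 3x3 conv, while the original Caffe/Detectron weights put it on the first 1x1 conv:

```python
import torch.nn as nn

def bottleneck_convs(in_ch, mid_ch, out_ch, stride=2, stride_in_1x1=False):
    # stride_in_1x1=True  -> Caffe/Detectron style: downsample in the first 1x1
    # stride_in_1x1=False -> torchvision style: downsample in the 3x3
    s1, s3 = (stride, 1) if stride_in_1x1 else (1, stride)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=s1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=s3, padding=1, bias=False),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
    )
```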
We would like to use it as our code base. However, the baseline that the vision repo provides is slightly lower than in other repos, and there is only a ResNet-50 backbone baseline. We reproduced and reported results for ResNet-101 and ResNeXt-101 32x8d backbones ourselves, but the reviewers said we should not use these lower baselines.
Hi @KaiHoo
Can you share the difference in mAP that you got?
Also, the mere fact that you are using the torchvision ResNet classification weights means that there will already be differences in the baseline.
I've recently worked on a research paper on object detection, and the same baseline model from a year ago has already been improved by some small tricks. One recent example is the slightly different version of paste_masks_in_image in detectron2, which, as a test-time-only change, improves mask AP by 0.5 to 1 point.
So, to make comparisons as fair as possible, one should not report the numbers from old papers, but instead results obtained with the same base implementation.
@fmassa Yes, we should report results on the same base implementation, but not all reviewers accept results that are lower than what they are used to. Here are the differences:
| Repo | Network | box AP | mask AP | Source |
|:---------:|:-------------------:|:------:|:-------:|:----------------:|
| vision | Mask R-CNN R50-FPN | 37.9 | 34.6 | official results |
| Detectron | Mask R-CNN R50-FPN | 38.6 | 34.5 | official results |
| vision | Mask R-CNN R101-FPN | 40.3 | 36.2 | self-trained |
| Detectron | Mask R-CNN R101-FPN | 40.9 | 36.4 | official results |
Hi @fmassa,
I finally got 37.0 box AP with the 1x schedule after applying smooth L1 loss with beta = 1/9.
Here are my experiments (note: 4 GPUs * 2 images per GPU = batch size 8, with the learning rate halved to 0.01 accordingly):
$ python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --data-path /path/to/COCO2017 --dataset coco --model fasterrcnn_resnet50_fpn --epochs 13 --lr-steps 8 11 --aspect-ratio-group-factor 3 --lr 0.01 --batch-size 2 --world-size 4
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 36.0 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.360
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.580
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.386
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.214
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.461
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.299
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.481
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.507
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.635
Training time 12:17:21
In torchvision/models/detection/rpn.py:
replace F.l1_loss with F.smooth_l1_loss
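Concretely, assuming the stock loss code in RegionProposalNetwork.compute_loss (the original snippet is also shown further below), this is a one-token change; F.smooth_l1_loss here still uses its default beta of 1.0:

```python
# In torchvision/models/detection/rpn.py, compute_loss:
box_loss = F.smooth_l1_loss(  # was: F.l1_loss(
    pred_bbox_deltas[sampled_pos_inds],
    regression_targets[sampled_pos_inds],
    reduction="sum",
) / (sampled_inds.numel())
```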
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 35.4 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.354
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.574
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.375
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.207
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.297
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.478
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.504
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.544
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
Training time 12:20:27
In torchvision/models/detection/rpn.py, replace

    box_loss = F.l1_loss(
        pred_bbox_deltas[sampled_pos_inds],
        regression_targets[sampled_pos_inds],
        reduction="sum",
    ) / (sampled_inds.numel())

with

    def smooth_l1_loss(pred, target, beta=1.0):  # from mmdetection
        assert beta > 0
        assert pred.size() == target.size() and target.numel() > 0
        diff = torch.abs(pred - target)
        loss = torch.where(diff < beta, 0.5 * diff * diff / beta,
                           diff - 0.5 * beta)
        return loss

    box_loss = smooth_l1_loss(pred_bbox_deltas[sampled_pos_inds],
                              regression_targets[sampled_pos_inds],
                              beta=1 / 9).sum() / (sampled_inds.numel())
In torchvision/models/detection/roi_heads.py, replace

    box_loss = F.smooth_l1_loss(
        box_regression[sampled_pos_inds_subset, labels_pos],
        regression_targets[sampled_pos_inds_subset],
        reduction="sum",
    )

with

    def smooth_l1_loss(pred, target, beta=1.0):  # from mmdetection
        assert beta > 0
        assert pred.size() == target.size() and target.numel() > 0
        diff = torch.abs(pred - target)
        loss = torch.where(diff < beta, 0.5 * diff * diff / beta,
                           diff - 0.5 * beta)
        return loss

    box_loss = smooth_l1_loss(box_regression[sampled_pos_inds_subset, labels_pos],
                              regression_targets[sampled_pos_inds_subset],
                              beta=1 / 9).sum()
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 37.0 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.370
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.579
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.398
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.214
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.406
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.476
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.306
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.495
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.522
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.327
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.657
Training time 12:24:10
Hi all,
Wow, thanks for the investigation @potterhsu !
It seems that, given the current implementation of the model, smooth_l1 does have a significant impact on performance! This comes as a surprise to me.
Given the large difference, I think it might be worth changing the default implementation to use smooth_l1 instead. I'll retrain all the models accordingly to validate that we get similar numbers.
@potterhsu can you send a PR?
Sure, I've sent PR #2113 for this.