Hi @fmassa, thanks for the great code.
I am confused about the COCO AP of Faster R-CNN ResNet-50 FPN.
From the documentation, #925, and the source code,
I guess that the model was trained with the following hyperparameters and reached box AP 37.0. Am I right?
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 37.0 | 2x | 26 | 16, 22 | 16 | 0.02 |
batch_size = 2 (images per GPU) * 8 (GPUs) = 16
However, I noticed that the box AP in maskrcnn-benchmark and Detectron seems to be better, as shown below:
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:-----------------:|:--------------:|:--------:|
| maskrcnn-benchmark | R-50 FPN | 36.8 | 1x | 12.28 | 8.19, 10.92 | 16 | 0.02 |
| Detectron | R-50 FPN | 36.7 | 1x | 12.28 | 8.19, 10.92 | 16 | 0.02 |
| Detectron | R-50 FPN | 37.9 | 2x | 24.56 | 16.37, 21.83 | 16 | 0.02 |
From the maskrcnn-benchmark 1x config:
epochs = 90000 (steps) * 16 (batch size) / 117266 (training images) = 12.28
By the way, COCO 2017 has 118287 training images, but only 117266 of them contain at least one annotated object.
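For reference, here is a minimal sketch of this iterations-to-epochs conversion (just the arithmetic above, assuming a global batch size of 16 and the 117266 usable training images):

```python
# Convert a Detectron-style iteration schedule into torchvision-style epochs.
NUM_IMAGES = 117266  # COCO 2017 train images with at least one object
BATCH_SIZE = 16      # global batch size

def iters_to_epochs(iters, batch_size=BATCH_SIZE, num_images=NUM_IMAGES):
    # One epoch = num_images / batch_size iterations.
    return iters * batch_size / num_images

# 1x schedule: 90k iterations, lr drops at 60k and 80k iterations.
print(round(iters_to_epochs(90000), 2))  # 12.28 epochs
print(round(iters_to_epochs(60000), 2))  # 8.19  (first lr step)
print(round(iters_to_epochs(80000), 2))  # 10.92 (second lr step)
```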
I would like to know what causes this gap.
Also, could you share the result of training with the 1x schedule?
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | ?? | 1x | 13 | 8, 11 | 16 | 0.02 |
Thank you!
Hi,
There are a few differences between the two implementations that lead to this difference in mAP.
They all accumulate into the discrepancy that you see.
Given the complexity of Faster R-CNN as a model, every tiny detail can slightly change the training dynamics while still producing comparable models in the end (after more epochs), so for the sake of uniformity and simplicity we decided to make this compromise.
IIRC, training on the 1x schedule gives ~36.3 mAP, but I can't find the logs anymore and would need to re-train the model to be sure.
Let me know if you have more questions!
@fmassa Thanks for your explanation! Could you give us some hints on how to improve Mask R-CNN performance with small modifications to the vision repo code? From my understanding: 1) putting the stride in the 3x3 convolution should improve performance (see the sketch below); 2) L1 loss may be better for the COCO dataset; 3) MaskRCNNPredictor and similar heads in the vision repo use kaiming_normal_ initialization, which seems OK.
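Regarding point 1, a minimal sketch of the two bottleneck variants (my illustration, not code from either repo; BN, ReLU, and the residual shortcut are omitted). torchvision's ResNet puts the downsampling stride on the 3x3 conv, while the original Caffe/Detectron weights put it on the first 1x1 conv:

```python
import torch.nn as nn

def bottleneck_convs(in_ch, mid_ch, out_ch, stride=2, stride_in_1x1=False):
    # stride_in_1x1=True  -> Caffe/Detectron style: downsample in the first 1x1
    # stride_in_1x1=False -> torchvision style: downsample in the 3x3
    s1, s3 = (stride, 1) if stride_in_1x1 else (1, stride)
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=s1, bias=False),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=s3, padding=1, bias=False),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
    )
```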
We would like to use it as our code base. However, the baseline that the vision repo provides is slightly lower than in other repos, and there is only a ResNet-50 backbone baseline. We reproduced and reported results for ResNet-101 and ResNeXt-101 32x8d backbones ourselves, but the reviewers said we should not use these lower baselines.
Hi @KaiHoo
Can you share the difference in mAP that you got?
Also, the mere fact that you are using the torchvision ResNet classification weights means that there will already be differences in the baseline.
I've recently worked on a research paper on object detection, and the same baseline model from a year ago has already been improved by some small tricks. One recent example is the slightly different version of paste_masks_in_image in detectron2, which, as a test-time-only change, improves mask AP by 0.5 to 1 point.
So, to make comparisons as fair as possible, one should not report the numbers from old papers, but instead results obtained with the same base implementation.
@fmassa Yes, we should report results on the same base implementation, but not all reviewers accept results that are lower than what they are used to. Here are the differences:
| Repo | Network | box AP | mask AP | Source |
|:---------:|:-------------------:|:------:|:-------:|:----------------:|
| vision | Mask R-CNN R50-FPN | 37.9 | 34.6 | official results |
| Detectron | Mask R-CNN R50-FPN | 38.6 | 34.5 | official results |
| vision | Mask R-CNN R101-FPN | 40.3 | 36.2 | self-trained |
| Detectron | Mask R-CNN R101-FPN | 40.9 | 36.4 | official results |
Hi @fmassa,
I finally got 37.0 box AP with the 1x schedule after applying smooth L1 loss with beta = 1/9.
Here are my experiments (note: 4 GPUs * 2 images per GPU = batch size 8, with the learning rate halved to 0.01 accordingly):
$ python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --data-path /path/to/COCO2017 --dataset coco --model fasterrcnn_resnet50_fpn --epochs 13 --lr-steps 8 11 --aspect-ratio-group-factor 3 --lr 0.01 --batch-size 2 --world-size 4
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 36.0 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.360
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.580
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.386
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.214
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.398
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.461
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.299
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.481
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.507
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.321
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.550
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.635
Training time 12:17:21
In torchvision/models/detection/rpn.py:
replace F.l1_loss with F.smooth_l1_loss
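Concretely, assuming the stock loss code in RegionProposalNetwork.compute_loss (the original snippet is also shown further below), this is a one-token change; F.smooth_l1_loss here still uses its default beta of 1.0:

```python
# In torchvision/models/detection/rpn.py, compute_loss:
box_loss = F.smooth_l1_loss(  # was: F.l1_loss(
    pred_bbox_deltas[sampled_pos_inds],
    regression_targets[sampled_pos_inds],
    reduction="sum",
) / (sampled_inds.numel())
```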
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 35.4 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.354
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.574
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.375
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.207
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.391
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.450
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.297
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.478
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.504
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.318
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.544
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.634
Training time 12:20:27
In torchvision/models/detection/rpn.py, replace

    box_loss = F.l1_loss(
        pred_bbox_deltas[sampled_pos_inds],
        regression_targets[sampled_pos_inds],
        reduction="sum",
    ) / (sampled_inds.numel())

with

    def smooth_l1_loss(pred, target, beta=1.0):  # from mmdetection
        assert beta > 0
        assert pred.size() == target.size() and target.numel() > 0
        diff = torch.abs(pred - target)
        loss = torch.where(diff < beta, 0.5 * diff * diff / beta,
                           diff - 0.5 * beta)
        return loss

    box_loss = smooth_l1_loss(pred_bbox_deltas[sampled_pos_inds],
                              regression_targets[sampled_pos_inds],
                              beta=1 / 9).sum() / (sampled_inds.numel())
In torchvision/models/detection/roi_heads.py, replace

    box_loss = F.smooth_l1_loss(
        box_regression[sampled_pos_inds_subset, labels_pos],
        regression_targets[sampled_pos_inds_subset],
        reduction="sum",
    )

with

    def smooth_l1_loss(pred, target, beta=1.0):  # from mmdetection
        assert beta > 0
        assert pred.size() == target.size() and target.numel() > 0
        diff = torch.abs(pred - target)
        loss = torch.where(diff < beta, 0.5 * diff * diff / beta,
                           diff - 0.5 * beta)
        return loss

    box_loss = smooth_l1_loss(box_regression[sampled_pos_inds_subset, labels_pos],
                              regression_targets[sampled_pos_inds_subset],
                              beta=1 / 9).sum()
| Repo | Network | box AP | scheduler | epochs | lr-steps | batch size | lr |
|:-----------------------------:|:-------------:|:----------:|:-------------:|:---------:|:----------------:|:--------------:|:--------:|
| vision | R-50 FPN | 37.0 | 1x | 13 | 8, 11 | 8 | 0.01 |
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.370
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.579
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.398
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.214
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.406
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.476
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.306
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.495
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.522
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.327
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.561
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.657
Training time 12:24:10
Hi all,
Wow, thanks for the investigation @potterhsu !
It seems that, given the current implementation of the model, smooth_l1 does have a significant impact on performance! This comes as a surprise to me.
Given the large difference, I think it might be worth changing the default implementation to use smooth_l1 instead. I'll retrain all the models accordingly to validate that we get similar numbers.
@potterhsu can you send a PR?
Sure, I've sent PR #2113 for this.