Vision: Question about Faster R-CNN ResNet_FPN backbone

Created on 21 Aug 2019 · 2 comments · Source: pytorch/vision

I was wondering about the function resnet_fpn_backbone defined in https://github.com/pytorch/vision/blob/master/torchvision/models/detection/backbone_utils.py: why does it freeze all layers except layer2, layer3, and layer4? Why do the first 7x7 conv and layer1 also need to be frozen?
Thanks!

models question object detection

All 2 comments

@baiyongrui
Not a complete answer, but I would say it is because the 7x7 conv takes a lot of time to compute (and even more to backpropagate through) and is very unlikely to change, since it is pre-trained on ImageNet. The same probably holds for layer1.

This is what the original Fast R-CNN paper did (for other architectures, but this still holds for ResNet); see section 4.5 in the paper for more details, which I copy in part here.

Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.
