I was wondering about the function resnet_fpn_backbone defined in https://github.com/pytorch/vision/blob/master/torchvision/models/detection/backbone_utils.py: why does it freeze all layers except layer2, layer3 and layer4? Why do the first 7x7 conv and layer1 also need to be frozen?
Thanks!
@baiyongrui
Not a complete answer, but I would say it is because the 7x7 conv takes a lot of time to compute (and even more to backpropagate through) and is very unlikely to change, since it is pre-trained on ImageNet. The same probably holds for the first layer.
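For context, the freezing in backbone_utils is done by parameter name: anything whose name does not start with layer2/layer3/layer4 gets `requires_grad = False`. Here is a minimal sketch of that selection logic; the plain list of name strings stands in for `backbone.named_parameters()` so the sketch runs without torch, and the names mirror a torchvision ResNet:

```python
# Prefixes of the parameter groups that stay trainable in
# resnet_fpn_backbone; everything else (conv1, bn1, layer1) is frozen.
TRAINABLE_PREFIXES = ("layer2", "layer3", "layer4")

def freeze_early_layers(param_names):
    """Map each parameter name to the requires_grad flag it would get.

    In real torchvision code this loop calls
    parameter.requires_grad_(False) on the frozen ones instead.
    """
    return {name: name.startswith(TRAINABLE_PREFIXES) for name in param_names}

params = [
    "conv1.weight",
    "bn1.weight",
    "layer1.0.conv1.weight",
    "layer2.0.conv1.weight",
    "layer3.0.conv1.weight",
    "layer4.0.conv1.weight",
]
flags = freeze_early_layers(params)
# conv1, bn1 and layer1 end up frozen; layer2-4 stay trainable.
```

So the stem (7x7 conv + bn) and layer1 never receive gradient updates during detection training, which is exactly the behavior the question is about.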
This is what the original Fast R-CNN paper did (for other architectures, but this still holds for ResNet); see section 4.5 of the paper for more details, which I copy in part here.
Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.