I was wondering about the function resnet_fpn_backbone defined in https://github.com/pytorch/vision/blob/master/torchvision/models/detection/backbone_utils.py: why does it freeze all layers except layer2, layer3 and layer4? Why do the first 7x7 conv and layer1 also need to be frozen?
Thanks!
@baiyongrui
Not a complete answer, but I would say it is because the 7x7 conv takes a lot of time to compute (and even more to backpropagate through) and is very unlikely to change, since it is pre-trained on ImageNet. The same probably holds for the first layer.
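For context, the freezing in backbone_utils is done by parameter name: anything whose name does not start with layer2/layer3/layer4 gets `requires_grad = False`. Here is a minimal sketch of that selection logic; the plain list of name strings stands in for `backbone.named_parameters()` so the sketch runs without torch, and the names mirror a torchvision ResNet:

```python
# Prefixes of the parameter groups that stay trainable in
# resnet_fpn_backbone; everything else (conv1, bn1, layer1) is frozen.
TRAINABLE_PREFIXES = ("layer2", "layer3", "layer4")

def freeze_early_layers(param_names):
    """Map each parameter name to the requires_grad flag it would get.

    In real torchvision code this loop calls
    parameter.requires_grad_(False) on the frozen ones instead.
    """
    return {name: name.startswith(TRAINABLE_PREFIXES) for name in param_names}

params = [
    "conv1.weight",
    "bn1.weight",
    "layer1.0.conv1.weight",
    "layer2.0.conv1.weight",
    "layer3.0.conv1.weight",
    "layer4.0.conv1.weight",
]
flags = freeze_early_layers(params)
# conv1, bn1 and layer1 end up frozen; layer2-4 stay trainable.
```

So the stem (7x7 conv + bn) and layer1 never receive gradient updates during detection training, which is exactly the behavior the question is about.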
This is what the original Fast R-CNN paper did (for other architectures, but this still holds for ResNet); see section 4.5 of the paper for more details, which I copy in part here.
Does this mean that all conv layers should be fine-tuned? In short, no. In the smaller networks (S and M) we find that conv1 is generic and task independent (a well-known fact [14]). Allowing conv1 to learn, or not, has no meaningful effect on mAP. For VGG16, we found it only necessary to update layers from conv3_1 and up (9 of the 13 conv layers). This observation is pragmatic: (1) updating from conv2_1 slows training by 1.3× (12.5 vs. 9.5 hours) compared to learning from conv3_1; and (2) updating from conv1_1 over-runs GPU memory. The difference in mAP when learning from conv2_1 up was only +0.3 points (Table 5, last column). All Fast R-CNN results in this paper using VGG16 fine-tune layers conv3_1 and up; all experiments with models S and M fine-tune layers conv2 and up.