Hi, I'm not a deep learning expert, so I apologize if this is a trivial question.
I'm a bit confused by the
_C.MODEL.ROI_BOX_HEAD.POOLER_SCALES = (0.25, 0.125, 0.0625, 0.03125)
_C.MODEL.ROI_BOX_HEAD.POOLER_SAMPLING_RATIO = 2
parameters in the yacs config files. I looked at the Pooler and the relevant RoIAlign CUDA code, but I'm still not sure how these values are computed or what they mean. Could somebody please explain them? Thanks.
No need to apologize, there is no trivial question.
These scales correspond to the reduction in spatial resolution caused by the backbone's strides. By the way, a good grasp of the ResNet and ResNeXt architectures will help you follow this explanation.
For instance, suppose you found an RoI with coordinates [0, 0, 64, 64] in the input image, and suppose you want to pool its features from all of the backbone's levels (here, the backbone is a ResNet or ResNeXt architecture).
Since there is a stride of 2 in the conv1 layer and another stride of 2 at the end of the first block, the resulting feature map is 4x smaller than the original image, hence a scale of 0.25. And since there is a stride of 2 between consecutive convolution blocks of the backbone, the scale is divided by 2 at each subsequent level.
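As an illustrative sketch (not the library's actual code), the scales can be derived directly from the cumulative strides of the backbone levels:

```python
# Hypothetical sketch: deriving POOLER_SCALES from backbone strides.
# A ResNet/ResNeXt FPN backbone has cumulative strides of 4, 8, 16, 32
# at its four feature levels (conv1 stride 2, plus a stride of 2 per block).
cumulative_strides = [4, 8, 16, 32]

# Each pooler scale is simply the inverse of the cumulative stride.
pooler_scales = tuple(1.0 / s for s in cumulative_strides)
print(pooler_scales)  # (0.25, 0.125, 0.0625, 0.03125)
```

This reproduces exactly the tuple set in _C.MODEL.ROI_BOX_HEAD.POOLER_SCALES.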
Hence, the coordinates of your RoI will be:
- [0, 0, 16, 16] in the first-level feature map
- [0, 0, 8, 8] in the second-level feature map
- [0, 0, 4, 4] in the third-level feature map
- [0, 0, 2, 2] in the fourth-level feature map

The sampling_ratio parameter determines how many sampling points per output bin (per spatial dimension) are used in the bilinear interpolation of the RoIAlign algorithm.
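The per-level coordinate mapping above can be sketched as follows (an illustrative snippet, not the Pooler's actual implementation; variable names are my own):

```python
# Hypothetical sketch: mapping an image-space RoI onto each feature level,
# and counting RoIAlign bilinear sampling points per output bin.
roi = [0, 0, 64, 64]  # [x1, y1, x2, y2] in input-image coordinates
scales = (0.25, 0.125, 0.0625, 0.03125)  # POOLER_SCALES
sampling_ratio = 2  # POOLER_SAMPLING_RATIO

# Multiply each coordinate by the level's scale to get feature-map coordinates.
for level, scale in enumerate(scales, start=1):
    scaled_roi = [coord * scale for coord in roi]
    print(f"level {level}: {scaled_roi}")
# level 1: [0.0, 0.0, 16.0, 16.0]
# level 2: [0.0, 0.0, 8.0, 8.0]
# level 3: [0.0, 0.0, 4.0, 4.0]
# level 4: [0.0, 0.0, 2.0, 2.0]

# With sampling_ratio = 2, each output bin is sampled on a 2x2 grid,
# so 4 bilinearly interpolated values are averaged per bin.
points_per_bin = sampling_ratio * sampling_ratio
print(points_per_bin)  # 4
```

Note that RoIAlign keeps the fractional coordinates (no rounding), which is precisely what distinguishes it from the older RoIPool quantization.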
Thank you very much @LeviViana . That is a very good explanation!
Thanks for your great explanation @LeviViana !
thanks