I'm working on the source code of Mask_RCNN and I found something interesting.
Code:
print(config.RPN_ANCHOR_SCALES)
print(config.RPN_ANCHOR_RATIOS)
print(config.BACKBONE_SHAPES)
print(config.BACKBONE_STRIDES)
print(config.RPN_ANCHOR_STRIDE)
output:
(8, 16, 32, 64, 128)
[0.5, 1, 2]
[[32 32]
[16 16]
[ 8 8]
[ 4 4]
[ 2 2]]
[4, 8, 16, 32, 64]
1
We have 5 feature maps of different sizes: 32*32, 16*16, 8*8, 4*4, 2*2.
At each pixel of each feature map, we generate 3 anchors with different ratios. In other words, the number of anchors per feature map should be [32*32, 16*16, 8*8, 4*4, 2*2] * 3, but I find that the number of anchors generated by the function generate_pyramid_anchors() is three times that.
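To make the discrepancy concrete, here is a quick count (the per-level feature-map sizes and the observed counts are taken from the outputs in this thread; the factor of 3 comes from the three entries in RPN_ANCHOR_RATIOS):

```python
# Expected anchor counts: one anchor per feature-map pixel per ratio.
feature_shapes = [(32, 32), (16, 16), (8, 8), (4, 4), (2, 2)]
num_ratios = 3  # [0.5, 1, 2]

expected = [h * w * num_ratios for h, w in feature_shapes]
print(expected)       # [3072, 768, 192, 48, 12]
print(sum(expected))  # 4092

# Counts reported by generate_pyramid_anchors() in the run below:
observed = [9216, 2304, 576, 144, 36]
print([o // e for o, e in zip(observed, expected)])  # [3, 3, 3, 3, 3]
```

So every level has exactly 3x the expected number of anchors, which matches the triplicated rows shown later in this thread.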
Code:
boxes = generate_pyramid_anchors(config.RPN_ANCHOR_SCALES,
                                 config.RPN_ANCHOR_RATIOS,
                                 config.BACKBONE_SHAPES,
                                 config.BACKBONE_STRIDES,
                                 config.RPN_ANCHOR_STRIDE)
output:
scales= 8 , shape= [32 32]
boxes.shape= (9216, 4)
scales= 16 , shape= [16 16]
boxes.shape= (2304, 4)
scales= 32 , shape= [8 8]
boxes.shape= (576, 4)
scales= 64 , shape= [4 4]
boxes.shape= (144, 4)
scales= 128 , shape= [2 2]
boxes.shape= (36, 4)
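For anyone following along, here is a condensed, self-contained version of the anchor-generation logic from the repo's utils.generate_anchors() (names match the original; slightly trimmed). With a single scale and 3 ratios it should yield height * width * 3 anchors per level, e.g. 32 * 32 * 3 = 3072 for the first level rather than the 9216 reported above:

```python
import numpy as np

def generate_anchors(scales, ratios, shape, feature_stride, anchor_stride):
    """Generate anchor boxes [y1, x1, y2, x2] for one pyramid level."""
    # All combinations of scales and ratios (here: 1 scale x 3 ratios = 3).
    scales, ratios = np.meshgrid(np.array(scales), np.array(ratios))
    scales = scales.flatten()
    ratios = ratios.flatten()

    # Anchor heights and widths; area stays ~scale^2 for every ratio.
    heights = scales / np.sqrt(ratios)
    widths = scales * np.sqrt(ratios)

    # Anchor center positions in image coordinates.
    shifts_y = np.arange(0, shape[0], anchor_stride) * feature_stride
    shifts_x = np.arange(0, shape[1], anchor_stride) * feature_stride
    shifts_x, shifts_y = np.meshgrid(shifts_x, shifts_y)

    # Combine every center with every (height, width) pair.
    box_widths, box_centers_x = np.meshgrid(widths, shifts_x)
    box_heights, box_centers_y = np.meshgrid(heights, shifts_y)
    box_centers = np.stack([box_centers_y, box_centers_x], axis=2).reshape([-1, 2])
    box_sizes = np.stack([box_heights, box_widths], axis=2).reshape([-1, 2])

    # Convert (center, size) to corner coordinates [y1, x1, y2, x2].
    return np.concatenate([box_centers - 0.5 * box_sizes,
                           box_centers + 0.5 * box_sizes], axis=1)

# First pyramid level from the config above: scale 8, 32x32 map, stride 4.
boxes = generate_anchors(8, [0.5, 1, 2], [32, 32], 4, 1)
print(boxes.shape)  # (3072, 4)
```

The three anchors at each position are distinct (one per ratio), so consecutive duplicated rows like those in the output below are not produced by this function on its own.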
Code:
print(boxes[-36:])
output:
array([[ -90.50966799, -45.254834 , 90.50966799, 45.254834 ],
[ -90.50966799, -45.254834 , 90.50966799, 45.254834 ],
[ -90.50966799, -45.254834 , 90.50966799, 45.254834 ],
[ -64. , -64. , 64. , 64. ],
[ -64. , -64. , 64. , 64. ],
[ -64. , -64. , 64. , 64. ],
[ -45.254834 , -90.50966799, 45.254834 , 90.50966799],
[ -45.254834 , -90.50966799, 45.254834 , 90.50966799],
[ -45.254834 , -90.50966799, 45.254834 , 90.50966799],
[ -90.50966799, 18.745166 , 90.50966799, 109.254834 ],
[ -90.50966799, 18.745166 , 90.50966799, 109.254834 ],
[ -90.50966799, 18.745166 , 90.50966799, 109.254834 ],
[ -64. , 0. , 64. , 128. ],
[ -64. , 0. , 64. , 128. ],
[ -64. , 0. , 64. , 128. ],
[ -45.254834 , -26.50966799, 45.254834 , 154.50966799],
[ -45.254834 , -26.50966799, 45.254834 , 154.50966799],
......
Those anchors are repeated 3 times. I wonder, is this a bug or just for convenience?
@Mabinogiysk I find that in the original paper, they used k = 9 anchors (3 scales and 3 ratios). But in this version, they only generate 3 (1 scale, 3 ratios). I'm confused about this.
@Superlee506
This implementation of Mask R-CNN uses only 1 scale with 3 ratios per level for anchors because it incorporates FPN. As stated in the FPN paper, Section 4.1, Feature Pyramid Networks for RPN:
"Because the head slides densely over all locations at all pyramid levels, it is not necessary to have multi-scale anchors on a specific level. Instead, we assign anchors of a single scale to each level."
In other words, the FPN takes care of the scale issue by virtue of having different pyramid levels, each addressing a different scale. Thus, there is no need for multiple anchor scales at each FPN level; we simply need different anchor ratios at each level.
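As a sanity check on this, the repeated extents in the last 36 boxes printed above (+-90.50966799, +-45.254834, +-64) are exactly what the single scale assigned to the coarsest level (128) produces with the three ratios, since anchor height and width are scale/sqrt(ratio) and scale*sqrt(ratio):

```python
import numpy as np

scale = 128.0        # the one scale assigned to the coarsest level
ratios = [0.5, 1, 2]

rows = []
for r in ratios:
    h = scale / np.sqrt(r)  # anchor height
    w = scale * np.sqrt(r)  # anchor width
    # Box centered at (0, 0) in [y1, x1, y2, x2] order.
    rows.append([-0.5 * h, -0.5 * w, 0.5 * h, 0.5 * w])
    print(rows[-1])
# ratio 0.5 -> [-90.5096..., -45.2548..., 90.5096..., 45.2548...]
# ratio 1   -> [-64.0, -64.0, 64.0, 64.0]
# ratio 2   -> [-45.2548..., -90.5096..., 45.2548..., 90.5096...]
```

Each ratio preserves the anchor's area (scale^2 = 16384), which is why only the aspect changes across the three boxes at each location.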
@FruVirus Copy that, thanks