I'm trying to understand the train.prototxt file of faster-rcnn, and I can't figure out why this layer takes rpn_cls_score as one of its bottoms:
layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'
  bottom: 'gt_boxes'
  bottom: 'im_info'
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16 \n'scales': !!python/tuple [4, 8, 16, 32]"
  }
}
And when you take a close look at the implementation, at no time does it use all four of the bottoms. Below is the beginning of the forward function of this layer. In this function, (H, W) is the size of the input image, isn't it?
def forward(self, bottom, top):
    # Algorithm:
    #
    # for each (H, W) location i
    #   generate 9 anchor boxes centered on cell i
    #   apply predicted bbox deltas at cell i to each of the 9 anchors
    # filter out-of-image anchors
    # measure GT overlap

    assert bottom[0].data.shape[0] == 1, \
        'Only single item batches are supported'

    # map of shape (..., H, W)
    height, width = bottom[0].data.shape[-2:]
    # GT boxes (x1, y1, x2, y2, label)
    gt_boxes = bottom[1].data
    # im_info
    im_info = bottom[2].data[0, :]

    if DEBUG:
        print ''
        print 'im_size: ({}, {})'.format(im_info[0], im_info[1])
        print 'scale: {}'.format(im_info[2])
        print 'height, width: ({}, {})'.format(height, width)
        print 'rpn: gt_boxes.shape', gt_boxes.shape
        print 'rpn: gt_boxes', gt_boxes
(H, W) is NOT the size of the input image. It is the size of the response map at the (5th) convolutional layer. Usually this is the image resolution downsampled by a factor of 16 (plus or minus offsets due to pooling and padding).
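To make the factor of 16 concrete, here is a minimal sketch of how the conv5 map size could be estimated. It assumes the VGG16 layout (four 2x2, stride-2 max-pooling layers before the last conv5 layer, with convolutions that preserve spatial size) and Caffe-style pooling arithmetic that rounds up. The function names are hypothetical and are not part of py-faster-rcnn.

```python
import math

def pooled_size(size, kernel=2, stride=2, pad=0):
    # Caffe-style pooling output size: rounds up instead of down.
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

def conv5_size(im_height, im_width, num_pools=4):
    # Assumption: only the pooling layers change the spatial size.
    h, w = im_height, im_width
    for _ in range(num_pools):
        h, w = pooled_size(h), pooled_size(w)
    return h, w

print(conv5_size(600, 1000))  # (38, 63)
```

For a 600 x 1000 input this gives a 38 x 63 map, which is close to but not exactly 600/16 x 1000/16. That is the "offset due to pooling" mentioned above.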
@happyharrycn OK, but when you look at the model, for example pascal/vgg16/faster_rcnn_end2end, in the train.prototxt the AnchorTargetLayer takes both the data layer and rpn_cls_score as bottoms! It's strange, isn't it?
@jrabary The layer needs to know the downsampled size from rpn_cls_score in order to generate anchors.
If you are talking about bottom: 'data', it is not necessary. It is not used in the code.
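As a rough illustration of why the feature map size is needed, the sketch below places a copy of each base anchor at every (H, W) cell, spaced feat_stride pixels apart in image coordinates. This is a simplified pure-Python version of the idea, not the layer's actual code; the helper names and the single base anchor are made up for the example.

```python
def anchor_shifts(height, width, feat_stride=16):
    # One (x1, y1, x2, y2) offset per feature-map cell, in image pixels.
    shifts = []
    for y in range(height):
        for x in range(width):
            sx, sy = x * feat_stride, y * feat_stride
            shifts.append((sx, sy, sx, sy))
    return shifts

def shift_anchors(base_anchors, shifts):
    # Every base anchor is copied to every cell: K cells * A anchors.
    return [tuple(a + s for a, s in zip(anchor, shift))
            for shift in shifts for anchor in base_anchors]

base = [(-8, -8, 23, 23)]  # hypothetical single base anchor
all_anchors = shift_anchors(base, anchor_shifts(2, 3))
print(len(all_anchors))  # 6
print(all_anchors[-1])   # (24, 8, 55, 39)
```

Only height, width and the stride enter the computation, which is consistent with the layer reading nothing but the shape of rpn_cls_score.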
I got it. It's a little bit confusing.
bottom: 'data' could be used to compute the stride, but the author didn't do so. @happyharrycn
Can you explain a bit more? Why do we need the stride, and why do we need data to compute it? @nnop
If all training images are the same size, I think 'feat_stride' could be derived by dividing the image width (or height) by the width (or height) of the 'rpn_cls_score' feature map. Say, in AnchorTargetLayer.forward():
# size of the 'rpn_cls_score' feature map
height, width = bottom[0].data.shape[-2:]
# im_info
im_info = bottom[2].data[0, :]
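# round rather than truncate: the feature map size is rounded up by pooling,
# so the ratio usually comes out slightly below the true stride
feat_stride = int(round(im_info[1] / width))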
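One caveat with deriving the stride by division: the feature map size is rounded up by the pooling layers, so the ratio is rarely an exact integer. The small check below assumes a 1000-pixel-wide image and a 63-cell-wide feature map (illustrative numbers, with hypothetical helper names) and compares truncating with rounding.

```python
def stride_truncated(im_width, feat_width):
    # int() drops the fractional part.
    return int(im_width / float(feat_width))

def stride_rounded(im_width, feat_width):
    # round() picks the nearest integer before converting.
    return int(round(im_width / float(feat_width)))

print(stride_truncated(1000, 63))  # 15
print(stride_rounded(1000, 63))    # 16
```

With these numbers, truncation gives 15 while rounding recovers the configured 'feat_stride' of 16, which may be one reason to keep the stride as an explicit parameter.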