Py-faster-rcnn: Why does the AnchorTargetLayer take rpn_cls_score as input?

Created on 19 May 2016 · 8 Comments · Source: rbgirshick/py-faster-rcnn

I'm trying to understand the train.prototxt file of faster-rcnn, and I can't figure out why this layer takes rpn_cls_score as one of its bottoms.

layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'
  bottom: 'gt_boxes'
  bottom: 'im_info'
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16 \n'scales': !!python/tuple [4, 8, 16, 32]"
  }
}

And when you take a close look at the implementation, it never uses all four of the bottoms. Below is the beginning of the forward function of this layer. In this function, (H, W) is the size of the input image, isn't it?

def forward(self, bottom, top):
        # Algorithm:
        #
        # for each (H, W) location i
        #   generate 9 anchor boxes centered on cell i
        #   apply predicted bbox deltas at cell i to each of the 9 anchors
        # filter out-of-image anchors
        # measure GT overlap

        assert bottom[0].data.shape[0] == 1, \
            'Only single item batches are supported'

        # map of shape (..., H, W)
        height, width = bottom[0].data.shape[-2:]
        # GT boxes (x1, y1, x2, y2, label)
        gt_boxes = bottom[1].data
        # im_info
        im_info = bottom[2].data[0, :]

        if DEBUG:
            print ''
            print 'im_size: ({}, {})'.format(im_info[0], im_info[1])
            print 'scale: {}'.format(im_info[2])
            print 'height, width: ({}, {})'.format(height, width)
            print 'rpn: gt_boxes.shape', gt_boxes.shape
            print 'rpn: gt_boxes', gt_boxes

All 8 comments

(H, W) is NOT the size of the input image. It is the size of the responses at the 5th convolutional layer. Usually this is the resolution of the image downsampled by a factor of 16 (plus or minus offsets due to pooling and padding).
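
To make the factor of 16 concrete, here is a minimal sketch of the size arithmetic. It assumes a VGG16-style backbone in which size-preserving convolutions (3x3, pad 1) are interleaved with four 2x2, stride-2 pooling stages before the 5th convolutional block, and that pooling rounds up as Caffe does. The function names are hypothetical and not part of py-faster-rcnn.

```python
import math

def pooled_size(size, kernel=2, stride=2, pad=0):
    """Output length of one pooling stage, assuming Caffe-style ceil rounding."""
    return int(math.ceil((size + 2 * pad - kernel) / float(stride))) + 1

def feature_map_size(im_height, im_width, num_pools=4):
    """Spatial size (H, W) of the feature map after num_pools pooling stages.

    Assumes every convolution in between preserves spatial size,
    so only the pooling stages shrink the map.
    """
    h, w = im_height, im_width
    for _ in range(num_pools):
        h, w = pooled_size(h), pooled_size(w)
    return h, w

H, W = feature_map_size(600, 1000)
print(H, W)  # 38 63
```

Under these assumptions a 600 x 1000 input gives a 38 x 63 map, which is close to but not exactly 600/16 x 1000/16; that difference is the "offset" mentioned above.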

@happyharrycn OK, but when you look at the model, for example pascal/vgg16/faster_rcnn_end2end, in train.prototxt the AnchorTargetLayer takes both the DataLayer output and rpn_cls_score as bottoms! It's strange, isn't it?

@jrabary The layer needs to know the downsampled size from rpn_cls_score in order to generate anchors.
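
As an illustration of why the feature map size is needed, here is a rough sketch of placing a set of base anchors at every cell of an (H, W) map. It is a simplified pure-Python version written for this thread, not the layer's actual code; `shift_anchors` and the base anchor values are placeholders.

```python
def shift_anchors(base_anchors, height, width, feat_stride=16):
    """Place each base anchor at every cell of a (height, width) feature map.

    base_anchors: list of (x1, y1, x2, y2) boxes in image coordinates,
                  defined relative to the top-left cell.
    Returns a list of height * width * len(base_anchors) boxes.
    """
    anchors = []
    for row in range(height):
        for col in range(width):
            # Each feature map cell corresponds to a feat_stride step in the image.
            dx, dy = col * feat_stride, row * feat_stride
            for (x1, y1, x2, y2) in base_anchors:
                anchors.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return anchors

# Placeholder base anchors for illustration only; not the values the layer generates.
base = [(-8, -8, 23, 23), (-24, -24, 39, 39)]
all_anchors = shift_anchors(base, height=38, width=63)
print(len(all_anchors))  # 38 * 63 * 2 = 4788
```

Only the last two dimensions of rpn_cls_score are used here; the score values themselves play no part in generating the anchors.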

If you are talking about bottom: 'data', it is not necessary. It is not used in the code.

I got it. It's a little bit confusing.

@happyharrycn The 'data' bottom could be used to compute the stride, but the author didn't do so.

@nnop Can you explain a bit more? Why do we need the stride, and why do we need data to compute it?

If all training images are the same size, I think 'feat_stride' could be derived by dividing the image width (or height) by the width (or height) of the 'rpn_cls_score' feature map. Say, in AnchorTargetLayer.forward():

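        # size of the 'rpn_cls_score' feature map
        height, width = bottom[0].data.shape[-2:]
        # im_info holds (image height, image width, scale)
        im_info = bottom[2].data[0, :]

        # round rather than truncate: the ratio is usually slightly below the true stride
        feat_stride = int(round(im_info[1] / float(width)))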
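
One caveat with deriving the stride by division: if pooling rounds up, the feature map is slightly larger than image size / 16, so the ratio falls just below 16 and truncating it gives 15. The sketch below shows the difference on example sizes. The helper name is hypothetical, and the 1000 -> 63 mapping assumes four ceil-mode 2x2 pooling stages.

```python
def derive_feat_stride(im_width, feat_width):
    """Estimate the stride by rounding the image / feature-map width ratio."""
    return int(round(im_width / float(feat_width)))

im_width, feat_width = 1000, 63  # example sizes; 63 assumes ceil-mode pooling
print(int(im_width / float(feat_width)))         # 15 (truncated)
print(derive_feat_stride(im_width, feat_width))  # 16 (rounded)
```

Even rounding can be off for small inputs (under the same assumptions a width of 100 maps to 7, and 100 / 7 rounds to 14), so a derived value is only an estimate compared with passing 'feat_stride' explicitly in param_str.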