Py-faster-rcnn: Why does the AnchorTargetLayer take rpn_cls_score as input?

Created on 19 May 2016 · 8 Comments · Source: rbgirshick/py-faster-rcnn

I'm trying to understand the train.prototxt file of faster-rcnn, and I can't figure out why this layer takes rpn_cls_score as one of its bottoms:

layer {
  name: 'rpn-data'
  type: 'Python'
  bottom: 'rpn_cls_score'
  bottom: 'gt_boxes'
  bottom: 'im_info'
  bottom: 'data'
  top: 'rpn_labels'
  top: 'rpn_bbox_targets'
  top: 'rpn_bbox_inside_weights'
  top: 'rpn_bbox_outside_weights'
  python_param {
    module: 'rpn.anchor_target_layer'
    layer: 'AnchorTargetLayer'
    param_str: "'feat_stride': 16 \n'scales': !!python/tuple [4, 8, 16, 32]"
  }
}

And when you take a close look at the implementation, at no point does it use all four bottoms. Below is the beginning of the forward function of this layer. In this function, (H, W) is the size of the input image, isn't it?

def forward(self, bottom, top):
        # Algorithm:
        #
        # for each (H, W) location i
        #   generate 9 anchor boxes centered on cell i
        #   apply predicted bbox deltas at cell i to each of the 9 anchors
        # filter out-of-image anchors
        # measure GT overlap

        assert bottom[0].data.shape[0] == 1, \
            'Only single item batches are supported'

        # map of shape (..., H, W)
        height, width = bottom[0].data.shape[-2:]
        # GT boxes (x1, y1, x2, y2, label)
        gt_boxes = bottom[1].data
        # im_info
        im_info = bottom[2].data[0, :]

        if DEBUG:
            print ''
            print 'im_size: ({}, {})'.format(im_info[0], im_info[1])
            print 'scale: {}'.format(im_info[2])
            print 'height, width: ({}, {})'.format(height, width)
            print 'rpn: gt_boxes.shape', gt_boxes.shape
            print 'rpn: gt_boxes', gt_boxes

jrabary

Most helpful comment

(H, W) is NOT the size of the input image. It is the spatial size of the responses (the feature map) at the 5th convolutional layer. Usually, this is the input image resolution downsampled by a factor of 16 (plus or minus offsets due to pooling/padding).

happyharrycn on 19 May 2016

👍2
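
To make the factor of 16 concrete, here is a minimal sketch. It assumes a VGG16-style backbone in which only four 2x2, stride-2 pooling layers change the spatial size before the RPN, and that each pooling rounds the output size up. The function name and the example image size are illustrative and not taken from the repository.

```python
import math

def feature_map_size(image_height, image_width, num_poolings=4):
    """Estimate the (H, W) of the feature map seen by the RPN.

    Assumption: each pooling halves the size and rounds up; the
    convolutions in between are padded so they keep the size unchanged.
    """
    h, w = image_height, image_width
    for _ in range(num_poolings):
        h = int(math.ceil(h / 2.0))
        w = int(math.ceil(w / 2.0))
    return h, w

# A 600 x 1000 input gives a 38 x 63 map, roughly 1/16 of each side.
size = feature_map_size(600, 1000)  # -> (38, 63)
```

Under these assumptions, H and W are much smaller than the image, and they are only approximately the image size divided by 16 because of the rounding in each pooling step.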

All 8 comments

@happyharrycn OK, but when you look at the model, for example pascal/vgg16/faster_rcnn_end2end, the AnchorTargetLayer in train.prototxt takes both the DataLayer and rpn_cls_score as bottoms! It's strange, isn't it?

jrabary on 19 May 2016

@jrabary The layer needs to know the downsampled size from rpn_cls_score in order to generate anchors.

yzhsjtu on 19 May 2016
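
A rough sketch of why the (H, W) of rpn_cls_score is needed: every base anchor is copied to every cell of the feature map, shifted by feat_stride pixels per cell. This is a simplified pure-Python illustration, not the layer's actual code, and the two base anchors below are made-up values.

```python
def shifted_anchors(height, width, feat_stride, base_anchors):
    """Place every base anchor at every cell of a (height, width) feature map.

    Boxes are (x1, y1, x2, y2) in image coordinates. The result has
    height * width * len(base_anchors) entries.
    """
    anchors = []
    for y in range(height):
        for x in range(width):
            # Offset of this feature map cell in image pixels.
            dx, dy = x * feat_stride, y * feat_stride
            for (x1, y1, x2, y2) in base_anchors:
                anchors.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return anchors

# Hypothetical base anchors, for illustration only.
base = [(0, 0, 15, 15), (-8, -8, 23, 23)]
all_anchors = shifted_anchors(2, 3, 16, base)
```

Only the shape of rpn_cls_score matters here; its values are not read when the anchors are laid out.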

If you are talking about bottom: 'data', it is not necessary. It is not used in the code.

happyharrycn on 19 May 2016

I got it. It's a little bit confusing.

jrabary on 19 May 2016

The 'data' bottom could be used to compute the stride, but the author didn't do that. @happyharrycn

nnop on 17 Aug 2017

Can you explain a bit more? Why do we need the stride, and why do we need data to compute it? @nnop

wenyafei4 on 7 Sep 2017

If all training images are the same size, I think 'feat_stride' could be derived by dividing the image width (or height) by the width (or height) of the 'rpn_cls_score' feature map. Say, in AnchorTargetLayer.forward(),

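        # size of the 'rpn_cls_score' feature map
        height, width = bottom[0].data.shape[-2:]
        # im_info holds (image height, image width, scale)
        im_info = bottom[2].data[0, :]

        # round rather than truncate: pooling rounds the feature map size,
        # so the ratio is generally not an exact integer
        feat_stride = int(round(im_info[1] / width))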
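
One caveat with deriving the stride this way: the ratio of image width to feature map width is usually not an exact integer, so truncating it can give 15 instead of 16. The sketch below uses hypothetical sizes (a 1000-pixel-wide image and a 63-column feature map), and the helper name is illustrative.

```python
def estimate_feat_stride(image_width, feature_width):
    """Estimate the stride as the rounded ratio of image width to feature map width."""
    return int(round(float(image_width) / feature_width))

# Hypothetical sizes: a 1000-pixel-wide image and a 63-column feature map.
ratio = 1000.0 / 63                        # about 15.87, not an exact integer
truncated = int(ratio)                     # 15, off by one
rounded = estimate_feat_stride(1000, 63)   # 16
```

Rounding recovers 16 in this case, but it is still an estimate, which may be one reason to pass 'feat_stride' explicitly in param_str.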