Mask_RCNN: BBOX_STD_DEV usage is unclear and possibly incorrect?

Created on 21 Feb 2018 · 22 comments · Source: matterport/Mask_RCNN

This is a really great project. I love the level of comments, the working notebooks, and the simple shapes dataset; all of it makes the code exceptionally easy to read.

One thing that is not clear: in the RPN proposal layer, the deltas are multiplied by BBOX_STD_DEV. A previous issue asked about this: https://github.com/matterport/Mask_RCNN/issues/85. The answer given refers to the Fast R-CNN paper's section on normalising the regression targets for the loss function.

However, this multiplication is applied after the RPN loss function, and before the bbox spec is normalised to the 0-1 range. Also, two of the four delta values are logarithms. I could envisage dividing by the standard deviation as part of normalisation, but here the values are multiplied, two of them are logs, and it is not in the right place for that.
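For reference, the deltas in question follow the usual Faster R-CNN box parameterization. A minimal sketch, where `box_deltas` is a hypothetical name and boxes are assumed to be in [y1, x1, y2, x2] form:

```python
import numpy as np

def box_deltas(anchor, gt):
    """Deltas that transform `anchor` into `gt`: the two centre shifts
    are normalized by anchor size; height/width deltas are log ratios."""
    ah, aw = anchor[2] - anchor[0], anchor[3] - anchor[1]
    acy, acx = anchor[0] + 0.5 * ah, anchor[1] + 0.5 * aw
    gh, gw = gt[2] - gt[0], gt[3] - gt[1]
    gcy, gcx = gt[0] + 0.5 * gh, gt[1] + 0.5 * gw
    return np.array([(gcy - acy) / ah, (gcx - acx) / aw,
                     np.log(gh / ah), np.log(gw / aw)])

# Example: a 10x10 anchor shifted by (5, 5) to reach the ground-truth box.
deltas = box_deltas(np.array([0.0, 0.0, 10.0, 10.0]),
                    np.array([5.0, 5.0, 15.0, 15.0]))
```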


All 22 comments

Do you know where the exact values for BBOX_STD_DEV come from? It might provide some insight into why those values are multiplied even for the log(delta) elements. I'm curious to know this as well.

Also a good question. The exact values in the default config file are 0.1, 0.1, 0.2, 0.2. If those are standard deviations, then they are presumably the standard deviations of fairly small quantities.


Looking at the threads on the Faster R-CNN channels, it seems this question was never answered either. I suppose one way to test it out is just to train on a dataset and try it with and without the standard deviation normalization and play around with the numbers.

I get the feeling that these numbers were found empirically by R. Girshick when he was training on the COCO dataset, and they are likely to differ depending on the particular dataset.

@simonm3 The regressor performs best if its outputs have a mean of zero and a standard deviation of one. To achieve that we need to do two transformations:

  1. When preparing the training targets for the regressor, subtract the mean and divide by the standard deviation. This will give the regressor normalized training targets. That also means that the regressor is going to predict normalized values rather than the actual values we would use to shift and resize the anchors.
  2. After training is complete, when we want to use the output from the regressor, we multiply by the standard deviation and add the mean to reverse the transformation in 1. This converts the predicted normalized values into real values we can use to shift and resize the anchor boxes.

Knowing that anchors cover the image uniformly, it's fair to assume that shifts and resizes are evenly distributed between positive and negative, so we assume the mean is close to zero already, and skip the part I mentioned above about subtracting the mean.
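The two steps above can be sketched as follows (a minimal illustration; `normalize_deltas` and `denormalize_deltas` are hypothetical helper names, and the mean is taken to be zero as discussed):

```python
import numpy as np

# Per-component standard deviations, matching the defaults quoted below.
BBOX_STD_DEV = np.array([0.1, 0.1, 0.2, 0.2])

def normalize_deltas(deltas):
    # Step 1: real training targets -> normalized regression targets.
    return deltas / BBOX_STD_DEV

def denormalize_deltas(predicted):
    # Step 2: normalized network outputs -> real deltas used to
    # shift and resize the anchor boxes.
    return predicted * BBOX_STD_DEV

raw = np.array([0.05, -0.02, 0.1, -0.3])
norm = normalize_deltas(raw)
```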

For the standard deviation, ideally we'd measure it across the dataset. In this case, as @FruVirus guessed, I used the same values that R. Girshick used in Faster RCNN. From their config file:

# Normalize the targets using "precomputed" (or made up) means and stdevs
# (BBOX_NORMALIZE_TARGETS must also be True)
__C.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED = False
__C.TRAIN.BBOX_NORMALIZE_MEANS = (0.0, 0.0, 0.0, 0.0)
__C.TRAIN.BBOX_NORMALIZE_STDS = (0.1, 0.1, 0.2, 0.2)

Thanks for the great explanation.

How would you measure the standard deviation if you really wanted to?

@CMCDragonkai I suppose you can set the std values in BBOX_STD_DEV all to 1 to see how large the stdevs of the relative shifts really are before normalization.

Concretely, in the data generator section of inspect_data.ipynb, after you get the rpn_bbox,

import numpy as np

# Stdev of each of the four delta components, ignoring the
# zero entries that come from padding/unused anchors.
std_list = []
for i in range(4):
    a = rpn_bbox[:, :, i]
    std_list.append(np.std(a[np.nonzero(a)]))
print(std_list)

should give you the stdev list.

But there are two:

    # Bounding box refinement standard deviation for RPN and final detections.
    RPN_BBOX_STD_DEV = np.array([0.1, 0.1, 0.2, 0.2])
    BBOX_STD_DEV = np.array([0.1, 0.1, 0.2, 0.2])

If we measure it exactly as in your example, are you saying we should use those as our hyperparameters?

I think you should be able to use those results as hyperparameters in place of the default [0.1, 0.1, 0.2, 0.2]. That said, I don't think tuning these parameters precisely will give you much of a performance boost, as long as they are of the right order of magnitude. The default values are pretty good estimates in that regard, since the stdev of a uniform distribution between 0 and 1 is (1/12) ** 0.5 ≈ 0.29.
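The sqrt(1/12) figure can be sanity-checked numerically:

```python
import numpy as np

# The standard deviation of a uniform distribution on [0, 1]
# is sqrt(1/12) ~= 0.289, the figure quoted above.
analytic = (1 / 12) ** 0.5

# Empirical check with a million uniform samples.
samples = np.random.default_rng(0).uniform(0.0, 1.0, 1_000_000)
empirical = samples.std()
print(analytic, empirical)
```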

The anchors are uniform across the image, but what is being normalized are the deltas that transform the anchors into the correct bounding box. That being said, why would the deltas have a mean of zero? This is only the case if a bounding box is located in the center. If the point of interest is located off center, then wouldn't you also need to subtract the mean deltas?

@gaklein2007 The deltas are the difference in coordinates between a ground-truth box and its nearest anchor box (or the anchor box with the best IoU with the bbox in question), not only an anchor box at the center. The relative shift of any bbox from its nearest anchor box is uniformly random, therefore the mean is roughly zero.

@patrick-12sigma I am not referring to an anchor box in the center, but to a ground-truth bounding box. I understand that the distance between any bounding box and its closest anchors averages to zero, since anchors surround a bounding box equally from all sides. However, when building the RPN deltas in the generator, negative anchors are also added to include background, and since the bounding-box target is not necessarily centered, the deltas do not necessarily average to zero, especially if the object of interest changes position from sample to sample.

Actually, disregard that last point as I forgot for the deltas negatives are not considered, only positive anchors.

That is right. Deltas for negative cases are not accounted for in the regression loss.
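A minimal sketch of that masking (hypothetical variable names; the real implementation lives in the model's loss functions): only anchors marked positive contribute to the box-regression loss.

```python
import numpy as np

# rpn_match marks each anchor: 1 = positive, -1 = negative, 0 = neutral.
rpn_match = np.array([1, -1, 0, 1])
pred_deltas = np.array([[0.1, 0.0, 0.2, 0.1],
                        [0.3, 0.3, 0.3, 0.3],   # negative anchor: ignored
                        [0.0, 0.0, 0.0, 0.0],   # neutral anchor: ignored
                        [0.1, 0.0, 0.2, 0.1]])
target_deltas = np.zeros((4, 4))

# Only positive anchors enter the regression loss (plain L1 here,
# rather than the smooth-L1 used in practice).
pos = rpn_match == 1
loss = np.abs(pred_deltas[pos] - target_deltas[pos]).mean()
```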

If my regression is [y, x, h, w, theta], e.g. for rotated boxes,
how should I set BBOX_NORMALIZE_MEANS and BBOX_NORMALIZE_STDS for the rotation?
Still [0.0, 0.0, 0.0, 0.0, 0.0] and [0.1, 0.1, 0.2, 0.2, 0.2]?

@patrick-12sigma I think the issue is with my ground truth boxes. I looked at the mean and standard deviation of the RPN deltas across all my data and noticed they were not close to zero mean and (0.1, 0.1, 0.2, 0.2) std. Wondering if I had screwed up my code somehow, I downloaded the entire git repo, built the balloon dataset, and computed the mean and std over 50 samples' worth of deltas. Unsurprisingly, they came out to zero mean and the expected std. I then put in the ground truth boxes from a few of my samples, and it could not even find overlaps > 0.7.

I guess my question is what should the configurations for the deltas be for very small bounding boxes?

@gaklein2007 For very small bboxes you should change your anchor box scale to match the bboxes you would like to detect. I would do a histogram of the ground-truth bbox sizes to make sure the anchor scales are a good match for them.
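For example, such a histogram could be built with NumPy (the boxes here are made up, in [y1, x1, y2, x2] pixel form):

```python
import numpy as np

# Made-up ground-truth boxes in [y1, x1, y2, x2] pixel coordinates.
gt_boxes = np.array([[0, 0, 12, 18],
                     [5, 5, 9, 9],
                     [0, 0, 40, 30]], dtype=float)

# Characteristic size: sqrt(height * width), directly comparable
# to the anchor scales in the config.
sizes = np.sqrt((gt_boxes[:, 2] - gt_boxes[:, 0]) *
                (gt_boxes[:, 3] - gt_boxes[:, 1]))
counts, edges = np.histogram(sizes, bins=5)
print(sizes, counts, edges)
```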

I have been tweaking the scales but without much luck. I also changed a part of the code that had a TODO on it, which actually made things significantly worse. After changing it back to the original version and just padding the coordinates of the bounding boxes, I am testing to see whether that was why the Faster R-CNN part (not attempting segmentation yet) was failing. I believe that was the cause of the data issues.

In general, I would first feed the network just a few examples to learn. We expect the network to memorize them, so if you evaluate performance on the training set, you should get an almost perfect score. Otherwise the pipeline is broken, for example in the data loading part.

You mean pretraining it on some samples that have an IoU > 0.7 and then feeding in the rest of the data?

@gaklein2007 I am talking about feeding one or two images to the network first to make sure the network is able to overfit to your limited training examples.

waleedka's explanation above is helpful.
