ssd_keras: [question] What is the purpose of the variances?

Created on 10 Mar 2017 · 11 comments · Source: rykov8/ssd_keras

Hello @rykov8, sorry for bothering you again.
Looking at your ssd_training.py file, it seems that the variances do not take part in the training loss or in any other part of the training pipeline. However, in your ssd_utils.py, the detection_out method does change the predictions by multiplying them by their respective variances:

        decode_bbox_center_x = mbox_loc[:, 0] * prior_width * variances[:, 0]
        decode_bbox_center_x += prior_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * prior_height * variances[:, 1]
        decode_bbox_center_y += prior_center_y
        decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[:, 2])
        decode_bbox_width *= prior_width
        decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[:, 3])
        decode_bbox_height *= prior_height

I do understand that one has to decode the boxes, since they were encoded using the transformation described in equation 2 of the SSD paper (and in Faster R-CNN).
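
(For reference, here is a minimal numpy sketch of that encoding, i.e. the inverse of the decoding above; the box and prior values are made up, and the variance division at the end is the extra step this thread is about.)

    import numpy as np

    # Hypothetical ground-truth box and matched prior, both as (cx, cy, w, h).
    box = np.array([0.52, 0.48, 0.30, 0.40])
    prior = np.array([0.50, 0.50, 0.25, 0.35])
    variances = np.array([0.1, 0.1, 0.2, 0.2])  # per-coordinate coefficients

    # Equation 2 of the SSD paper (same parametrization as Faster R-CNN):
    g_hat = np.empty(4)
    g_hat[:2] = (box[:2] - prior[:2]) / prior[2:]  # center offset in prior units
    g_hat[2:] = np.log(box[2:] / prior[2:])        # log size ratio

    # The extra step done in encode_box: divide by the "variances".
    encoded = g_hat / variances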

My main concern is that the variances explicitly change the values already output by the CNN without being considered directly in the training procedure; furthermore, I cannot find any reference to these variances in either the SSD or the Faster R-CNN paper. Maybe I am missing something in the papers or in the implementation; in that case I would be very grateful if you could tell me whether I am making a mistake, or elaborate on the use of these variances.

Thank you very much.

Most helpful comment

@villanuevab here the author of the original paper comments on the variances. Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably for the same object in the same position, simply because human labellers cannot repeat themselves exactly. Thus, the encoded values are random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the particular values used in the code, I have no idea; probably some empirical estimation by the authors.

All 11 comments

They are taken into account in encode_box, which is called by assign_boxes, which is used to create the labeled data for training.

Edit: I'm pretty sure they're used for numerical conditioning of the problem; otherwise the localization offsets would be on a different scale than the classification outputs, which would slow down the optimization.
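
(A rough illustration of that point, with made-up numbers: the raw encoded offsets of a typical well-matched box are much smaller than 1, while the classification targets are 0/1; dividing by 0.1/0.2 brings the two closer to the same scale.)

    import numpy as np

    # Made-up raw encoded targets (g_hat) for a typical well-matched box.
    g_hat = np.array([0.04, -0.03, 0.10, -0.08])
    variances = np.array([0.1, 0.1, 0.2, 0.2])

    print(g_hat)              # [ 0.04 -0.03  0.1  -0.08] -> far below unit scale
    print(g_hat / variances)  # [ 0.4  -0.3   0.5  -0.4 ] -> closer to unit scale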

@tachim thank you very much! I did not see that they were being used in the encode_box method. Then everything makes sense and, as you said, they are probably used for numerical conditioning.

Hello @villanuevab,
As far as I understand, they scale the transformed box coordinates to make the learning task easier. They are not mentioned in the paper.

Please see the inline comments:

    # the following 2 lines => g_hat for cx, cy
    encoded_box[:, :2][assign_mask] = box_center - assigned_priors_center
    encoded_box[:, :2][assign_mask] /= assigned_priors_wh
    # here we divide by the "variance" of cx, cy of the prior boxes
    encoded_box[:, :2][assign_mask] /= assigned_priors[:, -4:-2]
    # the following line => g_hat for wh
    encoded_box[:, 2:4][assign_mask] = np.log(box_wh /
                                              assigned_priors_wh)
    # here we divide by the "variance" of wh of the prior boxes
    encoded_box[:, 2:4][assign_mask] /= assigned_priors[:, -2:]
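
(Putting the two snippets together, a self-contained round trip; this is my own sketch rather than the repo's exact code, with hypothetical (cx, cy, w, h) arrays. It shows that the decoding in detection_out exactly undoes this encoding, variances included.)

    import numpy as np

    def encode(box, prior, variances):
        """Encode a (cx, cy, w, h) box relative to a prior, then divide by variances."""
        g_hat = np.concatenate([(box[:2] - prior[:2]) / prior[2:],
                                np.log(box[2:] / prior[2:])])
        return g_hat / variances

    def decode(loc, prior, variances):
        """Invert encode(): multiply by variances, then undo the parametrization."""
        center = loc[:2] * variances[:2] * prior[2:] + prior[:2]
        size = np.exp(loc[2:] * variances[2:]) * prior[2:]
        return np.concatenate([center, size])

    box = np.array([0.52, 0.48, 0.30, 0.40])
    prior = np.array([0.50, 0.50, 0.25, 0.35])
    variances = np.array([0.1, 0.1, 0.2, 0.2])

    assert np.allclose(decode(encode(box, prior, variances), prior, variances), box)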

May I ask: What are these "variances" on a conceptual level and why are they used? Why not just use the \hat{g} definitions? Thanks to this thread, I understand that they are incorporated into the training pipeline, but on a conceptual level I cannot match these variances with:

  1. any concept outlined in the paper,
  2. any mathematical definition of variance that I know of.

How are these variances initialized? I.e., why are they set to [0.1] as a default in PriorBox?

I'd really appreciate some clarification here! Thank you.

@oarriaga I edited my question to be more precise; I think you addressed my confusion about why they are included in the code even though they are not in the paper. Thank you!

Can you elaborate on why they are called "variances" and how they are initialized?

I believe the term variances is misleading; they should be called something like box_scale_factors, or at least that's what I call them in my SSD implementation.

@villanuevab here the author of the original paper comments on the variances. Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably for the same object in the same position, simply because human labellers cannot repeat themselves exactly. Thus, the encoded values are random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the particular values used in the code, I have no idea; probably some empirical estimation by the authors.
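
(If the unit-variance reading is right, the values could in principle be estimated empirically; a minimal sketch of that idea, assuming you already have (N, 4) arrays of matched ground-truth boxes and priors in (cx, cy, w, h) form, both hypothetical inputs here.)

    import numpy as np

    def empirical_variances(boxes, priors):
        """Standard deviation of the raw encoded targets over a dataset;
        dividing by these would give the encoded values roughly unit variance."""
        g_hat = np.concatenate([(boxes[:, :2] - priors[:, :2]) / priors[:, 2:],
                                np.log(boxes[:, 2:] / priors[:, 2:])], axis=1)
        return g_hat.std(axis=0)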

Hi @rykov8,
when I evaluate the model and look at the numbers the network returns, I can see that the variances are always the same. Is that OK? If they are supposed to be constant, why not store them as default parameters instead of as network output?

When I test the assign_boxes function, I get negative coordinates for the assigned bounding boxes, which is caused by the encoding performed after the IoU step inside the encode_box function.

Variances are coefficients for encoding/decoding the locations of bounding boxes.
The first value is used to encode/decode the coordinates of the box centers.
The second value is used to encode/decode the sizes of the bounding boxes.
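
(In the four-per-coordinate form used by the snippets above, those two values are simply repeated per coordinate, e.g. giving the familiar [0.1, 0.1, 0.2, 0.2]; the exact default values are an assumption here.)

    variances = [0.1, 0.2]                    # centers, sizes (as described above)
    per_coord = [variances[0], variances[0],  # applied to encoded cx, cy offsets
                 variances[1], variances[1]]  # applied to encoded log(w), log(h)
    # per_coord == [0.1, 0.1, 0.2, 0.2]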

Why is the encoding/decoding needed? Is it to use fewer parameters while training (2 instead of 4)? For faster optimization? Fewer calculations?

In my opinion, most implementations of bounding box encoding/decoding with variances are conceptually incorrect. Please check my post at https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/ and let me know what you think. Thank you.

Thank you, @leimao. Very helpful blog post!

