ssd_keras: [question] What is the purpose of the variances?

Created on 10 Mar 2017 · 11 comments · Source: rykov8/ssd_keras

Hello @rykov8, sorry for bothering you again.
Looking at your ssd_training.py file, it seems that the variances do not take part in the training loss or in any other part of the training pipeline. However, in your ssd_utils.py, the detection_out method does change the predictions by multiplying them by their respective variances:

        decode_bbox_center_x = mbox_loc[:, 0] * prior_width * variances[:, 0]
        decode_bbox_center_x += prior_center_x
        decode_bbox_center_y = mbox_loc[:, 1] * prior_height * variances[:, 1]
        decode_bbox_center_y += prior_center_y
        decode_bbox_width = np.exp(mbox_loc[:, 2] * variances[:, 2])
        decode_bbox_width *= prior_width
        decode_bbox_height = np.exp(mbox_loc[:, 3] * variances[:, 3])
        decode_bbox_height *= prior_height

I do understand that one has to decode the boxes, since they were encoded using the transformation described in equation 2 of the SSD paper (and in Faster R-CNN).
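
(For reference, here is a minimal numpy sketch of that encoding, i.e. the inverse of the decoding above; the box and prior values are made up, and the variance division at the end is the extra step this thread is about.)

    import numpy as np

    # Hypothetical ground-truth box and matched prior, both as (cx, cy, w, h).
    box = np.array([0.52, 0.48, 0.30, 0.40])
    prior = np.array([0.50, 0.50, 0.25, 0.35])
    variances = np.array([0.1, 0.1, 0.2, 0.2])  # per-coordinate coefficients

    # Equation 2 of the SSD paper (same parametrization as Faster R-CNN):
    g_hat = np.empty(4)
    g_hat[:2] = (box[:2] - prior[:2]) / prior[2:]  # center offset in prior units
    g_hat[2:] = np.log(box[2:] / prior[2:])        # log size ratio

    # The extra step done in encode_box: divide by the "variances".
    encoded = g_hat / variances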

My main concern is that the variances explicitly change the values already output by the CNN without being considered directly in the training procedure; furthermore, I cannot find any reference to these variances in either the SSD or the Faster R-CNN paper. Maybe I am missing something in the papers or in the implementation; in that case I would be very grateful if you could tell me whether I am making a mistake, or elaborate on the use of these variances.

Thank you very much.

Most helpful comment

@villanuevab here the author of the original paper comments on the variances. Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably for the same object in the same position, simply because human labellers cannot repeat themselves exactly. Thus, the encoded values are random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the particular values used in the code, I have no idea; probably some empirical estimation by the authors.

All 11 comments

They are taken into account in encode_box, which is called by assign_boxes, which is used to create the labeled data for training.

Edit: I'm pretty sure they're used for numerical conditioning of the problem; otherwise the localization offsets would be on a different scale than the classification outputs, which would slow down the optimization.
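
(A rough illustration of that point, with made-up numbers: the raw encoded offsets of a typical well-matched box are much smaller than 1, while the classification targets are 0/1; dividing by 0.1/0.2 brings the two closer to the same scale.)

    import numpy as np

    # Made-up raw encoded targets (g_hat) for a typical well-matched box.
    g_hat = np.array([0.04, -0.03, 0.10, -0.08])
    variances = np.array([0.1, 0.1, 0.2, 0.2])

    print(g_hat)              # [ 0.04 -0.03  0.1  -0.08] -> far below unit scale
    print(g_hat / variances)  # [ 0.4  -0.3   0.5  -0.4 ] -> closer to unit scale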

@tachim thank you very much! I did not see that they were being used in the encode_box method. Then everything makes sense and, as you said, they are probably used for numerical conditioning.

Hello @villanuevab,
As far as I understand, they scale the transformed box coordinates to make the learning task easier. They are not mentioned in the paper.

Please see the inline comments:

    # the following 2 lines => g_hat for cx, cy
    encoded_box[:, :2][assign_mask] = box_center - assigned_priors_center
    encoded_box[:, :2][assign_mask] /= assigned_priors_wh
    # here we divide by the "variance" of cx, cy of the prior boxes
    encoded_box[:, :2][assign_mask] /= assigned_priors[:, -4:-2]
    # the following line => g_hat for wh
    encoded_box[:, 2:4][assign_mask] = np.log(box_wh /
                                              assigned_priors_wh)
    # here we divide by the "variance" of wh of the prior boxes
    encoded_box[:, 2:4][assign_mask] /= assigned_priors[:, -2:]
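
(Putting the two snippets together, a self-contained round trip; this is my own sketch rather than the repo's exact code, with hypothetical (cx, cy, w, h) arrays. It shows that the decoding in detection_out exactly undoes this encoding, variances included.)

    import numpy as np

    def encode(box, prior, variances):
        """Encode a (cx, cy, w, h) box relative to a prior, then divide by variances."""
        g_hat = np.concatenate([(box[:2] - prior[:2]) / prior[2:],
                                np.log(box[2:] / prior[2:])])
        return g_hat / variances

    def decode(loc, prior, variances):
        """Invert encode(): multiply by variances, then undo the parametrization."""
        center = loc[:2] * variances[:2] * prior[2:] + prior[:2]
        size = np.exp(loc[2:] * variances[2:]) * prior[2:]
        return np.concatenate([center, size])

    box = np.array([0.52, 0.48, 0.30, 0.40])
    prior = np.array([0.50, 0.50, 0.25, 0.35])
    variances = np.array([0.1, 0.1, 0.2, 0.2])

    assert np.allclose(decode(encode(box, prior, variances), prior, variances), box)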

May I ask: What are these "variances" on a conceptual level and why are they used? Why not just use the \hat{g} definitions? Thanks to this thread, I understand that they are incorporated into the training pipeline, but on a conceptual level I cannot match these variances with:

  1. any concept outlined in the paper,
  2. any mathematical definition of variance that I know of.

How are these variances initialized? I.e., why are they set to [0.1] as a default in PriorBox?

I'd really appreciate some clarification here! Thank you.

@oarriaga I edited my question to be more precise; I think you addressed my confusion about why they are included in the code even though they are not in the paper. Thank you!

Can you elaborate on why they are called "variances" and how they are initialized?

I believe the term variances is misleading; they should be called something like box_scale_factors, or at least that's what I call them in my SSD implementation.

@villanuevab here the author of the original paper comments on the variances. Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably for the same object in the same position, simply because human labellers cannot repeat themselves exactly. Thus, the encoded values are random values, and we want them to have unit variance, which is why we divide by some value. Why they are initialized to the particular values used in the code, I have no idea; probably some empirical estimation by the authors.
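
(If the unit-variance reading is right, the values could in principle be estimated empirically; a minimal sketch of that idea, assuming you already have (N, 4) arrays of matched ground-truth boxes and priors in (cx, cy, w, h) form, both hypothetical inputs here.)

    import numpy as np

    def empirical_variances(boxes, priors):
        """Standard deviation of the raw encoded targets over a dataset;
        dividing by these would give the encoded values roughly unit variance."""
        g_hat = np.concatenate([(boxes[:, :2] - priors[:, :2]) / priors[:, 2:],
                                np.log(boxes[:, 2:] / priors[:, 2:])], axis=1)
        return g_hat.std(axis=0)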

Hi @rykov8,
when I evaluate the model and look at the numbers the network returns, I can see that the variances are always the same. Is that OK? If they are supposed to be constant, why not store them as default parameters instead of as network output?

When I test the assign_boxes function, I get negative coordinates for the assigned bounding boxes, which is caused by the encoding performed after the IoU step inside the encode_box function.

Variances are coefficients for encoding/decoding the locations of bounding boxes.
The first value is used to encode/decode the coordinates of the box centers.
The second value is used to encode/decode the sizes of the bounding boxes.
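
(In the four-per-coordinate form used by the snippets above, those two values are simply repeated per coordinate, e.g. giving the familiar [0.1, 0.1, 0.2, 0.2]; the exact default values are an assumption here.)

    variances = [0.1, 0.2]                    # centers, sizes (as described above)
    per_coord = [variances[0], variances[0],  # applied to encoded cx, cy offsets
                 variances[1], variances[1]]  # applied to encoded log(w), log(h)
    # per_coord == [0.1, 0.1, 0.2, 0.2]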

Why is the encoding/decoding needed? Is it to use fewer parameters while training (2 instead of 4)? For faster optimization? Fewer calculations?

In my opinion, most implementations of bounding box encoding/decoding with variances are conceptually incorrect. Please check my post at https://leimao.github.io/blog/Bounding-Box-Encoding-Decoding/ and let me know what you think. Thank you.

Thank you, @leimao. Very helpful blog post!

