From the paper:

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is _t̂*_ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: _t̂* − t*_. This ground truth value can be easily computed by inverting the equations above.
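For reference, the box equations from the paper are bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th. Inverting them gives tx = σ⁻¹(bx − cx), ty = σ⁻¹(by − cy), tw = log(bw / pw), th = log(bh / ph), where σ⁻¹(p) = log(p / (1 − p)) is the logit (the inverse sigmoid).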
I see this 'inverting of the equations' applied when calculating the WH loss delta in many repositories, because, obviously, the log function is used to get back to _tw_ and _th_.
But I can't see the logit function being used to get back to _tx_ and _ty_.
What is this _tx_ here?
float tx = (truth.x*lw - i);
Is it _tx_ or sigmoid(_tx_)?
It is even funnier in qqwweee's repository, where he apparently uses the _xy_ ground truth (which is on the sigmoid scale) together with the raw _xy_ net output in binary crossentropy.
Am I missing something?
I also found #2733 where this issue was elaborated on and buried with no satisfactory answer.
I think a 'delta' of the form x − sigmoid(x) can't converge to anything meaningful. And I also find it troubling that such issues, sitting at the very heart of the Yolov3 repositories, attract no interest from anyone.
I tried to implement the logit function to get a (t̂* − t*) delta in qqwweee's repository, and at first glance the loss became smaller. I'll investigate and report.
@tabmoo Hi,
I tried to implement the logit function to get a (t̂* − t*) delta in qqwweee's repository, and at first glance the loss became smaller. I'll investigate and report.
Can you give a URL?
What changes have you made?
Did you try to train with this code and get a higher final mAP?
delta[index + 0*stride] = scale * (tx - x[index + 0*stride]) * (1 - x[index + 0*stride]) * x[index + 0*stride];
delta[index + 1*stride] = scale * (ty - x[index + 1*stride])* (1 - x[index + 1*stride]) * x[index + 1*stride];
Hi, @AlexeyAB!
I changed the `raw_true_xy` calculation in qqwweee's loss code; the relevant part now looks like this:
for l in range(num_layers):
    object_mask = y_true[l][..., 4:5]
    true_class_probs = y_true[l][..., 5:]

    grid, raw_pred, pred_xy, pred_wh = yolo_head(yolo_outputs[l],
        anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)
    pred_box = K.concatenate([pred_xy, pred_wh])

    # Darknet raw box to calculate loss.
    # -K.log(1/p - 1) is the logit, log(p / (1 - p)), of the cell-relative offset
    raw_true_xy = -K.log((1. / (y_true[l][..., :2]*grid_shapes[l][::-1] - grid)) - 1.)
    # zero out infs/NaNs (cells with no object) and keep the finite values
    raw_true_xy = K.switch(tf.is_inf(raw_true_xy), K.zeros_like(raw_true_xy), raw_true_xy)
    raw_true_xy = K.switch(tf.is_nan(raw_true_xy), K.zeros_like(raw_true_xy), raw_true_xy)
    raw_true_wh = K.log(y_true[l][..., 2:4] / anchors[anchor_mask[l]] * input_shape[::-1] + 1e-10)
    raw_true_wh = K.switch(tf.is_inf(raw_true_wh), K.zeros_like(raw_true_wh), raw_true_wh)
    raw_true_wh = K.switch(tf.is_nan(raw_true_wh), K.zeros_like(raw_true_wh), raw_true_wh)
    box_loss_scale = 2 - y_true[l][..., 2:3]*y_true[l][..., 3:4]
My change is the line with `-K.log`; the other changes are tweaks by others from that repo to avoid infs and NaNs when no data augmentation is used.
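As a quick standalone sanity check (plain NumPy, not code from the repo), the `-K.log(1/p - 1)` form above is just the logit, log(p / (1 - p)), and it inverts the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # same form as the -K.log((1. / p) - 1.) line above
    return -np.log(1.0 / p - 1.0)

t_x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # pretend raw network outputs
p = sigmoid(t_x)                             # cell-relative offsets in (0, 1)
print(np.allclose(logit(p), t_x))            # True: logit undoes the sigmoid
```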
I'm not a programmer and have only fairly limited experience with Python, so I don't use and can't comment on your repository (for now, at least). My questions are purely from a math standpoint. Also, this repo is active and advanced (and you're Russian, as am I), in contrast with the Python repos, which are 'dead'.
I can try to change the above Python code to implement your interesting delta (what is the math basis for it, by the way?). For that, though, I need to know whether tx is a sigmoid value (as I understand, it is) and whether 'x' is a sigmoid value.
@tabmoo
I can show you specific changes in the code; you can try them, train your neural network with each, and compare these 3 options by the final mAP accuracy (not by the loss).
See below: add one of these changes, recompile, and train your model: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
I can try to change the above Python code to implement your interesting delta (what is the math basis for it, by the way?).
This is just a suggestion for improvement, from the topic to which you referred: https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-485242183
For that, though, I need to know whether tx is a sigmoid value (as I understand, it is) and whether 'x' is a sigmoid value.
tx is just the truth (no sigmoid activation is applied to it); x[] is the input of the yolo-layer, which has been activated using the sigmoid.

Also, we should use 1 of 2 ways:

1. Logit, log(p / (1 - p)) (the inverse sigmoid function): https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-501391811

Since we use
float tw = log(truth.w*w / biases[2 * n]);
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
instead of
x[index + 2 * stride] = exp(x[index + 2*stride]) * biases[2*n] / w;
float tw = truth.w;
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
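That is, for w/h the ground truth is mapped into the network's raw (log) space rather than mapping the raw output back into box space. A quick standalone sketch with made-up numbers (`anchor_w` and `truth_w_px` are stand-ins for `biases[2*n]` and `truth.w*w`):

```python
import math

anchor_w = 116.0                       # stand-in for biases[2*n], anchor width
truth_w_px = 150.0                     # stand-in for truth.w*w, ground-truth width

t_w = math.log(truth_w_px / anchor_w)  # same mapping as: tw = log(truth.w*w / biases[2*n])
print(anchor_w * math.exp(t_w))        # 150.0 - exp plus anchor scaling recovers the truth width
```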
Then maybe we should use the logit function: https://en.wikipedia.org/wiki/Logit
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
tx = log(tx / (1 - tx)); // Logit = (log(p / (1 - p)))
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
instead of
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
So the final code, instead of https://github.com/AlexeyAB/darknet/blob/b8605bda1e1665eaa7649ac944efb57f38a2e2dd/src/yolo_layer.c#L145, will be:
if (iou_loss == MSE) // old loss
{
float tx = (truth.x*lw - i);
tx = log(tx / (1 - tx)); // Logit = (log(p / (1 - p)))
float ty = (truth.y*lh - j);
ty = log(ty / (1 - ty)); // Logit = (log(p / (1 - p)))
float tw = log(truth.w*w / biases[2 * n]);
float th = log(truth.h*h / biases[2 * n + 1]);
// accumulate delta
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
delta[index + 1 * stride] += scale * (ty - x[index + 1 * stride]) * iou_normalizer;
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
delta[index + 3 * stride] += scale * (th - x[index + 3 * stride]) * iou_normalizer;
}
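Just to make the logit line concrete, here is a standalone sketch with made-up numbers (not darknet code):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

truth_x, lw, i = 0.36, 13, 4          # made-up: normalized center x, grid width, cell column
offset = truth_x * lw - i             # 0.68, same as: float tx = (truth.x*lw - i);
t_x = logit(offset)                   # ~0.754, raw-scale target, as in the code above
print(1.0 / (1.0 + math.exp(-t_x)))   # 0.68 - the sigmoid maps it back to the cell offset
```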
2. (1-x)*x: https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-485242183
Maybe this is related to "we must use chain rule to multiply the derivative of the inner function by the outer": https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60 and https://en.wikipedia.org/wiki/Chain_rule and https://hmkcode.com/ai/backpropagation-step-by-step/ (see the quick numerical check after the final code block below).
Then maybe we should use the derivative of the logistic function: https://en.wikipedia.org/wiki/Logistic_function#Derivative
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer *
(1 - x[index + 0 * stride]) * x[index + 0 * stride];
instead of
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
So the final code, instead of https://github.com/AlexeyAB/darknet/blob/b8605bda1e1665eaa7649ac944efb57f38a2e2dd/src/yolo_layer.c#L145, will be:
if (iou_loss == MSE) // old loss
{
float tx = (truth.x*lw - i);
float ty = (truth.y*lh - j);
float tw = log(truth.w*w / biases[2 * n]);
float th = log(truth.h*h / biases[2 * n + 1]);
// accumulate delta
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer *
(1 - x[index + 0 * stride]) * x[index + 0 * stride];
delta[index + 1 * stride] += scale * (ty - x[index + 1 * stride]) * iou_normalizer *
(1 - x[index + 1 * stride]) * x[index + 1 * stride];
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
delta[index + 3 * stride] += scale * (th - x[index + 3 * stride]) * iou_normalizer;
}
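For what it's worth, the extra `(1 - x[...]) * x[...]` factor in this second option is the derivative of the logistic function, σ'(z) = σ(z)·(1 − σ(z)), which is what the chain-rule links above point to. A quick standalone numerical check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.3
s = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central-difference derivative
print(numeric, s * (1 - s))  # both ~0.2445: sigma'(z) = sigma(z) * (1 - sigma(z))
```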
Thanks! I'll look into it.
@tabmoo Did you try these ways?
@AlexeyAB
Hi!
I haven't tried your code yet.
As for Python: after reviewing qqwweee's code I understand it much better now and found that it is correct (and my changes aren't). Sigmoid-scale values are used (and should be used) as the arguments of his binary crossentropy code. As I understand it now, my suggestion of using logits in the xy delta is applicable to MSE loss only.
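For context, here is a standalone sketch (not code from the repo) of binary cross-entropy computed from logits (similar to what Keras's `from_logits=True` option does): the sigmoid is applied to the raw prediction inside the loss, so pairing a sigmoid-scale target with the raw prediction is consistent, and the loss is minimized where sigmoid(raw_pred) equals the target.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(target, raw_pred):
    # binary cross-entropy with the sigmoid applied internally to the raw prediction
    p = sigmoid(raw_pred)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

target = 0.68                               # sigmoid-scale ground-truth offset in (0, 1)
for raw_pred in (0.0, 0.4, np.log(0.68 / 0.32), 1.5):
    print(raw_pred, bce_from_logits(target, raw_pred))
# the loss is smallest at raw_pred == logit(0.68) ~= 0.754
```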
@tabmoo
As for Python: after reviewing qqwweee's code I understand it much better now and found that it is correct.
Better than what? Did you compare the mAP?
As I understand it now, my suggestion of using logits in the xy delta is applicable to MSE loss only.
What do you mean? Where is this not applicable?