From the paper:

During training we use sum of squared error loss. If the ground truth for some coordinate prediction is _t̂*_ our gradient is the ground truth value (computed from the ground truth box) minus our prediction: _t̂* − t*_. This ground truth value can be easily computed by inverting the equations above.
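For reference, the box equations from the paper are bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw·e^tw, bh = ph·e^th. Inverting them gives tx = σ⁻¹(bx − cx), ty = σ⁻¹(by − cy), tw = log(bw / pw), th = log(bh / ph), where σ⁻¹(p) = log(p / (1 − p)) is the logit (the inverse sigmoid).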
I see this 'inverting of the equations' applied when calculating the WH loss delta in many repositories, because, obviously, the log function is used to get back to _tw_ and _th_.
But I can't see the logit function being used to get back to _tx_ and _ty_.
What is this _tx_ here?
float tx = (truth.x*lw - i);
Is it _tx_ or sigmoid(_tx_)?
It is even funnier in qqwweee's repository, where he apparently uses the _xy_ ground truth (which is on the sigmoid scale) together with the raw _xy_ net output in binary crossentropy.
Am I missing something?
I also found #2733 where this issue was elaborated on and buried with no satisfactory answer.
I think a 'delta' of the form x − sigmoid(x) can't converge to anything meaningful. And I also find it troubling that such issues, sitting at the very heart of the Yolov3 repositories, attract no interest from anyone.
I tried to implement the logit function to get a (t̂* − t*) delta in qqwweee's repository, and at first glance the loss became smaller. I'll investigate and report.
@tabmoo Hi,
I tried to implement the logit function to get a (t̂* − t*) delta in qqwweee's repository, and at first glance the loss became smaller. I'll investigate and report.
Can you give a URL?
What changes have you made?
Did you try to train with this code and get a higher final mAP?
delta[index + 0*stride] = scale * (tx - x[index + 0*stride]) * (1 - x[index + 0*stride]) * x[index + 0*stride];
delta[index + 1*stride] = scale * (ty - x[index + 1*stride])* (1 - x[index + 1*stride]) * x[index + 1*stride];
Hi, @AlexeyAB!
I changed the `raw_true_xy` calculation in qqwweee's loss code; the relevant part now looks like this:
for l in range(num_layers):
    object_mask = y_true[l][..., 4:5]
    true_class_probs = y_true[l][..., 5:]

    grid, raw_pred, pred_xy, pred_wh = yolo_head(yolo_outputs[l],
        anchors[anchor_mask[l]], num_classes, input_shape, calc_loss=True)
    pred_box = K.concatenate([pred_xy, pred_wh])

    # Darknet raw box to calculate loss.
    # -K.log(1/p - 1) is the logit, log(p / (1 - p)), of the cell-relative offset
    raw_true_xy = -K.log((1. / (y_true[l][..., :2]*grid_shapes[l][::-1] - grid)) - 1.)
    # zero out infs/NaNs (cells with no object) and keep the finite values
    raw_true_xy = K.switch(tf.is_inf(raw_true_xy), K.zeros_like(raw_true_xy), raw_true_xy)
    raw_true_xy = K.switch(tf.is_nan(raw_true_xy), K.zeros_like(raw_true_xy), raw_true_xy)
    raw_true_wh = K.log(y_true[l][..., 2:4] / anchors[anchor_mask[l]] * input_shape[::-1] + 1e-10)
    raw_true_wh = K.switch(tf.is_inf(raw_true_wh), K.zeros_like(raw_true_wh), raw_true_wh)
    raw_true_wh = K.switch(tf.is_nan(raw_true_wh), K.zeros_like(raw_true_wh), raw_true_wh)
    box_loss_scale = 2 - y_true[l][..., 2:3]*y_true[l][..., 3:4]
My change is the line with `-K.log`; the other changes are tweaks by others from that repo to avoid infs and NaNs when no data augmentation is used.
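As a quick standalone sanity check (plain NumPy, not code from the repo), the `-K.log(1/p - 1)` form above is just the logit, log(p / (1 - p)), and it inverts the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    # same form as the -K.log((1. / p) - 1.) line above
    return -np.log(1.0 / p - 1.0)

t_x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # pretend raw network outputs
p = sigmoid(t_x)                             # cell-relative offsets in (0, 1)
print(np.allclose(logit(p), t_x))            # True: logit undoes the sigmoid
```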
I'm not a programmer and have only fairly limited experience with Python, so I don't use and can't comment on your repository (for now, at least). My questions are purely from a math standpoint. Also, this repo is active and advanced (and you're Russian, as am I), in contrast with the Python repos, which are 'dead'.
I can try to change the above Python code to implement your interesting delta (what is the math basis for it, by the way?). For that, though, I need to know whether tx is a sigmoid value (as I understand, it is) and whether 'x' is a sigmoid value.
@tabmoo
I can show you specific changes in the code; you can try them, train your neural network with each, and compare these 3 options by the final mAP accuracy (not by the loss).
See below: add one of these changes, recompile, and train your model: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
I can try to change the above Python code to implement your interesting delta (what is the math basis for it, by the way?).
This is just a suggestion for improvement, from the topic to which you referred: https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-485242183
For that, though, I need to know whether tx is a sigmoid value (as I understand, it is) and whether 'x' is a sigmoid value.
tx is just the truth (no sigmoid activation is applied to it); x[] is the input of the yolo-layer, which has been activated using the sigmoid.

Also, we should use 1 of 2 ways:

1. Logit, log(p / (1 - p)) (the inverse sigmoid function): https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-501391811

Since we use
float tw = log(truth.w*w / biases[2 * n]);
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
instead of
x[index + 2 * stride] = exp(x[index + 2*stride]) * biases[2*n] / w;
float tw = truth.w;
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
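That is, for w/h the ground truth is mapped into the network's raw (log) space rather than mapping the raw output back into box space. A quick standalone sketch with made-up numbers (`anchor_w` and `truth_w_px` are stand-ins for `biases[2*n]` and `truth.w*w`):

```python
import math

anchor_w = 116.0                       # stand-in for biases[2*n], anchor width
truth_w_px = 150.0                     # stand-in for truth.w*w, ground-truth width

t_w = math.log(truth_w_px / anchor_w)  # same mapping as: tw = log(truth.w*w / biases[2*n])
print(anchor_w * math.exp(t_w))        # 150.0 - exp plus anchor scaling recovers the truth width
```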
Then maybe we should use the logit function: https://en.wikipedia.org/wiki/Logit
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
tx = log(tx / (1 - tx)); // Logit = (log(p / (1 - p)))
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
instead of
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
So the final code, instead of https://github.com/AlexeyAB/darknet/blob/b8605bda1e1665eaa7649ac944efb57f38a2e2dd/src/yolo_layer.c#L145, will be:
if (iou_loss == MSE) // old loss
{
float tx = (truth.x*lw - i);
tx = log(tx / (1 - tx)); // Logit = (log(p / (1 - p)))
float ty = (truth.y*lh - j);
ty = log(ty / (1 - ty)); // Logit = (log(p / (1 - p)))
float tw = log(truth.w*w / biases[2 * n]);
float th = log(truth.h*h / biases[2 * n + 1]);
// accumulate delta
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
delta[index + 1 * stride] += scale * (ty - x[index + 1 * stride]) * iou_normalizer;
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
delta[index + 3 * stride] += scale * (th - x[index + 3 * stride]) * iou_normalizer;
}
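Just to make the logit line concrete, here is a standalone sketch with made-up numbers (not darknet code):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

truth_x, lw, i = 0.36, 13, 4          # made-up: normalized center x, grid width, cell column
offset = truth_x * lw - i             # 0.68, same as: float tx = (truth.x*lw - i);
t_x = logit(offset)                   # ~0.754, raw-scale target, as in the code above
print(1.0 / (1.0 + math.exp(-t_x)))   # 0.68 - the sigmoid maps it back to the cell offset
```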
2. (1-x)*x: https://github.com/AlexeyAB/darknet/issues/2733#issuecomment-485242183
Maybe this is related to "we must use chain rule to multiply the derivative of the inner function by the outer": https://medium.com/@pdquant/all-the-backpropagation-derivatives-d5275f727f60 and https://en.wikipedia.org/wiki/Chain_rule and https://hmkcode.com/ai/backpropagation-step-by-step/ (see the quick numerical check after the final code block below).
Then maybe we should use the derivative of the logistic function: https://en.wikipedia.org/wiki/Logistic_function#Derivative
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer *
(1 - x[index + 0 * stride]) * x[index + 0 * stride];
instead of
x[index + 0 * stride] = logistic( x[index + 0 * stride] );
float tx = (truth.x*lw - i);
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer;
So the final code, instead of https://github.com/AlexeyAB/darknet/blob/b8605bda1e1665eaa7649ac944efb57f38a2e2dd/src/yolo_layer.c#L145, will be:
if (iou_loss == MSE) // old loss
{
float tx = (truth.x*lw - i);
float ty = (truth.y*lh - j);
float tw = log(truth.w*w / biases[2 * n]);
float th = log(truth.h*h / biases[2 * n + 1]);
// accumulate delta
delta[index + 0 * stride] += scale * (tx - x[index + 0 * stride]) * iou_normalizer *
(1 - x[index + 0 * stride]) * x[index + 0 * stride];
delta[index + 1 * stride] += scale * (ty - x[index + 1 * stride]) * iou_normalizer *
(1 - x[index + 1 * stride]) * x[index + 1 * stride];
delta[index + 2 * stride] += scale * (tw - x[index + 2 * stride]) * iou_normalizer;
delta[index + 3 * stride] += scale * (th - x[index + 3 * stride]) * iou_normalizer;
}
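For what it's worth, the extra `(1 - x[...]) * x[...]` factor in this second option is the derivative of the logistic function, σ'(z) = σ(z)·(1 − σ(z)), which is what the chain-rule links above point to. A quick standalone numerical check:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z = 0.3
s = sigmoid(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central-difference derivative
print(numeric, s * (1 - s))  # both ~0.2445: sigma'(z) = sigma(z) * (1 - sigma(z))
```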
Thanks! I'll look into it.
@tabmoo Did you try these ways?
@AlexeyAB
Hi!
I haven't tried your code yet.
As for Python: after reviewing qqwweee's code I understand it much better now and found that it is correct (and my changes aren't). Sigmoid-scale values are used (and should be used) as the arguments of his binary crossentropy code. As I understand it now, my suggestion of using logits in the xy delta is applicable to MSE loss only.
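For context, here is a standalone sketch (not code from the repo) of binary cross-entropy computed from logits (similar to what Keras's `from_logits=True` option does): the sigmoid is applied to the raw prediction inside the loss, so pairing a sigmoid-scale target with the raw prediction is consistent, and the loss is minimized where sigmoid(raw_pred) equals the target.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logits(target, raw_pred):
    # binary cross-entropy with the sigmoid applied internally to the raw prediction
    p = sigmoid(raw_pred)
    return -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

target = 0.68                               # sigmoid-scale ground-truth offset in (0, 1)
for raw_pred in (0.0, 0.4, np.log(0.68 / 0.32), 1.5):
    print(raw_pred, bce_from_logits(target, raw_pred))
# the loss is smallest at raw_pred == logit(0.68) ~= 0.754
```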
@tabmoo
As for Python: after reviewing qqwweee's code I understand it much better now and found that it is correct.
Better than what? Did you compare the mAP?
As I understand it now, my suggestion of using logits in the xy delta is applicable to MSE loss only.
What do you mean? Where is this not applicable?