@AlexeyAB Thanks for your support.
I have a few questions about yolo_layer.c
[1]. In yolo_layer.c, I know that "delta" means the gradient.
However, as shown below, delta is computed as a simple difference.
Isn't this the gradient of an MSE loss?
In the YOLOv3 paper, the author mentions the following:
"During training we use binary cross-entropy loss for the class predictions."
Why does the class delta above correspond to binary cross-entropy?
[2]. Is the following "l.cost" used for back-propagation? Or is it simply a printed value?
[3]. I want YOLOv3 to output additional information, so I am trying to modify the loss function. In this case, should I fill the "delta" of yolo_layer.c with the gradient of the desired loss function, such as log-likelihood or binary cross-entropy?
Besides this, is there anything else to consider?
I'm sorry to ask questions not strictly related to the code, but I'm a beginner and would appreciate your advice.
Thank you very much.
@doobidoob Hi,
In general, there are two types of classification:
multi-label classification - each bounding box (each anchor) can have several classes, and in total the model has >= 1 classes. Binary cross-entropy with logistic activation (sigmoid) is used. This is what Yolo v3 uses.
multi-class classification - each bounding box (each anchor) can have only one class, and in total the model has >= 1 classes. Categorical cross-entropy with softmax activation is used. This is what Yolo v2 uses.
For multi-label classification, binary cross-entropy is used:
delta = (n == class_id) ? (1 - logistic_activation(x)) : (-logistic_activation(x)) ;
For multi-class classification, categorical cross-entropy is used:
delta = (n == class_id) ? (1 - softmax(x, x_array)) : (-softmax(x, x_array)) ;
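For reference, a simplified sketch of how the multi-label delta is filled inside yolo_layer.c (modeled on darknet's delta_yolo_class; note that output[] already holds the sigmoid-activated scores):

```c
// Simplified sketch of the class delta in yolo_layer.c (multi-label case).
// output[] already contains logistic-activated class scores;
// delta holds the negative gradient that backprop will apply.
void delta_class_sketch(float *output, float *delta, int index,
                        int class_id, int classes, int stride)
{
    for (int n = 0; n < classes; ++n) {
        // 1 - p for the true class, 0 - p for every other class
        delta[index + stride*n] = ((n == class_id) ? 1.f : 0.f)
                                  - output[index + stride*n];
    }
}
```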
This *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2); is used only as a printed value (the avg loss). It is the summary loss over (x,y,w,h,t0,probabilities...) for all anchors and all final activations.
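To see why that expression is just the sum of squared deltas: mag_array computes the Euclidean norm of the array, so squaring it recovers the sum of squares (a sketch based on darknet's utils.c):

```c
#include <math.h>

// mag_array as in darknet's utils.c: Euclidean norm of an array
float mag_array(float *a, int n)
{
    float sum = 0;
    for (int i = 0; i < n; ++i) sum += a[i] * a[i];
    return sqrtf(sum);
}

// hence the printed cost is simply:
// *(l.cost) = pow(mag_array(l.delta, l.outputs*l.batch), 2) = sum_i delta[i]^2
```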
If you want to change the loss function to get a different result during training, then you should change l.delta = ...
@AlexeyAB Thanks for your reply!
I have a few questions about your reply.
[1]. When using binary cross-entropy, why should "(1-logistic_activation(x))" or "(-logistic_activation(x))" be applied to the delta?
[2]. Why is 1 subtracted when n is class_id?
[3]. Why is "(-logistic_activation(x))" when n is not class_id?
[4]. And why not use "logistic_gradient(x)"?
I know that "delta" means the gradient...
Am I misunderstanding?
I want to change the loss function, but it is not easy to apply it to the code...
Probably because I have not fully understood it.
[5]. I want to use a negative log-likelihood for an additional output besides the original YOLO outputs. What should I do with "delta"?
Thanks in advance for your advice.
@doobidoob
You can read about it here:
The binary cross-entropy loss is −(t*ln(y) + (1−t)*ln(1−y)) - we should minimize it.
Its derivative with respect to the pre-sigmoid input z (where y = sigmoid(z)) is loss_derivative = y − t: https://peterroelants.github.io/posts/cross-entropy-logistic/
y - probability [0 - 1]
t - is the class correct: 1 or not: 0
We do it here: https://github.com/AlexeyAB/darknet/blob/527578744b46666fb5cd42393bf9e1fa9af126ee/src/yolo_layer.c#L143
The same:
t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y−t) = 1−y
t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y−t) = −y
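A one-step chain-rule check of why the derivative with respect to the logit collapses to y − t (standard calculus, not specific to darknet):

```latex
y = \sigma(z), \quad L = -\bigl(t\ln y + (1-t)\ln(1-y)\bigr)
\frac{\partial L}{\partial y} = \frac{y-t}{y(1-y)}, \qquad
\frac{\partial y}{\partial z} = y(1-y)
\quad\Rightarrow\quad
\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z} = y - t
```

This also answers question [4] above: the logistic_gradient factor y(1−y) cancels, so the code can write y − t directly instead of multiplying by logistic_gradient(x).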
Free-form reasoning - in general, in the Yolo v3:
loss = −(y*ln(p) + (1−y)*ln(1−p)) and we should minimize it
p - probability [0 - 1]
y - is the class correct: 1 or not: 0
loss = −ln(p) if (y==1), or loss = −ln(1−p) if (y==0), thus:
if (y==1) then we need -ln(p) = 0, i.e. p = 1
if (y==0) then we need -ln(1−p) = 0, i.e. p = 0
so we can achieve this by maximizing p if (y==1), or maximizing 1−p if (y==0),
if (y==1) then we should maximize logistic_activation(x + delta), so delta > 0
if (y==0) then we should minimize logistic_activation(x + delta), so delta < 0
we do it here: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L120
The same:
if (y==1): delta = 1−p > 0
if (y==0): delta = −p < 0
where p = logistic_activation(x) = output[index + stride*n]
In general, there are two types of classification:
multi-label classification - each bounding box (each anchor) can have several classes, and in total the model has >= 1 classes. Binary cross-entropy with logistic activation (sigmoid) is used. This is what Yolo v3 uses.
multi-class classification - each bounding box (each anchor) can have only one class, and in total the model has >= 1 classes. Categorical cross-entropy with softmax activation is used. This is what Yolo v2 uses.
Binary cross-entropy with logistic activation (sigmoid) is used for multi-label classification in Yolo v3, so each bounding box (each anchor) can have several classes. For example, one bounding box can be Animal, Cat or Truck, Car. Or even Cat, Dog if they are close to each other.
So:
Logistic activation (sigmoid) = 1./(1. + exp(-x)) is used because:
several class scores can each be ~= 1 at the same time - so we can detect Cat, Dog in one box (multi-label)
it is nonlinear, which neural networks require: "the function of neuron activation must be nonlinear - and nothing else. Whatever this nonlinearity is, the network of connections can be constructed, and coefficients of linear connections between the neurons can be adjusted in such a way that the neural network will compute any continuous function from its input signals with any given accuracy."
its gradient is simple: logistic_gradient(x) = (1−x)*x
Binary classification is used here; binary means that we look at each class separately and consider each class as 2 classes (There is / There is no). So we use this formula: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
loss = −(y*log(p) + (1−y)*log(1−p))
Here log means the natural logarithm ln. So:
loss = −log(p) if (y==1), or loss = −log(1−p) if (y==0), where p = logistic_activation(x); this is output[index + stride*n] in the yolo_layer.c source code.
And we should minimize cost: loss = −log(p) or loss = −log(1−p).
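A tiny self-contained check of these formulas (plain C; the function names are mine, not darknet's):

```c
#include <math.h>
#include <stdio.h>

/* logistic (sigmoid) activation, as discussed above */
static float logistic_activate(float x) { return 1.f / (1.f + expf(-x)); }

/* binary cross-entropy for one class: y is the 0/1 label, p the probability */
static float bce_loss(float p, int y) { return y ? -logf(p) : -logf(1.f - p); }

int main(void) {
    float p = logistic_activate(2.0f);  /* ~0.881 */
    printf("p=%.3f loss(y=1)=%.3f loss(y=0)=%.3f\n",
           p, bce_loss(p, 1), bce_loss(p, 0));  /* low loss if y==1, high if y==0 */
    return 0;
}
```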
As said in the MXNET doc: https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html
to minimize −log(p) we should maximize p; likewise, to minimize −log(1−p) we should maximize (1−p)
https://peterroelants.github.io/posts/cross-entropy-logistic/
https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression
https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
https://en.wikipedia.org/wiki/Logistic_regression
This is very similar to the Yolo v2 with Categorical cross-entropy and Softmax activation for multi-class classification: https://peterroelants.github.io/posts/cross-entropy-softmax/
We do it here: https://github.com/AlexeyAB/darknet/blob/527578744b46666fb5cd42393bf9e1fa9af126ee/src/region_layer.c#L154
The same:
t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y−t) = 1−y
t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y−t) = −y
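For comparison, a sketch of the softmax used in the Yolo v2 multi-class case (loosely based on darknet's blas.c; with categorical cross-entropy the resulting delta on each logit has the same target − probability form as above):

```c
#include <math.h>

/* numerically stable softmax over n logits (a sketch, not darknet's exact API) */
void softmax_sketch(const float *input, int n, float *output)
{
    float largest = input[0];
    for (int i = 1; i < n; ++i) if (input[i] > largest) largest = input[i];
    float sum = 0;
    for (int i = 0; i < n; ++i) {
        output[i] = expf(input[i] - largest);  /* subtract max for stability */
        sum += output[i];
    }
    for (int i = 0; i < n; ++i) output[i] /= sum;
    /* with categorical cross-entropy:
       delta[i] = ((i == class_id) ? 1 : 0) - output[i] */
}
```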
Hi @AlexeyAB
Thanks for your detailed explanation. For binary cross-entropy, I wonder why no regularization is included in the loss function, even though a smooth L1 loss is used for the bounding box.
@i-chaochen
a smooth L1 loss is used for the bounding box
What do you mean?
Sorry @AlexeyAB, maybe I didn't say it clearly.
What I mean is: is there any regularization, like an l1-norm or l2-norm penalty, in the loss function for bounding box regression or object classification? (since the overall loss function for YOLO is the sum of squares of the deltas)
Regarding smooth L1 loss: in the following links they mention that SSD and Fast/Faster R-CNN use it for box regression, while R-CNN and SPPNet use an L2 loss. So I wonder why regularization is not added to classification.
https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html
https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
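For reference, the smooth L1 (Huber-style) loss discussed in those links is (standard definition, not darknet code):

```c
#include <math.h>

/* smooth L1 loss as used in Fast/Faster R-CNN box regression:
   0.5*x^2 for |x| < 1, and |x| - 0.5 otherwise */
float smooth_l1(float x)
{
    float ax = fabsf(x);
    return (ax < 1.f) ? 0.5f * x * x : ax - 0.5f;
}
```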
As a side note, I am not sure whether YOLO uses any regularization loss for bounding box regression?
Hope it's clear to you. Thanks
question for @AlexeyAB
You explained that in Yolov3 one anchor can detect two classes. This is indeed happening in my dataset: two labels for the same box. This may be desired behavior in many cases, but in my application I would like to visualize and keep only the label with the highest probability (without using '-thresh'). Usually the second label is incorrect and results in false positives, reducing the overall metrics. Is there some 'setting' that can accomplish this?
thanks,
Luc
Interesting - could you upload a picture of this kind, "two labels for the same box", please?
@i-chaochen
I can't really upload my pictures, but you can see it for yourself using the youtube video that AlexeyAB also shared somewhere:
https://www.youtube.com/watch?v=69Ii3HjUiTM
You can see objects with one box and two labels, for example: car, taxi.
This is actually something I would not like to see in my output; I want only the highest probability for each anchor box. The lower-probability alternative labels also result in false positives, potentially impacting the mAP calculation. Hence my question to @AlexeyAB.
I see your point. I don't think I have ever met this kind of problem; I always have one label for one box. You might have a look at how to use the API for Yolo as DLL and SO libraries.
https://github.com/AlexeyAB/darknet#how-to-use-yolo-as-dll-and-so-libraries
If you're using softmax at the last cost-function layer, there will be only one class label, since softmax picks the maximum one.
The threshold is for NMS, which removes redundant, overlapping bounding boxes rather than class labels.
@LucWuytens You can implement this in your application code - reject the detection with the lower confidence score when the bboxes are equal.
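A minimal sketch of such post-processing, assuming darknet's detection struct from darknet.h (keep_top_class is a hypothetical helper name):

```c
#include "darknet.h"

/* Hypothetical post-processing: keep only the highest-probability class
   per detection, zeroing all other class probabilities. */
void keep_top_class(detection *dets, int ndets)
{
    for (int i = 0; i < ndets; ++i) {
        int best = 0;
        for (int c = 1; c < dets[i].classes; ++c)
            if (dets[i].prob[c] > dets[i].prob[best]) best = c;
        for (int c = 0; c < dets[i].classes; ++c)
            if (c != best) dets[i].prob[c] = 0;  /* suppress non-argmax labels */
    }
}
```

Call it on the array returned by get_network_boxes() before drawing or evaluating, so each box carries at most one label.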