@AlexeyAB Thanks for your support.
I have a few questions about yolo_layer.c
[1]. In yolo_layer.c, I know that "delta" means the gradient.
However, as shown below, delta is computed as a simple difference.
Isn't this the gradient of an MSE loss?
In the YOLOv3 paper, the author mentions the following:
"During training we use binary cross-entropy loss for the class predictions."
Why does the class delta above correspond to binary cross-entropy?
[2]. Is the following "l.cost" used for back-propagation? Or is it simply a printed value?
[3]. I want YOLOv3 to output additional information, so I am trying to modify the loss function. In this case, should I fill the "delta" of yolo_layer.c with the gradient of the desired loss function, such as log-likelihood or binary cross-entropy?
Besides this, is there anything else to consider?
I'm sorry to ask questions not strictly related to the code, but I'm a beginner and would appreciate your advice.
Thank you very much.
@doobidoob Hi,
In general, there are two types of classification:
multi-label classification - each bounding box (each anchor) can have several classes, and in total the model has >= 1 classes. Binary cross-entropy with logistic activation (sigmoid) is used. This is what Yolo v3 uses.
multi-class classification - each bounding box (each anchor) can have only one class, and in total the model has >= 1 classes. Categorical cross-entropy with softmax activation is used. This is what Yolo v2 uses.
For multi-label classification, binary cross-entropy is used:
delta = (n == class_id) ? (1 - logistic_activation(x)) : (-logistic_activation(x)) ;
For multi-class classification, categorical cross-entropy is used:
delta = (n == class_id) ? (1 - softmax(x, x_array)) : (-softmax(x, x_array)) ;
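For reference, a simplified sketch of how the multi-label delta is filled inside yolo_layer.c (modeled on darknet's delta_yolo_class; note that output[] already holds the sigmoid-activated scores):

```c
// Simplified sketch of the class delta in yolo_layer.c (multi-label case).
// output[] already contains logistic-activated class scores;
// delta holds the negative gradient that backprop will apply.
void delta_class_sketch(float *output, float *delta, int index,
                        int class_id, int classes, int stride)
{
    for (int n = 0; n < classes; ++n) {
        // 1 - p for the true class, 0 - p for every other class
        delta[index + stride*n] = ((n == class_id) ? 1.f : 0.f)
                                  - output[index + stride*n];
    }
}
```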
This *(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2); is used only as a printed value (the avg loss). It is the summary loss over (x,y,w,h,t0,probabilities...) for all anchors and all final activations.
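To see why that expression is just the sum of squared deltas: mag_array computes the Euclidean norm of the array, so squaring it recovers the sum of squares (a sketch based on darknet's utils.c):

```c
#include <math.h>

// mag_array as in darknet's utils.c: Euclidean norm of an array
float mag_array(float *a, int n)
{
    float sum = 0;
    for (int i = 0; i < n; ++i) sum += a[i] * a[i];
    return sqrtf(sum);
}

// hence the printed cost is simply:
// *(l.cost) = pow(mag_array(l.delta, l.outputs*l.batch), 2) = sum_i delta[i]^2
```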
If you want to change the loss function to get a different result during training, then you should change l.delta = ...
@AlexeyAB Thanks for your reply!
I have a few questions about your reply.
[1]. When using binary cross-entropy, why should "(1-logistic_activation(x))" or "(-logistic_activation(x))" be applied to the delta?
[2]. Why is 1 subtracted when n is class_id?
[3]. Why is "(-logistic_activation(x))" when n is not class_id?
[4]. And why not use "logistic_gradient(x)"?
I know that "delta" means the gradient...
Am I misunderstanding?
I want to change the loss function, but it is not easy to apply it to the code...
Probably because I have not fully understood it.
[5]. I want to use a negative log-likelihood for an additional output besides the original YOLO outputs. What should I do with "delta"?
Thanks in advance for your advice.
@doobidoob
You can read about it here:
The binary cross-entropy loss is −(t*ln(y) + (1−t)*ln(1−y)) - we should minimize it.
Its derivative with respect to the pre-sigmoid input z (where y = sigmoid(z)) is loss_derivative = y − t: https://peterroelants.github.io/posts/cross-entropy-logistic/
y - probability [0 - 1]
t - is the class correct: 1 or not: 0
We do it here: https://github.com/AlexeyAB/darknet/blob/527578744b46666fb5cd42393bf9e1fa9af126ee/src/yolo_layer.c#L143
The same:
t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y−t) = 1−y
t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y−t) = −y
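A one-step chain-rule check of why the derivative with respect to the logit collapses to y − t (standard calculus, not specific to darknet):

```latex
y = \sigma(z), \quad L = -\bigl(t\ln y + (1-t)\ln(1-y)\bigr)
\frac{\partial L}{\partial y} = \frac{y-t}{y(1-y)}, \qquad
\frac{\partial y}{\partial z} = y(1-y)
\quad\Rightarrow\quad
\frac{\partial L}{\partial z}
  = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial z} = y - t
```

This also answers question [4] above: the logistic_gradient factor y(1−y) cancels, so the code can write y − t directly instead of multiplying by logistic_gradient(x).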
Free-form reasoning - in general, in the Yolo v3:
loss = −(y*ln(p) + (1−y)*ln(1−p)) and we should minimize it
p - probability [0 - 1]
y - is the class correct: 1 or not: 0
loss = −ln(p) if (y==1), or loss = −ln(1−p) if (y==0), thus:
if (y==1) then we need -ln(p) = 0, i.e. p = 1
if (y==0) then we need -ln(1−p) = 0, i.e. p = 0
so we can achieve this by maximizing p if (y==1), or maximizing 1−p if (y==0),
if (y==1) then we should maximize logistic_activation(x + delta), so delta > 0
if (y==0) then we should minimize logistic_activation(x + delta), so delta < 0
we do it here: https://github.com/pjreddie/darknet/blob/61c9d02ec461e30d55762ec7669d6a1d3c356fb2/src/yolo_layer.c#L120
The same:
if (y==1): delta = 1−p > 0
if (y==0): delta = −p < 0
where p = logistic_activation(x) = output[index + stride*n]
In general, there are two types of classification:
multi-label classification - each bounding box (each anchor) can have several classes, and in total the model has >= 1 classes. Binary cross-entropy with logistic activation (sigmoid) is used. This is what Yolo v3 uses.
multi-class classification - each bounding box (each anchor) can have only one class, and in total the model has >= 1 classes. Categorical cross-entropy with softmax activation is used. This is what Yolo v2 uses.
Binary cross-entropy with logistic activation (sigmoid) is used for multi-label classification in Yolo v3, so each bounding box (each anchor) can have several classes. For example, one bounding box can be Animal, Cat or Truck, Car. Or even Cat, Dog if they are close to each other.
So:
Logistic activation (sigmoid) = 1./(1. + exp(-x)) is used because:
several class scores can each be ~= 1 at the same time - so we can detect Cat, Dog in one box (multi-label)
it is nonlinear, which neural networks require: "the function of neuron activation must be nonlinear - and nothing else. Whatever this nonlinearity is, the network of connections can be constructed, and coefficients of linear connections between the neurons can be adjusted in such a way that the neural network will compute any continuous function from its input signals with any given accuracy."
its gradient is simple: logistic_gradient(x) = (1−x)*x
Binary classification is used here; binary means that we look at each class separately and consider each class as 2 classes (There is / There is no). So we use this formula: https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html
loss = −(y*log(p) + (1−y)*log(1−p))
Here log means the natural logarithm ln. So:
loss = −log(p) if (y==1), or loss = −log(1−p) if (y==0), where p = logistic_activation(x); this is output[index + stride*n] in the yolo_layer.c source code.
And we should minimize cost: loss = −log(p) or loss = −log(1−p).
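A tiny self-contained check of these formulas (plain C; the function names are mine, not darknet's):

```c
#include <math.h>
#include <stdio.h>

/* logistic (sigmoid) activation, as discussed above */
static float logistic_activate(float x) { return 1.f / (1.f + expf(-x)); }

/* binary cross-entropy for one class: y is the 0/1 label, p the probability */
static float bce_loss(float p, int y) { return y ? -logf(p) : -logf(1.f - p); }

int main(void) {
    float p = logistic_activate(2.0f);  /* ~0.881 */
    printf("p=%.3f loss(y=1)=%.3f loss(y=0)=%.3f\n",
           p, bce_loss(p, 1), bce_loss(p, 0));  /* low loss if y==1, high if y==0 */
    return 0;
}
```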
As said in the MXNET doc: https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html
to minimize −log(p) we should maximize p; likewise, to minimize −log(1−p) we should maximize (1−p)
https://peterroelants.github.io/posts/cross-entropy-logistic/
https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_error_function_and_logistic_regression
https://gluon.mxnet.io/chapter02_supervised-learning/logistic-regression-gluon.html
https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html
https://en.wikipedia.org/wiki/Logistic_regression
This is very similar to the Yolo v2 with Categorical cross-entropy and Softmax activation for multi-class classification: https://peterroelants.github.io/posts/cross-entropy-softmax/
We do it here: https://github.com/AlexeyAB/darknet/blob/527578744b46666fb5cd42393bf9e1fa9af126ee/src/region_layer.c#L154
The same:
t==1: i.e. if (detected_class == truth_class): delta = -loss_derivative = -(y−t) = 1−y
t==0: i.e. if (detected_class != truth_class): delta = -loss_derivative = -(y−t) = −y
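For comparison, a sketch of the softmax used in the Yolo v2 multi-class case (loosely based on darknet's blas.c; with categorical cross-entropy the resulting delta on each logit has the same target − probability form as above):

```c
#include <math.h>

/* numerically stable softmax over n logits (a sketch, not darknet's exact API) */
void softmax_sketch(const float *input, int n, float *output)
{
    float largest = input[0];
    for (int i = 1; i < n; ++i) if (input[i] > largest) largest = input[i];
    float sum = 0;
    for (int i = 0; i < n; ++i) {
        output[i] = expf(input[i] - largest);  /* subtract max for stability */
        sum += output[i];
    }
    for (int i = 0; i < n; ++i) output[i] /= sum;
    /* with categorical cross-entropy:
       delta[i] = ((i == class_id) ? 1 : 0) - output[i] */
}
```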
Hi @AlexeyAB
Thanks for your detailed explanation. For binary cross-entropy, I wonder why no regularization is included in the loss function, even though a smooth L1 loss is used for the bounding box.
@i-chaochen
a smooth L1 loss is used for the bounding box
What do you mean?
Sorry @AlexeyAB, maybe I didn't say it clearly.
What I mean is: is there any regularization, like an l1-norm or l2-norm penalty, in the loss function for bounding box regression or object classification? (since the overall loss function for YOLO is the sum of squares of the deltas)
Regarding smooth L1 loss: in the following links they mention that SSD and Fast/Faster R-CNN use it for box regression, while R-CNN and SPPNet use an L2 loss. So I wonder why regularization is not added to classification.
https://lilianweng.github.io/lil-log/2018/12/27/object-detection-part-4.html
https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
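For reference, the smooth L1 (Huber-style) loss discussed in those links is (standard definition, not darknet code):

```c
#include <math.h>

/* smooth L1 loss as used in Fast/Faster R-CNN box regression:
   0.5*x^2 for |x| < 1, and |x| - 0.5 otherwise */
float smooth_l1(float x)
{
    float ax = fabsf(x);
    return (ax < 1.f) ? 0.5f * x * x : ax - 0.5f;
}
```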
As a side note, I am not sure whether YOLO uses any regularization loss for bounding box regression?
Hope it's clear to you. Thanks
question for @AlexeyAB
You explained that in Yolov3 one anchor can detect two classes. This is indeed happening in my dataset: two labels for the same box. This may be desired behavior in many cases, but in my application I would like to visualize and keep only the label with the highest probability (without using '-thresh'). Usually the second label is incorrect and results in false positives, reducing the overall metrics. Is there some 'setting' that can accomplish this?
thanks,
Luc
Interesting - could you upload a picture of this kind, "two labels for the same box", please?
@i-chaochen
I can't really upload my pictures, but you can see it for yourself using the youtube video that AlexeyAB also shared somewhere:
https://www.youtube.com/watch?v=69Ii3HjUiTM
You can see objects with one box and two labels, for example: car, taxi.
This is actually something I would not like to see in my output; I want only the highest probability for each anchor box. The lower-probability alternative labels also result in false positives, potentially impacting the mAP calculation. Hence my question to @AlexeyAB.
I see your point. I don't think I have ever met this kind of problem; I always have one label for one box. You might have a look at how to use the API for Yolo as DLL and SO libraries.
https://github.com/AlexeyAB/darknet#how-to-use-yolo-as-dll-and-so-libraries
If you're using softmax at the last cost-function layer, there will be only one class label, since softmax picks the maximum one.
The threshold is for NMS, which removes redundant, overlapping bounding boxes rather than class labels.
@LucWuytens You can implement this in your application code - reject the detection with the lower confidence score when the bboxes are equal.
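A minimal sketch of such post-processing, assuming darknet's detection struct from darknet.h (keep_top_class is a hypothetical helper name):

```c
#include "darknet.h"

/* Hypothetical post-processing: keep only the highest-probability class
   per detection, zeroing all other class probabilities. */
void keep_top_class(detection *dets, int ndets)
{
    for (int i = 0; i < ndets; ++i) {
        int best = 0;
        for (int c = 1; c < dets[i].classes; ++c)
            if (dets[i].prob[c] > dets[i].prob[best]) best = c;
        for (int c = 0; c < dets[i].classes; ++c)
            if (c != best) dets[i].prob[c] = 0;  /* suppress non-argmax labels */
    }
}
```

Call it on the array returned by get_network_boxes() before drawing or evaluating, so each box carries at most one label.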