Yolov3: width-height: 'yolo' method vs. 'power' method

Created on 26 Mar 2019 · 17 Comments · Source: ultralytics/yolov3

Hi! We are running your YOLO implementation on a 5-class detection task. However, at some iteration of some epoch (not always the same one), the loss suddenly heads toward infinity and produces nan values. The term that appears to grow exponentially is the wh loss (the wh tensor sometimes has negative values; I don't know if this is normal). Applying your power method, wh = torch.sigmoid(p[..., 2:4]) # wh (power method), instead of wh = p[..., 2:4] # wh (yolo method), seems to stop the problem, and the algorithm no longer diverges. However, the wh loss flattens out at a higher value than the other losses (around 1.07 to 1.08, instead of going down to 0), as shown below:

Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
38/59    425/1242   0.00705      1.07   0.00528  0.000134      1.09         4     0.254
38/59    426/1242    0.0071      1.07   0.00527  0.000133      1.09         4     0.258
38/59    427/1242   0.00711      1.08   0.00527  0.000133      1.09         4     0.254
38/59    428/1242   0.00712      1.08   0.00526  0.000133      1.09         3     0.251

Do you have any idea why this could be happening? What are the intended advantages of using the power method?

Kind regards.

Stale question

Source

100330706

Most helpful comment

@100330706 yes, these are the two places you need to make the switch, and in ONNX export also if you plan to use that feature. Note that the power method currently trends from 0 to 4 as the input trends from negative infinity to positive infinity. The yolo method has unbounded outputs, causing it to diverge on occasion, as you've noticed.

If you feel you need higher wh anchor multiples than 4 in your scenario, you can use a different exponent (e.g. 2^3 instead of 2^2, to range from 0-8). The beauty of this design, and the reason we selected it, is that the output will always equal 1 when the input equals zero, regardless of the exponent used (i.e. (2 * sigmoid(0))^x = 1 regardless of x). This is an important quality that centers the results given a random weight initialization.
[Image: comparison]
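
A minimal sketch of the bounded mapping described in this comment, assuming the form (2 * sigmoid(x)) ** k that appears later in the thread. The names `sigmoid` and `power_wh` and the use of plain Python scalars are illustrative only; this is not the repository's code, which operates on tensors.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def power_wh(x: float, k: int = 2) -> float:
    """Anchor multiple under the 'power' method; stays between 0 and 2**k."""
    return (2.0 * sigmoid(x)) ** k

# Hypothetical raw outputs: very negative, zero, very positive.
for k in (2, 3):
    low, mid, high = power_wh(-20.0, k), power_wh(0.0, k), power_wh(20.0, k)
    print(f"k={k}: f(-20)={low:.2e}  f(0)={mid}  f(20)={high:.4f}  upper bound={2 ** k}")
```

Whatever exponent is chosen, a zero input maps to an anchor multiple of exactly 1, so a freshly initialized network starts out predicting boxes close to the anchor size.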

glenn-jocher on 28 Mar 2019

❤2 👍1

All 17 comments

The wh losses are the most unstable in the darknet implementation due to their unbounded nature, as you noticed. We created the 'power' method to stabilize the training in such situations. Typically if the wh losses diverge they do so in the first few epochs.
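
To illustrate what "unbounded" means here, the sketch below compares the two anchor multiples for a few hypothetical raw network outputs. It assumes the power method uses exponent 2 (range 0 to 4) and that training runs in float32; the function names are made up for this example.

```python
import math

FLOAT32_MAX = 3.4028234663852886e38  # largest finite float32 value

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def yolo_wh(x: float) -> float:
    """'yolo' (darknet) anchor multiple: exp(x), no upper bound."""
    return math.exp(x)

def power_wh(x: float) -> float:
    """'power' anchor multiple: (2 * sigmoid(x)) ** 2, never above 4."""
    return (2.0 * sigmoid(x)) ** 2

for x in (0.0, 5.0, 20.0, 89.0):  # hypothetical raw network outputs
    exceeds = yolo_wh(x) > FLOAT32_MAX
    print(f"x={x:5.1f}  yolo={yolo_wh(x):.3e}  power={power_wh(x):.3f}  "
          f"exceeds float32 range: {exceeds}")
```

Both methods agree at x = 0, but a raw output of only 89 already pushes exp(x) past what float32 can represent, which is one way inf and then nan values can enter a loss.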

However, the switch from the darknet to the power method for the wh parameter needs to be done in multiple places in the code, not just in the loss function. This can make switching back and forth confusing, so we should probably parameterize a switch for these two methods. I'll add this to the TODO list for the next release.

IMPORTANT: If you use the default yolov3.weights you should use the default darknet wh implementation. The power method should only be used when training a new model, and then all inference done on that model must also use the power method.

This issue has a comparison plot of the two methods:
https://github.com/ultralytics/yolov3/issues/12#issuecomment-423531011

glenn-jocher on 26 Mar 2019

@glenn-jocher I appreciate your response. So occasional failure to converge is to be expected when using the darknet method, isn't it? Where else should the power method be applied? So far I've seen it commented out in the build_targets function and in the yolo layer. Would it work to uncomment just these and comment out the darknet ones?

100330706 on 27 Mar 2019

glenn-jocher on 28 Mar 2019

❤2 👍1

TO SUMMARIZE: The 'yolo' width-height (wh) method is the official darknet wh computation method. The 'power' method is a more stable implementation we created to address instability issues arising during training of custom data.

If you are running inference from official weights (e.g. yolov3.weights, yolov3-spp.pt, etc.) DO NOT change the wh computation method; leave the 'yolo' method in place.
If you are training on custom data and your wh loss is diverging, you can try swapping 'yolo' wh for 'power' wh. This change needs to occur in 2 separate areas of the code. If in doubt, search your yolov3 path for wh power to find the instances:

https://github.com/ultralytics/yolov3/blob/f0b4f9f4fb7af71076cc91c48cdf2c9043395ad8/utils/utils.py#L260-L269

https://github.com/ultralytics/yolov3/blob/f0b4f9f4fb7af71076cc91c48cdf2c9043395ad8/models.py#L158-L165

glenn-jocher on 10 Apr 2019

❤2

So all img values are changed to the default color you set in the letterbox function, color = 127.5, 127.5, 127.5, which rounds up to 128. Is that the logic behind it? Nothing else could change all values of img to 128 in the COCO dataset. If one uses gray images, should these numbers change?

sanazss on 21 Jul 2019

@sanazss this is grey padding.

glenn-jocher on 21 Jul 2019

@glenn-jocher I believe that in the newest implementation there are only two places where one needs to make this change (it doesn't seem to be needed in build_targets). If this is right, it could be useful to update your summary to avoid confusion. If this is wrong, then I'd love to know where else to make the change. My loss currently diverges when using the yolo method, but P, R, mAP and F1 all remain zero when using the power method. Odd.

bchugg on 14 Nov 2019

@bchugg it's true that the darknet/power method is subject to divergence (as is clear from the plot above). The introduction of GIoU has _mostly_ suppressed instances of this happening, though it still does happen on occasion, typically when GIoU hyperparameter or SGD LR are set too high.

You are correct also, the introduction of GIoU removed the w/h calculation from build_targets. I will update the 'TO SUMMARIZE' comment above!

In any case, to keep things simple for you, I would leave the wh method to the default and simply reduce your GIoU gain or SGD LR hyperparameters:

https://github.com/ultralytics/yolov3/blob/a96e010251372ad675c4b0ef0ad4da85da3ade49/train.py#L23-L30

glenn-jocher on 15 Nov 2019

👍1

I've clamped the wh output to max=1E4 now to prevent wh divergence. This should resolve the issue completely now.
https://github.com/ultralytics/yolov3/blob/b027c660489399ad30562e17e2a1a2e008176048/utils/utils.py#L342
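
A rough scalar sketch of what capping the exp-based wh output at 1E4 looks like. The linked line is not reproduced in this thread, so the function below is an assumption about the idea rather than a copy of the repository code; on a PyTorch tensor the same cap would presumably be written with `clamp(max=1E4)`. The names `WH_MAX` and `clamped_exp_wh` are hypothetical.

```python
import math

WH_MAX = 1e4  # cap taken from the comment above

def clamped_exp_wh(x: float, wh_max: float = WH_MAX) -> float:
    """exp(x) capped at wh_max.

    Large inputs return the cap directly, so math.exp is never called
    with a value that would raise OverflowError.
    """
    if x >= math.log(wh_max):
        return wh_max
    return min(math.exp(x), wh_max)

for x in (0.0, 5.0, 9.3, 1000.0):  # hypothetical raw network outputs
    print(f"x={x:7.1f}  wh multiple={clamped_exp_wh(x):.3f}")
```

The cap only bounds the output; below 1E4 the mapping is still the plain exp(x) of the 'yolo' method, so weights trained with that method remain compatible.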

glenn-jocher on 24 Nov 2019

@glenn-jocher I was implementing yolov2 from scratch for face detection and hit the same issue: the loss goes to nan just because of the wh_loss term. I came across the power method you developed and started training with it. The exploding gradients problem is gone now, but the model does not converge to an acceptable state, and the loss stagnates at 15.xx after some 1000 epochs. I have some questions that I hope you can help me with:
1) While building the targets, do the w and h dimensions have to be log-transformed, or just expressed in the scale of the grid, by doing something like this:
g_wh = g_wh / (image_w)
2) Does the power method you referred to look something like this:
wh_loss = clamp(exp(sigmoid(g_wh)) , 1e4) * anchor_dimension
If not, can you please correct the loss equation?

3) During inference, the wh transformation looks something like this:
p_wh = exp(wh) * anchor_dimension
Is this way of running inference correct?

Also, I am using MSE for the regression loss.

Thanks.

agarwalyogeesh on 10 Jan 2020

@agarwalyogeesh the power method seen in https://github.com/ultralytics/yolov3/issues/168#issuecomment-477588965 is in units of grid points. There are no log operations. The equations are in https://github.com/ultralytics/yolov3/issues/168#issuecomment-481724580

Note that GIoU loss implementation seems to fix most of the unstable losses in the original exp wh method, and now we use this combination (GIoU loss with original exp wh).

glenn-jocher on 10 Jan 2020

@glenn-jocher, thanks for your reply.
I understand that using GIoU will eliminate the unstable loss problem. But before implementing that, I wanted to experiment with MSE for regression, and wanted to ask what k is in the loss computation in this comment: https://github.com/ultralytics/yolov3/issues/168#issuecomment-481724580

Also, in the inference equation you mentioned above, the range is [0, 8], right?
((sigmoid(x) * 2) ** 3): how is this range compatible with the range used in the loss function, given that k = 1 in the loss computation for the power method in that equation?

The model not converging is what worries me; maybe it's because of too little data, or maybe I need more training time.

Thanks,

agarwalyogeesh on 11 Jan 2020

@agarwalyogeesh k is a hyperparameter, it's tunable. The range can vary from 0 to any number, in our case we set it to 8.

glenn-jocher on 12 Jan 2020

This issue is stale because it has been open 30 days with no activity. Remove Stale label or comment or this will be closed in 5 days.

github-actions[bot] on 8 Mar 2020

Thanks for the wonderful explanation and the idea for solving instability in training.

I want to point out that the output of the power wh method will be in the range [0-8], and that it will be multiplied by the anchor width / height.

But this forces the prediction to be greater than or equal to the size of the anchor, never smaller. Our prediction should be able to handle boxes both smaller and larger than the anchor. So with the current power wh method it is forced to predict only boxes larger than (or equal to) the anchor.

So I suggest we use the tanh function, which has a range of [-1, +1]. We then multiply it by 2 to make the range [-2, 2]. After we square it / cube it, we may get the output range [-4, +4] / [-8, +8].

This may solve that issue and probably provide better results.

meet-minimalist on 15 Apr 2020

@meet-minimalist yes, tanh is very similar to sigmoid, and it outputs from -1 to 1.

But we are outputting a multiple of the anchor, which currently ranges from 0-inf. The exp(x) method has no upper limit on x, causing the instability, but we need an output floor of 0 in all cases.

The proposed change is (sigmoid(x) * 2) ** 3, which ranges from 0-8, and retains the same centerpoint f(0)=1 as exp(x).
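
The difference between the two proposals can be checked numerically. This is a small sketch with hypothetical function names, using the cubed form quoted above for the sigmoid version and the doubled-then-cubed tanh suggested earlier in the thread.

```python
import math

def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def power_wh(x: float) -> float:
    """(sigmoid(x) * 2) ** 3: anchor multiple between 0 and 8."""
    return (sigmoid(x) * 2.0) ** 3

def tanh_wh(x: float) -> float:
    """(tanh(x) * 2) ** 3: the tanh variant, between -8 and 8."""
    return (math.tanh(x) * 2.0) ** 3

for x in (-2.0, 0.0, 2.0):  # hypothetical raw network outputs
    print(f"x={x:+.1f}  tanh variant={tanh_wh(x):+.3f}  sigmoid variant={power_wh(x):.3f}")
```

Negative inputs give the sigmoid variant a multiple below 1 (a box smaller than the anchor) and positive inputs give a multiple above 1, while the tanh variant goes negative for negative inputs and maps a zero input to 0 rather than 1.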

glenn-jocher on 15 Apr 2020

👍2

Yeah, you are correct. My bad, I forgot that this is a multiplier, which needs to be positive in any case.
For a box smaller than the anchor, the prediction value will be in [0-1], and for a box larger than the anchor, the prediction value will be in [1-8].

Thanks again.

meet-minimalist on 16 Apr 2020
