Yolov5: Want to figure out critical algorithm of Detect layer

Created on 21 Jul 2020 · 33 Comments · Source: ultralytics/yolov5

❓Question

Hi,
I want to understand the intuition behind bbox detection.
In yolov3, the output can be written as follows:
[image]
[image]

For yolov5, I looked into the source code: https://github.com/ultralytics/yolov5/blob/1e95337f3aec4c12244802bb6e493b07b27aa795/models/yolo.py#L21-L38
and tried to write it as a formula:
[image]

Am I right?

Stale question

All 33 comments

@ChristopherSTAN yes this looks correct! Typically this would be written as 2sigma() rather than sigma() x 2 though.

@glenn-jocher Awesome!
How did you come up with this way of computing the prediction? It is brilliant.

The original yolo/darknet box equations have a serious flaw. Width and Height are completely unbounded as they are simply out=exp(in), which is dangerous, as it can lead to runaway gradients, instabilities, NaN losses and ultimately a complete loss of training. yolov3 suffers from this problem as well as yolov4.

For yolov5 I made sure to patch this error by sigmoiding all model outputs, while also ensuring that the centerpoint remained unchanged 1=fcn(0), so nominal zero outputs from the model would cause the nominal anchor size to be used. The current eqn constrains anchor multiples from a minimum of 0 to a maximum of 4, and the anchor-target matching has also been updated to be width-height multiple based, with a nominal upper threshold hyperparameter of 4.0.

The original thread is https://github.com/ultralytics/yolov3/issues/168
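
To make the difference concrete, here is a minimal scalar sketch in plain Python (not the repository code) comparing the two width/height multipliers described above. The function names are illustrative assumptions; the sigmoid form follows the `(y * 2) ** 2` expression that appears in the code quoted in this thread, where `y` is already a sigmoid output.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function for scalar inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def wh_multiple_exp(t):
    # yolov3/darknet style: anchor multiple = exp(t), unbounded above.
    return math.exp(t)

def wh_multiple_sigmoid(t, exponent=2.0):
    # yolov5 style: anchor multiple = (2 * sigmoid(t)) ** exponent,
    # which can never exceed 2 ** exponent (4 by default).
    return (2.0 * sigmoid(t)) ** exponent

# Both forms return 1 (the nominal anchor size) for a zero input,
# but only the exp form grows without bound for large inputs.
for t in (-5.0, 0.0, 5.0, 50.0):
    print(t, wh_multiple_exp(t), wh_multiple_sigmoid(t))
```

The final box width or height would be this multiple times the anchor size.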

@ChristopherSTAN BTW, you mentioned you were experimenting with lowering hyp['anchor_t']: 4.0, # anchor-multiple threshold paired with an increase in anchor count. This is an interesting approach, but I just realized it would make sense to take this a step further and modify the actual wh function as well to reduce the range from 0-4 to 0-2, otherwise half of your output space is unused, which is a bad design decision, as your neuron outputs may lose up to half of their precision capability.

You can accomplish this by modifying the exponent in the equation to 1.0, which is mathematically equivalent to removing it altogether:

y[..., 2:4] = (y[..., 2:4] * 2) ** 1.0 * self.anchor_grid[i]  # wh
# which is equivalent to:
y[..., 2:4] = y[..., 2:4] * 2 * self.anchor_grid[i]  # wh

This change would need to occur in two places: 1) Detect() module, 2) compute_loss() box calculation:
https://github.com/ultralytics/yolov5/blob/1e95337f3aec4c12244802bb6e493b07b27aa795/utils/utils.py#L472-L475

@glenn-jocher I am afraid I have not thought that far, LOL. Maybe you are thinking of another DL pro.

(Apparently I am not, for now.)

But I will try!
Thanks for your explanation.

@glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

@ChristopherSTAN don't worry, the idea is pretty simple. A neuron can control outputs in a certain range defined by the above equations, the default being 0-4 times the anchor size. If you reduce the hyperparameter that controls the matching threshold to 2.0, then boxes are only matched to anchors when they are less than 2x the anchor size and greater than 1/2x the anchor size. So if an anchor size is 10 pixels, that neuron can match labels between 5 and 20 pixels in size, but it can output a box shape from 0 to 40 pixels. So it is wasting 5/8 of its output span. It has to fit all of its output between 5 and 20, which by definition gives it less fine control for tiny corrections, which will reduce mAP.
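
The matching rule described above can be sketched roughly as follows. This is a simplified scalar illustration, not the repository's implementation; `matches_anchor` and its arguments are hypothetical names, and the strict `<` comparison at the boundary is an assumption.

```python
def matches_anchor(label_wh, anchor_wh, anchor_t=4.0):
    """Return True if a label's (w, h) is within anchor_t times the anchor in both dimensions.

    label_wh and anchor_wh are (width, height) pairs in the same units (e.g. pixels).
    """
    ratios = [l / a for l, a in zip(label_wh, anchor_wh)]
    # For each dimension take the larger of ratio and 1/ratio, so "too big"
    # and "too small" are treated symmetrically, then keep the worst dimension.
    worst = max(max(r, 1.0 / r) for r in ratios)
    return worst < anchor_t

# With a 10x10 anchor and anchor_t = 2.0, labels between 5 and 20 pixels match.
print(matches_anchor((15, 15), (10, 10), anchor_t=2.0))  # inside the 5-20 range
print(matches_anchor((25, 10), (10, 10), anchor_t=2.0))  # width is 2.5x the anchor
print(matches_anchor((25, 10), (10, 10), anchor_t=4.0))  # allowed by the default threshold
```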

So for best results, you want the neuron to have output authority over the entire training space you want it to predict. Even with the default settings, I see I am wasting a bit of training space. With default settings, the 10 pixel anchor neuron can be matched to labels between 2.5 and 40 pixels but can output sizes between 0 and 40, so I am currently wasting about 6% of the output space.
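
The 5/8 and 6% figures can be reproduced with a small calculation. This sketch assumes the "output span" is measured linearly in anchor multiples, as in the arithmetic above; `wasted_output_fraction` is a hypothetical helper name.

```python
def wasted_output_fraction(anchor_t, exponent=2.0):
    """Fraction of the wh output span that no matched label can fall into.

    The neuron can output multiples in [0, 2 ** exponent] of the anchor size,
    while matched labels lie in [1 / anchor_t, anchor_t].
    """
    max_multiple = 2.0 ** exponent
    lo = 1.0 / anchor_t
    hi = min(anchor_t, max_multiple)  # labels above the output limit are unreachable anyway
    used = max(hi - lo, 0.0)
    return 1.0 - used / max_multiple

print(wasted_output_fraction(2.0))                # 0.625, i.e. 5/8 of the 0-4 span
print(wasted_output_fraction(4.0))                # 0.0625, i.e. about 6% with the defaults
print(wasted_output_fraction(2.0, exponent=1.0))  # 0.25 once the span is reduced to 0-2
```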

> @glenn-jocher I followed your idea and set hyp['anchor_t'] = 3.0. Will it work?

Yes, any value will work here, you just need to experiment with what produces the best mAP. If you lower these values though then it would also make sense to adjust the wh equations. For a 3.0 limit you might adjust the equation to this to fully capture the output space:
y[..., 2:4] = (y[..., 2:4] * 2) ** 1.6 * self.anchor_grid[i] # wh
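
As a rough check on where 1.6 comes from: since the sigmoid output times 2 tops out at 2, the largest anchor multiple is 2 raised to the exponent, so the exponent for a given limit is its base-2 logarithm. The helper name below is hypothetical.

```python
import math

def exponent_for_limit(max_multiple):
    # Solve 2 ** exponent == max_multiple for the exponent.
    return math.log2(max_multiple)

for limit in (2.0, 3.0, 4.0):
    e = exponent_for_limit(limit)
    print(f"limit {limit}: exponent {e:.3f}")

# Rounding log2(3) up to 1.6 gives a maximum multiple slightly above 3.
print(f"2 ** 1.6 = {2.0 ** 1.6:.3f}")
```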

@glenn-jocher Right now I am wondering whether I can adjust it to perform well on my datasets, which contain many overlapping, medium-sized objects:
[image]

Am I right that decreasing this parameter (2, 1.73, ...) also limits the size of the output bounding boxes?

You should look at your labels.png to see your size distribution. Yes, changing the exponent in the box equations from 2.0 to 1.6 will limit your output space from 0-4 to roughly 0-3. This would presumably be paired with an increase in anchor count, otherwise recall would suffer.

@glenn-jocher Here:
[image]

And here's another dataset:
[image]

@ChristopherSTAN yes these look pretty typical. You have some very large class imbalances as well. Or wait, it looks like your bar chart is plotted incorrectly, as there are 15 bins but it only goes up to 13. Looks like a plotting bug.

TODO: Fix labels.png bar chart.

Pushed a commit https://github.com/ultralytics/yolov5/commit/4ffd9779d378392f51321bd41dc88df487d4069b for improved plotting. No bug found in current plotting.

Hi, dear Glenn,

I think it is a good time for your team to formalize your work, write it up as a paper, and SHOCK the world. It is really interesting to read your code.

@ChristopherSTAN haha, yes we do need to produce a publication, but we are still exploring design changes. Hopefully around the end of year we can send something to arxiv.

@ChristopherSTAN I have an idea: you could try modifying the L24 activation function in the Conv() layer from LeakyReLU(0.1) to Swish() or Mish() to see if this helps wheat training. I've never tried this, but it may still be possible to start from pretrained weights when you do this:

https://github.com/ultralytics/yolov5/blob/5e970d45c44fff11d1eb29bfc21bed9553abf986/models/common.py#L18-L31

EDIT: You'll have to reduce your batch size, as these will consume much more GPU memory when training.

@glenn-jocher Interesting!
I will try it later. Right now I am considering using the COCO dataset to increase my training data by extracting the intersecting classes. I think it will be a great trick for improving performance on custom datasets. If it works, I will open a PR to see if you are interested.

Edit: I plan to upload some scripts. I am not sure what to call this operation. Maybe we can name it "Enriching data" or something else.

BTW, I am using EfficientDet for the Wheat competition.
But I am using yolov5 on two other datasets.

@glenn-jocher This is my approach:
[image]
Here I have a small dataset with 3600 images. But by extracting data from COCO, we can have more than 30K. I am curious how much it will help.

@glenn-jocher Now I understand how you feel when training on COCO.
I am just using yolov5m with nearly 40K training images, and it takes me 35 min to run one epoch...

@ChristopherSTAN intersecting classes, that's a good term. Yes this would be very useful. OpenImages V5/6 have a lot of intersecting classes with coco.
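
A rough sketch of what extracting "intersecting classes" might look like, assuming annotations are available as a dict in the standard COCO JSON layout (`images`, `annotations`, `categories`). This is not the script used in this thread; `extract_intersecting` and the toy data are hypothetical.

```python
def extract_intersecting(coco, custom_names):
    """Keep only COCO annotations whose class name also exists in the custom dataset.

    coco: dict in COCO annotation format.
    custom_names: list of class names of the custom dataset; the list index
                  is used as the new class id.
    Returns {image_id: [(custom_class_id, bbox), ...]} for images with at least one match.
    """
    name_to_custom_id = {name: i for i, name in enumerate(custom_names)}
    coco_to_custom = {
        c["id"]: name_to_custom_id[c["name"]]
        for c in coco["categories"]
        if c["name"] in name_to_custom_id
    }
    kept = {}
    for ann in coco["annotations"]:
        custom_id = coco_to_custom.get(ann["category_id"])
        if custom_id is None:
            continue  # class not shared with the custom dataset
        kept.setdefault(ann["image_id"], []).append((custom_id, ann["bbox"]))
    return kept

# Toy example with made-up ids and boxes.
toy_coco = {
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
    "annotations": [
        {"image_id": 10, "category_id": 1, "bbox": [0, 0, 5, 5]},
        {"image_id": 10, "category_id": 3, "bbox": [1, 1, 2, 2]},
        {"image_id": 11, "category_id": 1, "bbox": [3, 3, 4, 4]},
    ],
}
print(extract_intersecting(toy_coco, ["car", "truck"]))
```

Images with no shared class drop out entirely, which is why the extracted set is smaller than the full COCO set.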

Yes, COCO can be very slow to train on unfortunately.

> @glenn-jocher This is my approach:
> [image]
> Here I have a small dataset with 3600 images. But by extracting data from COCO, we can have more than 30K. I am curious how much it will help.

I would point out that this is not something you want to do in the long run, depending on the actual images in your own dataset. The COCO dataset may help the model generalise on the objects, but the test dataset and the real-world setting in which you are going to use your trained model usually have their own specifics regarding:

  • point of view from which the pictures/videos have been taken
  • lighting
  • overall environment conditions

For the problems I am solving, I have also used the COCO dataset for the specific classes I am training. However, I reduce the number of COCO images in my dataset each time I have a new batch of real images annotated. And, obviously, one thing you need to make sure of is that there are no COCO images in your val/test set if they do not match your actual real-world scenarios. This can skew your model evaluation pretty badly.

@dlawrences Thanks for your suggestions! This is my first time adding COCO images to my train set. I have the same view of the test set as you: I do not add extra images to the val set, because I still want the test set and dev set to have the same distribution.

Thanks again!

@glenn-jocher I plan to try what this pro said: https://github.com/ultralytics/yolov3/issues/1098#issuecomment-663219984

Try Leaky ReLU first and then Mish.

@ChristopherSTAN ok! I think it's a good idea, because yolov3/4 demonstrated improved mAP with Mish and Swish; it's just that using them for COCO training was next to impossible. Finetuning on smaller datasets with them may be much more within reach, however.

@glenn-jocher And BTW, have you heard of Group Sampling?

Paper: https://openaccess.thecvf.com/content_CVPR_2019/papers/Ming_Group_Sampling_for_Scale_Invariant_Face_Detection_CVPR_2019_paper.pdf

It seems it can improve detection performance with a simple implementation.

@glenn-jocher
I did it like this:

class Conv(nn.Module):
    # Standard convolution
    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, act=True):  # ch_in, ch_out, kernel, stride, padding, groups
        super(Conv, self).__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p), groups=g, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        #self.act = nn.LeakyReLU(0.1, inplace=True) if act else nn.Identity()
        self.act = Mish() if act else nn.Identity()

But it easily runs out of CUDA memory...

So I used your implementation:

self.act = MemoryEfficientMish() if act else nn.Identity()

Still, I needed to halve the batch size (8 -> 4) with yolov5x.

@ChristopherSTAN hmm yes, the memory requirements are pretty terrible. Swish might be a happy middle ground between ReLU and Mish, as it is also a smooth function like Mish yet requires a lot less memory for training.
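
For reference, the three activations under discussion can be written as scalar functions in plain Python. This is only an illustration of their standard definitions, not the repository's `Mish()`, `MemoryEfficientMish()` or `Swish()` modules, and it says nothing about their memory use in training.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function for scalar inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softplus(x):
    # log(1 + exp(x)), written to avoid overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def leaky_relu(x, slope=0.1):
    # Piecewise linear: identity for positive inputs, small slope otherwise.
    return x if x > 0 else slope * x

def swish(x):
    # Swish: x * sigmoid(x), smooth everywhere.
    return x * sigmoid(x)

def mish(x):
    # Mish: x * tanh(softplus(x)), smooth but with more intermediate operations.
    return x * math.tanh(softplus(x))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"{x:5.1f}  leaky={leaky_relu(x):7.4f}  swish={swish(x):7.4f}  mish={mish(x):7.4f}")
```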

Oh, BTW, I just released v2.0, which has a few model and training improvements. YOLOv5x now scores 49.0 on COCO, up from 48.4 before. I'm not sure if this change will affect wheat training positively or negatively though. v2.0 includes breaking changes, as the models are constructed in a simpler way now, so you would need to clone a fresh copy of the repo if you want to start using it.

@glenn-jocher I will start trying it today! I have not updated my copy of the repo in a long time. I think my ideas have hit a bottleneck. It is time to move on and learn more.

@glenn-jocher Bravo! I first trained yolov5x on the mixed dataset (30K images from COCO + 3K from a small dataset) for nearly 50 epochs, then trained for 150 epochs on only the 3K dataset. It took me from 0.67 to 0.71 mAP on the test set!

@ChristopherSTAN oh, that's a big jump! What was the increase due to? The COCO pretraining? Mish/Swish was also mentioned above, or perhaps you used your intersecting classes idea?

I don't see much improvement with Mish/Swish.
The story is this:
You know I am only using Colab, and my notebook disconnected a few days ago. So I got angry and resumed training without the COCO dataset, and observed a great improvement on the val set.

Then I thought about it a little: because data is so important to deep learning models, the external data improved the modeling ability of the model, so when it is then trained on the original custom dataset, we see a great improvement.

Notably, this is a single model trained on fold 0, yet it outperforms my ensemble of models over 5 folds.

With this observation, I will keep the pretrained model, resume it with k-fold training, and then ensemble the results.

So, finally, thanks a lot for your great repo and your hard work on the COCO dataset.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
