Darknet: questions about bounding boxes and grids

Created on 25 Feb 2018 · 7 comments · Source: AlexeyAB/darknet

Hi AlexeyAB,

Thank you very much for sharing this project, and especially for your patience in answering all the questions. 👍

Hope you can help me with these questions:

  1. Sometimes there are two bounding boxes for the same object.
    They don't overlap much with each other, and it seems like one of them is there because of frame latency (not really sure). Could you suggest a way to fix this?
  2. I understand that the number of grid cells changes with the size of the input image.
    Could you point out where to change it in the code, and suggest how to decide the number of grid cells?
    (I think it's related to how much the objects can overlap and how complex the regions to be recognized are.)

  3. Bounding boxes shake a lot during recognition - probably because of the box-regression process. Is there any way to stabilize them a bit more? (say...use anchors of approximately the same size as the targets?)

Thank you!

All 7 comments

Hi @jackwei0117

  1. You can try decreasing the parameter nms=0.1 - it means that if two bounding boxes overlap by more than 10%, only the box with the highest probability is kept (a minimal sketch of this idea follows the list): https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L74

  2. Output grid size depends on:

    • input network size
    • number of maxpool layers with stride=2
      Output grid size = (input network size) / pow(2, (number of maxpool layers with stride=2))

    In both yolo-voc.2.0.cfg and tiny-yolo-voc.cfg the input size is 416x416 and there are 5 maxpool layers with stride=2, so the output size = 416x416 / pow(2, 5) = 416x416 / 32 = 13x13.
    For each cell of the 13x13 grid there are 5 anchors. (A small sketch of this computation also follows the list.)

  3. You can add code here that decreases the precision of width and height: https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L97

    int j;
    // Quantize width and height to steps of 1/20 = 0.05 (in network-relative units),
    // so small frame-to-frame changes do not move the box edges.
    for (j = 0; j < l.w*l.h*l.n; ++j) {
        boxes[j].w = ((float)((int)(20 * boxes[j].w)) + 0.5) / 20.0F;
        boxes[j].h = ((float)((int)(20 * boxes[j].h)) + 0.5) / 20.0F;
    }
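
Below is a minimal, self-contained sketch of the NMS idea from point 1 - it is not the actual darknet implementation (see the linked demo.c for that), and the box_t type and function names are made up for illustration. A box is discarded when its IoU with a more confident box exceeds the nms threshold.

    #include <stdio.h>

    typedef struct { float x, y, w, h, prob; } box_t;   /* illustrative box type, not darknet's */

    /* Intersection-over-union of two boxes given by center (x,y) and size (w,h). */
    static float iou(box_t a, box_t b) {
        float ax1 = a.x - a.w/2, ax2 = a.x + a.w/2, ay1 = a.y - a.h/2, ay2 = a.y + a.h/2;
        float bx1 = b.x - b.w/2, bx2 = b.x + b.w/2, by1 = b.y - b.h/2, by2 = b.y + b.h/2;
        float iw = (ax2 < bx2 ? ax2 : bx2) - (ax1 > bx1 ? ax1 : bx1);
        float ih = (ay2 < by2 ? ay2 : by2) - (ay1 > by1 ? ay1 : by1);
        if (iw <= 0 || ih <= 0) return 0;
        float inter = iw * ih;
        return inter / (a.w*a.h + b.w*b.h - inter);
    }

    /* Greedy suppression: zero out the probability of any box that overlaps a more
       confident box by more than the nms threshold (e.g. 0.1 = 10%). */
    static void nms_suppress(box_t *boxes, int n, float nms) {
        int i, j;
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j)
                if (i != j && boxes[j].prob > boxes[i].prob && iou(boxes[i], boxes[j]) > nms)
                    boxes[i].prob = 0;
    }

    int main(void) {
        box_t b[2] = { {5.0f, 5.0f, 4.0f, 4.0f, 0.9f}, {5.5f, 5.0f, 4.0f, 4.0f, 0.6f} };
        nms_suppress(b, 2, 0.1f);
        printf("probs after NMS: %.2f %.2f\n", b[0].prob, b[1].prob);  /* the weaker duplicate is dropped */
        return 0;
    }

And the grid-size formula from point 2 as a few lines of C (a sketch only; the numbers are the yolo-voc.2.0.cfg example above):

    #include <stdio.h>

    int main(void) {
        int input_size = 416;        /* network input width/height from the cfg */
        int maxpool_stride2 = 5;     /* number of maxpool layers with stride=2 */
        int grid = input_size >> maxpool_stride2;   /* 416 / 2^5 = 416 / 32 = 13 */
        printf("output grid: %dx%d\n", grid, grid);
        return 0;
    }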



(say...use anchors of approximately the same size as the targets?)

What do you mean?

Thank you very much AlexeyAB, you really saved me!!!

As for Q.3, what I mean is: if my targets are all about 50 by 50 pixels,
will setting my anchors to [48,48], [49,49], [50,50], [51,51], [52,52] help reduce the shaking problem?
(rather than [5,5], [25,25], [75,75], [100,100], [200,200])

Probably yes, but you should re-train after the anchors are changed.

And do not forget that every input image is resized to 416x416. So if your object size is [50,50] in a source image with resolution 640x480, the anchor will be 13 x [50,50] / [640, 480] = 1.01, 1.35.
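
A quick sketch of that conversion (an illustration only): anchors in yolo-voc.2.0.cfg are measured in 13x13 grid cells, so a pixel size in the source image is multiplied by 13 and divided by the source resolution.

    #include <stdio.h>

    int main(void) {
        float grid = 13.0f;                    /* output grid of yolo-voc.2.0.cfg */
        float img_w = 640.0f, img_h = 480.0f;  /* source image resolution */
        float obj_w = 50.0f,  obj_h = 50.0f;   /* object size in source-image pixels */
        /* anchor size in grid-cell units: 13 * 50/640 and 13 * 50/480 */
        float anchor_w = grid * obj_w / img_w;
        float anchor_h = grid * obj_h / img_h;
        printf("anchor = %.2f, %.2f\n", anchor_w, anchor_h);   /* the 1.01, 1.35 above (rounded) */
        return 0;
    }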

What anchors do you get using this command?
C:\Python27\python.exe gen_anchors.py -filelist data/train.txt -output_dir data/anchors -num_clusters 5

AlexeyAB,

Where does this "13" come from?
"13 x [50,50] / [640, 480] = 1.01, 1.35"

Thank you for the information, I'll try to run the command later.

According to the default YOLO v2 cfg, the 416x416 input image is downsampled by a total factor of 32, so the final conv layer produces a 13x13 feature map.

416/32 = 13

416x416 -> input image size
32 -> total downsampling factor (five stride-2 maxpool layers)
13x13 -> final conv layer feature map size

sivagnanamn,

Understood, thank you very much for your explanation.

  3. You can add code here that decreases the precision of width and height: https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L97
  int j;
  for (j = 0; j < l.w*l.h*l.n; ++j) {
      boxes[j].w = ((float)((int)(20 * boxes[j].w))+0.5) / 20.0F;
      boxes[j].h = ((float)((int)(20 * boxes[j].h))+0.5) / 20.0F;
  }

Hi @AlexeyAB

For the latest version of demo.c, is this code still working? I put the code here, but I am not sure it is the right place. If not, where is the right place, or is there another way to stabilize the boxes?
I'd be glad if you could help me.

https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/demo.c#L247

int j;
// Same quantization idea as above, applied to the boxes of the newer demo.c:
// round width/height to multiples of 0.05 so they do not jitter between frames.
for (j = 0; j < l.w*l.h*l.n; ++j) {
    local_nboxes[j].w = ((float)((int)(20 * local_nboxes[j].w)) + 0.5) / 20.0F;
    local_nboxes[j].h = ((float)((int)(20 * local_nboxes[j].h)) + 0.5) / 20.0F;
}