Darknet: questions about bounding boxes and grids

Created on 25 Feb 2018 · 7 comments · Source: AlexeyAB/darknet

Hi AlexeyAB,

Thank you very much for sharing this project, and especially for your patience in answering all the questions. 👍

Hope you can help me with these questions:

  1. Sometimes there are two bounding boxes for the same object.
    They don't overlap much with each other, and it seems like one of them is there because of frame latency (not really sure). Could you suggest a way to fix this?
  2. I understand that the number of grid cells changes with the size of the input image.
    Could you point out where to change it in the code, and suggest how to decide the number of grid cells?
    (I think it's related to how much the objects can overlap and how complex the regions to be recognized are.)

  3. Bounding boxes shake a lot during recognition - probably because of the box-regression process. Is there any way to stabilize them a bit more? (say...use anchors of approximately the same size as the targets?)

Thank you!

All 7 comments

Hi @jackwei0117

  1. You can try decreasing the parameter nms=0.1 - it means that if two bounding boxes overlap by more than 10%, only the box with the highest probability is kept (a minimal sketch of this idea follows the list): https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L74

  2. Output grid size depends on:

    • input network size
    • number of maxpool layers with stride=2
      Output grid size = (input network size) / pow(2, (number of maxpool layers with stride=2))

    In both yolo-voc.2.0.cfg and tiny-yolo-voc.cfg the input size is 416x416 and there are 5 maxpool layers with stride=2, so the output size = 416x416 / pow(2, 5) = 416x416 / 32 = 13x13.
    For each cell of the 13x13 grid there are 5 anchors. (A small sketch of this computation also follows the list.)

  3. You can add code here that decreases the precision of width and height: https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L97

    int j;
    // Quantize width and height to steps of 1/20 = 0.05 (in network-relative units),
    // so small frame-to-frame changes do not move the box edges.
    for (j = 0; j < l.w*l.h*l.n; ++j) {
        boxes[j].w = ((float)((int)(20 * boxes[j].w)) + 0.5) / 20.0F;
        boxes[j].h = ((float)((int)(20 * boxes[j].h)) + 0.5) / 20.0F;
    }
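
Below is a minimal, self-contained sketch of the NMS idea from point 1 - it is not the actual darknet implementation (see the linked demo.c for that), and the box_t type and function names are made up for illustration. A box is discarded when its IoU with a more confident box exceeds the nms threshold.

    #include <stdio.h>

    typedef struct { float x, y, w, h, prob; } box_t;   /* illustrative box type, not darknet's */

    /* Intersection-over-union of two boxes given by center (x,y) and size (w,h). */
    static float iou(box_t a, box_t b) {
        float ax1 = a.x - a.w/2, ax2 = a.x + a.w/2, ay1 = a.y - a.h/2, ay2 = a.y + a.h/2;
        float bx1 = b.x - b.w/2, bx2 = b.x + b.w/2, by1 = b.y - b.h/2, by2 = b.y + b.h/2;
        float iw = (ax2 < bx2 ? ax2 : bx2) - (ax1 > bx1 ? ax1 : bx1);
        float ih = (ay2 < by2 ? ay2 : by2) - (ay1 > by1 ? ay1 : by1);
        if (iw <= 0 || ih <= 0) return 0;
        float inter = iw * ih;
        return inter / (a.w*a.h + b.w*b.h - inter);
    }

    /* Greedy suppression: zero out the probability of any box that overlaps a more
       confident box by more than the nms threshold (e.g. 0.1 = 10%). */
    static void nms_suppress(box_t *boxes, int n, float nms) {
        int i, j;
        for (i = 0; i < n; ++i)
            for (j = 0; j < n; ++j)
                if (i != j && boxes[j].prob > boxes[i].prob && iou(boxes[i], boxes[j]) > nms)
                    boxes[i].prob = 0;
    }

    int main(void) {
        box_t b[2] = { {5.0f, 5.0f, 4.0f, 4.0f, 0.9f}, {5.5f, 5.0f, 4.0f, 4.0f, 0.6f} };
        nms_suppress(b, 2, 0.1f);
        printf("probs after NMS: %.2f %.2f\n", b[0].prob, b[1].prob);  /* the weaker duplicate is dropped */
        return 0;
    }

And the grid-size formula from point 2 as a few lines of C (a sketch only; the numbers are the yolo-voc.2.0.cfg example above):

    #include <stdio.h>

    int main(void) {
        int input_size = 416;        /* network input width/height from the cfg */
        int maxpool_stride2 = 5;     /* number of maxpool layers with stride=2 */
        int grid = input_size >> maxpool_stride2;   /* 416 / 2^5 = 416 / 32 = 13 */
        printf("output grid: %dx%d\n", grid, grid);
        return 0;
    }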



(say...use anchors of approximately the same size as the targets?)

What do you mean?

Thank you very much AlexeyAB, you really saved me!!!

As for Q.3, what I mean is: if my targets are all about 50 by 50 pixels,
will setting my anchors to [48,48], [49,49], [50,50], [51,51], [52,52] help reduce the shaking problem?
(rather than [5,5], [25,25], [75,75], [100,100], [200,200])

Probably yes, but you should re-train after the anchors are changed.

And do not forget that every input image is resized to 416x416. So if your object size is [50,50] in a source image with resolution 640x480, the anchor will be 13 x [50,50] / [640, 480] = 1.01, 1.35.
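
A quick sketch of that conversion (an illustration only): anchors in yolo-voc.2.0.cfg are measured in 13x13 grid cells, so a pixel size in the source image is multiplied by 13 and divided by the source resolution.

    #include <stdio.h>

    int main(void) {
        float grid = 13.0f;                    /* output grid of yolo-voc.2.0.cfg */
        float img_w = 640.0f, img_h = 480.0f;  /* source image resolution */
        float obj_w = 50.0f,  obj_h = 50.0f;   /* object size in source-image pixels */
        /* anchor size in grid-cell units: 13 * 50/640 and 13 * 50/480 */
        float anchor_w = grid * obj_w / img_w;
        float anchor_h = grid * obj_h / img_h;
        printf("anchor = %.2f, %.2f\n", anchor_w, anchor_h);   /* the 1.01, 1.35 above (rounded) */
        return 0;
    }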

What anchors do you get using this command?
C:\Python27\python.exe gen_anchors.py -filelist data/train.txt -output_dir data/anchors -num_clusters 5

AlexeyAB,

Where does this "13" come from?
"13 x [50,50] / [640, 480] = 1.01, 1.35"

Thank you for the information, I'll try to run the command later.

According to the default YOLO v2 cfg, the 416x416 input image is downsampled by a total factor of 32, so the final conv layer produces a 13x13 feature map.

416/32 = 13

416x416 -> input image size
32 -> total downsampling factor (five stride-2 maxpool layers)
13x13 -> final conv layer feature map size

sivagnanamn,

Understood, thank you very much for your explanation.

  3. You can add code here that decreases the precision of width and height: https://github.com/AlexeyAB/darknet/blob/e96a454ca11f140a7f7fb82daefe4cc9555a0f26/src/demo.c#L97
  int j;
  for (j = 0; j < l.w*l.h*l.n; ++j) {
      boxes[j].w = ((float)((int)(20 * boxes[j].w))+0.5) / 20.0F;
      boxes[j].h = ((float)((int)(20 * boxes[j].h))+0.5) / 20.0F;
  }

Hi @AlexeyAB

For the latest version of demo.c, is this code still working? I put the code here, but I am not sure it is the right place. If not, where is the right place, or is there another way to stabilize the boxes?
I'd be glad if you could help me.

https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/demo.c#L247

int j;
// Same quantization idea as above, applied to the boxes of the newer demo.c:
// round width/height to multiples of 0.05 so they do not jitter between frames.
for (j = 0; j < l.w*l.h*l.n; ++j) {
    local_nboxes[j].w = ((float)((int)(20 * local_nboxes[j].w)) + 0.5) / 20.0F;
    local_nboxes[j].h = ((float)((int)(20 * local_nboxes[j].h)) + 0.5) / 20.0F;
}