Darknet: what's the meaning of 'mask' in yolov3.cfg [yolo]?

Created on 26 Mar 2018 · 7 comments · Source: pjreddie/darknet

What's the meaning of 'mask' in yolov3.cfg [yolo]?

[yolo]
mask = 6,7,8

[yolo]
mask = 3,4,5


[yolo]
mask = 0,1,2

    if(mask) l.mask = mask;
    else{
        l.mask = calloc(n, sizeof(int));
        for(i = 0; i < n; ++i){
            l.mask[i] = i;
        }
    }

@pjreddie @Broham


All 7 comments

Every layer has to know about all of the anchor boxes but is only predicting some subset of them. This could probably be named something better, but the mask tells the layer which of the bounding boxes it is responsible for predicting. The first [yolo] layer predicts boxes 6, 7, 8 because those are the largest anchors and it's at the coarsest scale. The 2nd [yolo] layer predicts some smaller ones, etc.

The layer assumes that if it isn't passed a mask, it is responsible for all of the bounding boxes, hence the if/else in the parser above.

@pjreddie Thanks a lot! One more question: how do we combine the results of the three [yolo] detection layers?

I'm not quite sure what you mean, could you clarify?

Maybe this helps: the [yolo] layers simply apply logistic activation to some of the neurons, mainly the ones predicting the (x, y) offsets, objectness, and class probabilities. Then if you call get_yolo_detections (or something like that) it interprets the output as described in the paper.

@pjreddie
For example, if there is a dog in a picture: in YOLOv2, we can adjust the threshold to get the best detection result, and each prediction contains 4 bounding-box offsets, 1 objectness prediction, and 80 class predictions. However, YOLOv3 produces a 3-D tensor, and there are 3 [yolo] detection layers. YOLOv3 predicts 3 boxes at each scale, so the tensor is N×N×[3×(4+1+80)] for the 4 bounding-box offsets, 1 objectness prediction, and 80 class predictions.

So how do we process the 3-D tensors to predict a single object?

@pjreddie, mask => coanchors?

On Mon, Mar 26, 2018 at 5:01 PM, jwnsu wrote:

This is similar to RetinaNet: during training, an object is assigned to a specific layer/scale based on its size, and each of the [yolo] layers handles a different scale (i.e., is intended to predict objects of different sizes). It's also fine for 2 layers to predict the same object, as follow-on NMS will resolve the duplicates. You can check Facebook's RetinaNet paper for more info.


