Darknet: what's the meaning of 'mask' in yolov3.cfg [yolo]?

Created on 26 Mar 2018 · 7 comments · Source: pjreddie/darknet

What's the meaning of 'mask' in yolov3.cfg [yolo]?

[yolo]
mask = 6,7,8

[yolo]
mask = 3,4,5


[yolo]
mask = 0,1,2

    if(mask) l.mask = mask;
    else{
        l.mask = calloc(n, sizeof(int));
        for(i = 0; i < n; ++i){
            l.mask[i] = i;
        }
    }

@pjreddie @Broham


All 7 comments

Every layer has to know about all of the anchor boxes but is only predicting some subset of them. This could probably be named something better, but the mask tells the layer which of the bounding boxes it is responsible for predicting. The first [yolo] layer predicts boxes 6, 7, 8 because those are the largest anchors and it's at the coarsest scale. The 2nd [yolo] layer predicts some smaller ones, etc.

The layer assumes that if it isn't passed a mask, it is responsible for all of the bounding boxes, hence the if/else in the parser above.

@pjreddie Thanks a lot! One more question: how do we combine the results of the three [yolo] detection layers?

I'm not quite sure what you mean, could you clarify?

Maybe this helps: the [yolo] layers simply apply logistic activation to some of the neurons, mainly the ones predicting the (x, y) offsets, objectness, and class probabilities. Then if you call get_yolo_detections (or something like that) it interprets the output as described in the paper.

@pjreddie
For example, if there is a dog in a picture: in YOLOv2, we can adjust the threshold to get the best detection result, and each prediction contains 4 bounding-box offsets, 1 objectness prediction, and 80 class predictions. However, YOLOv3 produces a 3-D tensor, and there are 3 [yolo] detection layers. YOLOv3 predicts 3 boxes at each scale, so the tensor is N×N×[3×(4+1+80)] for the 4 bounding-box offsets, 1 objectness prediction, and 80 class predictions.

So how do we process the 3-D tensors to predict a single object?

@pjreddie, mask => coanchors?

On Mon, Mar 26, 2018 at 5:01 PM, jwnsu wrote:

This is similar to RetinaNet: during training, an object is assigned to a specific layer/scale based on its size, and each of the [yolo] layers handles a different scale (i.e., is intended to predict objects of different sizes). It's also fine for 2 layers to predict the same object, as follow-on NMS will resolve the duplicates. You can check Facebook's RetinaNet paper for more info.


