Alexey, thank you very much for this repo. I have also browsed the closed issues and found your responses very helpful (more so than Google).
I wanted to ask some specific questions about detecting objects in high (but also varying) counts, with varying sizes and potentially closely overlapping bounding boxes. I have around 200 images that contain logs. The number of logs per image may vary from perhaps 5 to 1000. This also means that the size of the logs varies – usually if there is a big stack of 1000 then they are small (zoomed-out) and if there are only a couple the photo is more zoomed in.
python valid.py cfg/voc.data cfg/yolo-voc.cfg yolo-voc.weights
python scripts/voc_eval.py results/comp4_det_test_
Thanks very much, Ilia
@ilkarman Hi,
Do you mean anchor sizes? Anchor sizes should be calculated relative to the final feature map: 13x13 (for 416x416 network size) or 39x39 (for 1248x1248 network size, if yolo-voc.cfg is used: 1248/32 = 39). But an anchor size is a float value, so it can be less than 1.0.
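As a rough illustration (a sketch, not darknet code; the function name and numbers are made up for the example), a box size in image pixels maps to final-feature-map units like this:

# Sketch: convert a box size in image pixels to final-feature-map units
# (stride 32, so one cell of the 13x13 grid covers 32x32 network-input pixels).
def box_to_feature_map_units(box_w_px, box_h_px, img_w, img_h, net_size=416):
    w_net = box_w_px * net_size / float(img_w)   # box width in network-input pixels
    h_net = box_h_px * net_size / float(img_h)   # box height in network-input pixels
    return w_net / 32.0, h_net / 32.0            # width/height in feature-map cells

# e.g. a 40x60 px log in a 1280x960 photo, 416x416 network:
print(box_to_feature_map_units(40, 60, 1280, 960))   # ~(0.41, 0.81), so anchors < 1.0 are normal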
Yes, there is non-max suppression - the nms param. The lower the nms value, the fewer bounding boxes are kept:
for video: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/src/demo.c#L73
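For intuition, here is a minimal greedy NMS sketch (illustrative only, not darknet's exact implementation): a box is dropped if its IoU with an already-kept, higher-scoring box exceeds the threshold, so a lower nms value suppresses more boxes.

# Minimal greedy NMS sketch (illustrative, not darknet's exact code).
def iou(a, b):  # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(detections, nms_thresh=0.4):  # detections: list of (score, box)
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kb) <= nms_thresh for _, kb in kept):
            kept.append((score, box))
    return kept   # lower nms_thresh -> more boxes suppressed -> fewer boxes kept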
More accurate calculations look like this (I have corrected my post above):
How to calculate the receptive field: receptive_field.xlsx

The receptive field means that one final activation can't see beyond this window (566 x 566 for yolo-voc.cfg), but the neural network can still learn that it is seeing only a small part of a large object, so the bounding box can be larger than 566 x 566 and you can still get IoU ≈ 1.
The receptive field must be large enough to tell what kind of object it is and which part of the object is visible. Then the neural network can work well.
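The usual way to compute this is the standard receptive-field recurrence; a sketch of the same idea (the layer list here is illustrative, not the exact yolo-voc.cfg stack):

# Standard receptive-field recurrence: each layer widens the window by
# (kernel - 1) * current jump, and strides multiply the jump.
def receptive_field(layers):   # layers: list of (kernel_size, stride)
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. five 3x3-conv + 2x2-maxpool blocks followed by three 3x3 convs:
print(receptive_field([(3, 1), (2, 2)] * 5 + [(3, 1)] * 3))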
Yes, it is very desirable to calculate the anchors for your images.
The minimum values of anchors should approximately correspond to the minimum size of objects (in the final feature-map).
And the maximum values of anchors should approximately correspond to the maximum size of objects (in the final feature-map).
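One way to check this is to scan your YOLO-format label files (class x y w h, normalized) and compare the smallest/largest box sizes in feature-map units against your generated anchors. A hypothetical helper (the path and grid size are just examples):

import glob

# Sketch: min/max ground-truth box sizes in final-feature-map units.
def box_size_range(label_dir, final_grid=13):   # 13 for 416x416, 39 for 1248x1248
    widths, heights = [], []
    for path in glob.glob(label_dir + "/*.txt"):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 5:
                    continue
                w, h = float(parts[3]), float(parts[4])
                widths.append(w * final_grid)
                heights.append(h * final_grid)
    return (min(widths), min(heights)), (max(widths), max(heights))

print(box_size_range("data/obj", final_grid=39))   # "data/obj" is a placeholder path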
If you change the number of anchors, then you should change the filters param in the final convolutional layer and train your model from scratch: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/cfg/yolo-voc.2.0.cfg#L224
filters=(classes + 5)*num_of_anchors
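For example, with the 20 VOC classes and 5 anchors this gives (20 + 5) * 5 = 125; with a single class and 10 anchors it would be (1 + 5) * 10 = 60.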
This param, l.truths = 30*(5), limits the number of objects only for training, not for detection: https://github.com/AlexeyAB/darknet/blob/c1904068afc431ca54771e5dc20f2c588e876956/src/region_layer.c#L30
Because it is used after this line: https://github.com/AlexeyAB/darknet/blob/c1904068afc431ca54771e5dc20f2c588e876956/src/region_layer.c#L177
You can try to increase it, but I haven't tried that myself: https://github.com/AlexeyAB/darknet/issues/313#issuecomment-354285142
Just leave the default thresh = 0.6. Some explanation of cfg params: https://github.com/AlexeyAB/darknet/issues/279#issuecomment-347002399
The threshold for Recall is hardcoded here - you can change this value and recompile: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/src/detector.c#L402
densenet201-yolo2.cfg should help detect very small and very large objects in the same image, but I haven't tested it thoroughly. It will probably require many more images and iterations to learn well.
No, there are no well-functioning functions for getting accuracy (mAP, Recall, PR curve, True/False Positives/Negatives, ...). I plan to add them in the future.
@AlexeyAB thank you very much for such a comprehensive response. You have pretty much cleared up all my questions, and your response along with the Excel file is really useful for me!
@AlexeyAB
I have tested the model with modified anchors many times (both training and testing with the new anchors), but none achieved better results than the original anchors. Do you know why?
@VanitarNordic Are the sizes of your custom objects very different from the Pascal VOC data? Did you also generate 5 clusters, or did you try a few more?
Could it be that, since the network learns to predict ground-truth boxes relative to the anchor boxes, it doesn't matter too much what the initial anchor values are, because the network will compensate for them during training? Or is the prediction of a bounding box somehow bounded by the specified anchor (e.g. a logistic function bounding the output to a maximum of 1)?
However, this would then mean that supplying new anchors at test time only would throw the network off, because the offsets it has learnt (relative to the old anchors) will now be completely different.
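(For context, the YOLOv2 paper decodes each prediction relative to its anchor roughly as below; this is a sketch with my own variable names, not darknet's code. The box center is bounded inside its cell by the logistic function, while the width/height scale the anchor by an exponential, so they are not hard-bounded but are easiest to learn when the anchor is already close.)

import math

# Sketch of YOLOv2-style box decoding (per the YOLO9000 paper).
# cx, cy: grid-cell coordinates; pw, ph: anchor sizes (feature-map units);
# tx, ty, tw, th: raw network outputs.
def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)       # center x, bounded inside the cell
    by = cy + sigmoid(ty)       # center y, bounded inside the cell
    bw = pw * math.exp(tw)      # width  = anchor * exp(offset)
    bh = ph * math.exp(th)      # height = anchor * exp(offset)
    return bx, by, bw, bh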
@VanitarNordic
I have tested the model with modified anchors many times (both training and testing with the new anchors), but none achieved better results than the original anchors. Do you know why?
@ilkarman
@AlexeyAB
I used 5 anchors. I have one class (apples), so you can imagine the shapes and ground truths. I used this script to generate anchors for my images: https://github.com/Jumabek/darknet_scripts
Did you compare the model you trained yourself with the model trained by Joseph?
No, I tested my own model, once with the original anchors and once with the new anchors.
Did you use random=1?
No, in both experiments random was equal to 0.
@AlexeyAB
Is there a limit on the predicted offsets of ground-truth relative to anchor boxes or can they be any number? E.g. if your anchor is way off, the network will just predict a much bigger offset?
Since you mention random=1: I want to add multi-scale training to my setup. My base network size is 1248 (for some reason increasing it even higher makes performance worse, which I don't understand; also, if it's lower, my small objects can't be detected). I was thinking of editing the src to this:
while(get_current_batch(net) < net.max_batches){
    if(l.random && count++%10 == 0){
        printf("Resizing\n");
        // Original: fixed to 320-608
        //int dim = (rand() % 10 + 10) * 32;
        // Since my network size is 1248, maybe try:
        // 32*30 to 32*59 = 960 to 1888 (use % 31 + 30 for up to 1920)
        int dim = (rand() % 30 + 30) * 32;
        // Don't know what this does ... it fixes the resolution to 544x544
        // near the end of training? Should I just comment it out?
        //if (get_current_batch(net)+100 > net.max_batches) dim = 544;
        printf("%d\n", dim);
        args.w = dim;
        args.h = dim;
@VanitarNordic
What is your network size? Also, I'm curious how you judge performance ... by visually inspecting a few images, or by mAP?
@VanitarNordic I have no idea why this happens.
Later I will add my own validation code that will give us mAP, Precision-Recall, avg IoU, and True/False Positives/Negatives, so it will be clearer how exactly the accuracy changes.
@ilkarman Yes, you can use this line int dim = (rand() % 30 + 30) * 32; for random resolutions from 960x960 to 1888x1888.
And yes, you should comment out this line: //if (get_current_batch(net)+100 > net.max_batches) dim = 544;
And you should set random=1 in your cfg-file.
If you use a cfg-file based on yolo-voc.2.0.cfg without adding/removing layers, and the network resolution is set to width=1248 height=1248, then the final feature map will be 39x39.
At the input of the neural network, for best detection, object sizes should satisfy min_obj_size >= 32 and max_obj_size <= 566.
If we want to know the MIN and MAX object sizes on the original image (because images have different resolutions), then we use the coefficient image_width/net_width:
bounding_box_min_width >= 32*image_width/1248
or the same bounding_box_min_width >= image_width/39
bounding_box_max_width <= 566*image_width/1248
or the same bounding_box_max_width <= image_width/2.2
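As a worked example (a sketch using the formulas above; the 566 value is the receptive field discussed earlier):

# Valid bounding-box width range on the original image, per the formulas above.
def valid_box_width_range(image_width, net_width=1248, stride=32, receptive_field=566):
    min_w = stride * image_width / float(net_width)            # = image_width / 39
    max_w = receptive_field * image_width / float(net_width)   # ~= image_width / 2.2
    return min_w, max_w

print(valid_box_width_range(4000))   # e.g. a 4000-px-wide photo -> (~102.6, ~1814.1)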
- Since I have so many detections, would it make sense to increase the num_clusters from 5 to maybe 10 or even more?
- Will generating anchors automatically take care of your comment:
- Are there any parameters that limit the total number of objects that can be detected in the image (that I need to increase)?
- Finally, I see some conflicting reports on what to use as the base configuration file -> yolo-voc.cfg or yolo-voc.2.0.cfg? I use yolo-voc.cfg as the base. They seem to have different anchor values too, so I'm not sure whether I've accidentally generated anchors for yolo-voc.2.0.cfg while using yolo-voc.cfg settings, for example. Sorry, I find this a bit confusing.
The limit is final_feature_map * anchors. If you use yolo-voc.2.0.cfg with resolution 1248x1248, then final_feature_map = 39x39.
And if you use 10 anchors,
then the maximum number of objects that can be detected on the image = 39x39x10 = 15210.
@AlexeyAB Thanks once again for your detailed response.
@VanitarNordic
I was wondering whether there is an error in this script to calculate anchors:
for i in range(anchors.shape[0]):
    anchors[i][0] *= width_in_cfg_file / 32.
    anchors[i][1] *= height_in_cfg_file / 32.
I thought the input to the network is padded when resized, to preserve the aspect ratio. I noticed this because my visualised anchor boxes are rectangles, but my ground-truth boxes are all squares. However, this would be consistent if the network resizes without preserving the aspect ratio.
@ilkarman
No, I did not find any error, but in my case, same as you, the boxes for my objects (apples) are square while the generated anchors are rectangular. I had a discussion about this; you can find it in that repo's issues.
@VanitarNordic Right, interesting discussion. I guess the answer depends on whether this implementation of YOLO preserves aspect ratio or not when resizing. If it does then maybe that explains why performance deteriorates when using custom rectangular anchors for square ground-truth bounding boxes.
It seems from AlexeyAB's post that this version does not keep the aspect ratio but pjreddie's version does, in which case the rectangular anchors may be correct. I think?
Yes, this fork doesn't keep the aspect ratio, because I use it for Training and Detection on images with the same resolution, so the distortions are identical. For example, I train the network on frames from a video stream.
Also, I can use a neural network with the same resolution as the Training and Detection frames.
:-)
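To illustrate why that matters for anchors (a sketch with made-up numbers): when a non-square photo is stretched to a square network input without preserving the aspect ratio, a square ground-truth box becomes a rectangle in network coordinates, which is why anchors generated this way can come out rectangular even for square objects.

# Sketch: a square box in a non-square photo, after stretching to net_size x net_size.
def box_in_network_coords(box_w, box_h, img_w, img_h, net_size=416):
    return box_w * net_size / float(img_w), box_h * net_size / float(img_h)

# e.g. a 100x100 px apple in a 1920x1080 photo:
print(box_in_network_coords(100, 100, 1920, 1080))   # ~(21.7, 38.5): no longer square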
Finally, I did not understand which one is good :-( Is the problem with the repo or with the generated anchors?
@VanitarNordic I think the problem would only arise if you use that script to generate anchors for pjreddie's version?
@ilkarman
The goal is to make the anchors match our custom images, but when I generate them using that script, the results get worse.
And I am sure the results are getting worse, because I test on the same images and videos in both experiments, with the original anchors and with the newly generated anchors from the above-mentioned script.