Alexey, thank you very much for this repo. I have also browsed the closed issues and found your responses very helpful (more so than Google).
I wanted to ask some specific questions about detecting objects in high (but also varying) counts, with varying sizes and potentially closely overlapping bounding boxes. I have around 200 images that contain logs. The number of logs per image may vary from perhaps 5 to 1000. This also means that the size of the logs varies – usually if there is a big stack of 1000 then they are small (zoomed-out) and if there are only a couple the photo is more zoomed in.
python valid.py cfg/voc.data cfg/yolo-voc.cfg yolo-voc.weights
python scripts/voc_eval.py results/comp4_det_test_
Thanks very much, Ilia
@ilkarman Hi,
Do you mean anchor sizes? Anchor sizes should be calculated relative to the final feature map: 13x13 (for 416x416 network size) or 39x39 (for 1248x1248 network size, if yolo-voc.cfg is used: 1248/32 = 39). But an anchor size is a float value, so it can be less than 1.0.
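As a rough illustration (a sketch, not darknet code; the function name and numbers are made up for the example), a box size in image pixels maps to final-feature-map units like this:

# Sketch: convert a box size in image pixels to final-feature-map units
# (stride 32, so one cell of the 13x13 grid covers 32x32 network-input pixels).
def box_to_feature_map_units(box_w_px, box_h_px, img_w, img_h, net_size=416):
    w_net = box_w_px * net_size / float(img_w)   # box width in network-input pixels
    h_net = box_h_px * net_size / float(img_h)   # box height in network-input pixels
    return w_net / 32.0, h_net / 32.0            # width/height in feature-map cells

# e.g. a 40x60 px log in a 1280x960 photo, 416x416 network:
print(box_to_feature_map_units(40, 60, 1280, 960))   # ~(0.41, 0.81), so anchors < 1.0 are normal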
Yes, there is non-max suppression - the nms param. The lower the nms value, the fewer bounding boxes are kept:
for video: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/src/demo.c#L73
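For intuition, here is a minimal greedy NMS sketch (illustrative only, not darknet's exact implementation): a box is dropped if its IoU with an already-kept, higher-scoring box exceeds the threshold, so a lower nms value suppresses more boxes.

# Minimal greedy NMS sketch (illustrative, not darknet's exact code).
def iou(a, b):  # boxes as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(detections, nms_thresh=0.4):  # detections: list of (score, box)
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        if all(iou(box, kb) <= nms_thresh for _, kb in kept):
            kept.append((score, box))
    return kept   # lower nms_thresh -> more boxes suppressed -> fewer boxes kept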
More accurate calculations look like this (I have corrected my post above):
How to calculate the receptive field: receptive_field.xlsx

The receptive field means that one final activation can't see beyond this window (566 x 566 for yolo-voc.cfg), but the neural network can still learn that it is seeing only a small part of a large object, so the bounding box can be larger than 566 x 566 and you can still get IoU ≈ 1.
The receptive field must be large enough to tell what kind of object it is and which part of the object is visible. Then the neural network can work well.
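The usual way to compute this is the standard receptive-field recurrence; a sketch of the same idea (the layer list here is illustrative, not the exact yolo-voc.cfg stack):

# Standard receptive-field recurrence: each layer widens the window by
# (kernel - 1) * current jump, and strides multiply the jump.
def receptive_field(layers):   # layers: list of (kernel_size, stride)
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# e.g. five 3x3-conv + 2x2-maxpool blocks followed by three 3x3 convs:
print(receptive_field([(3, 1), (2, 2)] * 5 + [(3, 1)] * 3))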
Yes, it is very desirable to calculate the anchors for your images.
The minimum values of anchors should approximately correspond to the minimum size of objects (in the final feature-map).
And the maximum values of anchors should approximately correspond to the maximum size of objects (in the final feature-map).
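One way to check this is to scan your YOLO-format label files (class x y w h, normalized) and compare the smallest/largest box sizes in feature-map units against your generated anchors. A hypothetical helper (the path and grid size are just examples):

import glob

# Sketch: min/max ground-truth box sizes in final-feature-map units.
def box_size_range(label_dir, final_grid=13):   # 13 for 416x416, 39 for 1248x1248
    widths, heights = [], []
    for path in glob.glob(label_dir + "/*.txt"):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 5:
                    continue
                w, h = float(parts[3]), float(parts[4])
                widths.append(w * final_grid)
                heights.append(h * final_grid)
    return (min(widths), min(heights)), (max(widths), max(heights))

print(box_size_range("data/obj", final_grid=39))   # "data/obj" is a placeholder path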
If you change the number of anchors, then you should change the filters param in the final convolutional layer and train your model from scratch: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/cfg/yolo-voc.2.0.cfg#L224
filters=(classes + 5)*num_of_anchors
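For example, with the 20 VOC classes and 5 anchors this gives (20 + 5) * 5 = 125; with a single class and 10 anchors it would be (1 + 5) * 10 = 60.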
This param, l.truths = 30*(5), limits the number of objects only for training, not for detection: https://github.com/AlexeyAB/darknet/blob/c1904068afc431ca54771e5dc20f2c588e876956/src/region_layer.c#L30
Because it is used after this line: https://github.com/AlexeyAB/darknet/blob/c1904068afc431ca54771e5dc20f2c588e876956/src/region_layer.c#L177
You can try to increase it, but I haven't tried that myself: https://github.com/AlexeyAB/darknet/issues/313#issuecomment-354285142
Just leave the default thresh = 0.6. Some explanation of cfg params: https://github.com/AlexeyAB/darknet/issues/279#issuecomment-347002399
The threshold for Recall is hardcoded here - you can change this value and recompile: https://github.com/AlexeyAB/darknet/blob/aeb15b3cb9157f5d0b2a9962e17de22560b8a1b2/src/detector.c#L402
densenet201-yolo2.cfg should help detect very small and very large objects in the same image, but I haven't tested it thoroughly. It will probably require many more images and iterations to learn well.
No, there are no well-functioning functions for getting accuracy (mAP, Recall, PR curve, True/False Positives/Negatives, ...). I plan to add them in the future.
@AlexeyAB thank you very much for such a comprehensive response. You have pretty much cleared up all my questions, and your response along with the Excel file is really useful for me!
@AlexeyAB
I have tested the model with modified anchors many times (both training and testing with the new anchors), but none achieved better results than the original anchors. Do you know why?
@VanitarNordic Are the sizes of your custom objects very different from the Pascal VOC data? Did you also generate 5 clusters, or did you try a few more?
Could it be that, since the network learns to predict ground-truth boxes relative to the anchor boxes, it doesn't matter too much what the initial anchor values are, because the network will compensate for them during training? Or is the prediction of a bounding box somehow bounded by the specified anchor (e.g. a logistic function bounding the output to a maximum of 1)?
However, this would then mean that supplying new anchors at test time only would throw the network off, because the offsets it has learnt (relative to the old anchors) will now be completely different.
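(For context, the YOLOv2 paper decodes each prediction relative to its anchor roughly as below; this is a sketch with my own variable names, not darknet's code. The box center is bounded inside its cell by the logistic function, while the width/height scale the anchor by an exponential, so they are not hard-bounded but are easiest to learn when the anchor is already close.)

import math

# Sketch of YOLOv2-style box decoding (per the YOLO9000 paper).
# cx, cy: grid-cell coordinates; pw, ph: anchor sizes (feature-map units);
# tx, ty, tw, th: raw network outputs.
def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)       # center x, bounded inside the cell
    by = cy + sigmoid(ty)       # center y, bounded inside the cell
    bw = pw * math.exp(tw)      # width  = anchor * exp(offset)
    bh = ph * math.exp(th)      # height = anchor * exp(offset)
    return bx, by, bw, bh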
@VanitarNordic
I have tested the model with modified anchors many times (both training and testing with the new anchors), but none achieved better results than the original anchors. Do you know why?
@ilkarman
@AlexeyAB
I used 5 anchors. I have one class (apples), so you can imagine the shapes and ground truths. I used this script to generate anchors for my images: https://github.com/Jumabek/darknet_scripts
Did you compare the model you trained yourself with the model trained by Joseph?
No, I tested my own model, once with the original anchors and once with the new anchors.
Did you use random=1?
No, in both experiments random was equal to 0.
@AlexeyAB
Is there a limit on the predicted offsets of ground-truth relative to anchor boxes or can they be any number? E.g. if your anchor is way off, the network will just predict a much bigger offset?
Since you mention random=1: I want to add multi-scale training to my setup. My base network size is 1248 (for some reason increasing it even higher makes performance worse, which I don't understand; also, if it's lower, my small objects can't be detected). I was thinking of editing the src to this:
while(get_current_batch(net) < net.max_batches){
    if(l.random && count++%10 == 0){
        printf("Resizing\n");
        // Original: fixed to 320-608
        //int dim = (rand() % 10 + 10) * 32;
        // Since my network size is 1248, maybe try:
        // 32*30 to 32*59 = 960 to 1888 (use % 31 + 30 for up to 1920)
        int dim = (rand() % 30 + 30) * 32;
        // Don't know what this does ... it fixes the resolution to 544x544
        // near the end of training? Should I just comment it out?
        //if (get_current_batch(net)+100 > net.max_batches) dim = 544;
        printf("%d\n", dim);
        args.w = dim;
        args.h = dim;
@VanitarNordic
What is your network size? Also, I'm curious how you judge performance ... by visually inspecting a few images, or by mAP?
@VanitarNordic I have no idea why this happens.
Later I will add my own validation code that will give us mAP, Precision-Recall, avg IoU, and True/False Positives/Negatives, so it will be clearer how exactly the accuracy changes.
@ilkarman Yes, you can use this line int dim = (rand() % 30 + 30) * 32; for random resolutions from 960x960 to 1888x1888.
And yes, you should comment out this line: //if (get_current_batch(net)+100 > net.max_batches) dim = 544;
And you should set random=1 in your cfg-file.
If you use a cfg-file based on yolo-voc.2.0.cfg without adding/removing layers, and the network resolution is set to width=1248 height=1248, then the final feature map will be 39x39.
At the input of the neural network, for best detection, object sizes should satisfy min_obj_size >= 32 and max_obj_size <= 566.
If we want to know the MIN and MAX object sizes on the original image (because images have different resolutions), then we use the coefficient image_width/net_width:
bounding_box_min_width >= 32*image_width/1248
or the same bounding_box_min_width >= image_width/39
bounding_box_max_width <= 566*image_width/1248
or the same bounding_box_max_width <= image_width/2.2
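As a worked example (a sketch using the formulas above; the 566 value is the receptive field discussed earlier):

# Valid bounding-box width range on the original image, per the formulas above.
def valid_box_width_range(image_width, net_width=1248, stride=32, receptive_field=566):
    min_w = stride * image_width / float(net_width)            # = image_width / 39
    max_w = receptive_field * image_width / float(net_width)   # ~= image_width / 2.2
    return min_w, max_w

print(valid_box_width_range(4000))   # e.g. a 4000-px-wide photo -> (~102.6, ~1814.1)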
- Since I have so many detections, would it make sense to increase the num_clusters from 5 to maybe 10 or even more?
- Will generating anchors automatically take care of your comment:
- Are there any parameters that limit the total number of objects that can be detected in the image (that I need to increase)?
- Finally, I see some conflicting reports on what to use as the base configuration file -> yolo-voc.cfg or yolo-voc.2.0.cfg? I use yolo-voc.cfg as the base. They seem to have different anchor values too, so I'm not sure whether I've accidentally generated anchors for yolo-voc.2.0.cfg while using yolo-voc.cfg settings, for example. Sorry, I find this a bit confusing.
The limit is final_feature_map * anchors. If you use yolo-voc.2.0.cfg with resolution 1248x1248, then final_feature_map = 39x39.
And if you use 10 anchors,
then the maximum number of objects that can be detected on the image = 39x39x10 = 15210.
@AlexeyAB Thanks once again for your detailed response.
@VanitarNordic
I was wondering whether there is an error in this script to calculate anchors:
for i in range(anchors.shape[0]):
    anchors[i][0] *= width_in_cfg_file / 32.
    anchors[i][1] *= height_in_cfg_file / 32.
I thought the input to the network is padded when resized, to preserve the aspect ratio. I noticed this because my visualised anchor boxes are rectangles, but my ground-truth boxes are all squares. However, this would be consistent if the network resizes without preserving the aspect ratio.
@ilkarman
No, I did not find any error, but in my case, same as you, the boxes for my objects (apples) are square while the generated anchors are rectangular. I had a discussion about this; you can find it in that repo's issues.
@VanitarNordic Right, interesting discussion. I guess the answer depends on whether this implementation of YOLO preserves aspect ratio or not when resizing. If it does then maybe that explains why performance deteriorates when using custom rectangular anchors for square ground-truth bounding boxes.
It seems from AlexeyAB's post that this version does not keep the aspect ratio but pjreddie's version does, in which case the rectangular anchors may be correct. I think?
Yes, this fork doesn't keep the aspect ratio, because I use it for Training and Detection on images with the same resolution, so the distortions are identical. For example, I train the network on frames from a video stream.
Also, I can use a neural network with the same resolution as the Training and Detection frames.
:-)
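To illustrate why that matters for anchors (a sketch with made-up numbers): when a non-square photo is stretched to a square network input without preserving the aspect ratio, a square ground-truth box becomes a rectangle in network coordinates, which is why anchors generated this way can come out rectangular even for square objects.

# Sketch: a square box in a non-square photo, after stretching to net_size x net_size.
def box_in_network_coords(box_w, box_h, img_w, img_h, net_size=416):
    return box_w * net_size / float(img_w), box_h * net_size / float(img_h)

# e.g. a 100x100 px apple in a 1920x1080 photo:
print(box_in_network_coords(100, 100, 1920, 1080))   # ~(21.7, 38.5): no longer square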
Finally, I did not understand which one is good :-( Is the problem with the repo or with the generated anchors?
@VanitarNordic I think the problem would only arise if you use that script to generate anchors for pjreddie's version?
@ilkarman
The goal is to make the anchors match our custom images, but when I generate them using that script, the results get worse.
And I am sure the results are getting worse, because I test on the same images and videos in both experiments, with the original anchors and with the newly generated anchors from the above-mentioned script.