Darknet: How to train YOLOv2 at 544x544?

Created on 21 Jan 2017 · 16 comments · Source: AlexeyAB/darknet

Thanks @AlexeyAB!
I want a higher mAP, so YOLOv2 at 544x544 is the best choice, but there is no training script or cfg file for it in darknet. Have you trained at this resolution?


All 16 comments

@matakk

Simply change these lines in your .cfg file: https://groups.google.com/d/msg/darknet/MumMJ2D8H9Y/UBeJOa-eCwAJ

from

[net]
batch=64
subdivisions=8
height=416
width=416

to

[net]
batch=64
subdivisions=16
height=544
width=544

If you run out of memory, set subdivisions=64.
Then train using this .cfg-file.
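For example, a typical training invocation for this fork looks something like the line below (the .data file, .cfg file and the pre-trained convolutional weights darknet19_448.conv.23 follow the usual VOC example; substitute your own paths):

./darknet detector train data/voc.data cfg/yolo-voc.cfg darknet19_448.conv.23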

You can also train at 416x416 but run detection at 544x544 or more, for example 832x832.

  • Trained at 416x416, detection at 416x416: [image: predictions_416x416]

  • Trained at 416x416, detection at 832x832: [image: predictions_832x832]

What needs to change in order to use weights pretrained at 416x416 to detect at 544x544 or higher?

Thanks,

@kaishijeng To use weights pretrained at 416x416 to detect at 832x832, we need the same kind of changes in your custom .cfg-file or in the default yolo.cfg/yolo-voc.cfg:

from

[net]
batch=64
subdivisions=8
height=416
width=416

to

[net]
batch=64
subdivisions=64
height=832
width=832
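Detection then runs with the unchanged 416x416-trained weights; for example, something like this (file names illustrative):

./darknet detector test data/voc.data cfg/yolo-voc.cfg yolo-voc.weights data/dog.jpg

Only width/height in the .cfg change; the .weights-file stays the same.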

What is the purpose of changing subdivisions from 8 to 64?

Thanks,


@kaishijeng Higher resolution requires more GPU memory.

If you get an "out of memory" error, you should decrease the batch value or increase the subdivisions value; batch/subdivisions images are processed at once.
64/8 requires more than 4 GB of GPU-RAM at 832x832 resolution when cuDNN is used, so you should use 64/64.
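To illustrate the arithmetic (the # comments are annotations, assuming the usual meaning of these two keys):

# batch=64, subdivisions=8  -> 64/8 = 8 images per forward/backward pass
# batch=64, subdivisions=64 -> 64/64 = 1 image per pass (minimal GPU-RAM)
# Either way, weights are updated once per full batch of 64 images.
[net]
batch=64
subdivisions=64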

@AlexeyAB Thanks!!
I'm starting to train with 832, but at the start of training the log shows a few nan values.
Log:

Region Avg IOU: 0.287866, Class: 0.028420, Obj: 0.548111, No Obj: 0.513275, Avg Recall: 0.100000, count: 20
Region Avg IOU: 0.361796, Class: 0.022684, Obj: 0.525597, No Obj: 0.514241, Avg Recall: 0.333333, count: 9
Region Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.513698, Avg Recall: -nan, count: 0
Region Avg IOU: 0.243378, Class: 0.014114, Obj: 0.460395, No Obj: 0.513718, Avg Recall: 0.000000, count: 6
Region Avg IOU: 0.509384, Class: 0.016810, Obj: 0.515260, No Obj: 0.513929, Avg Recall: 0.500000, count: 2

Can I ignore these nan values and continue training?
Thanks again!

@matakk If nan occurs only in some of the lines, I think this is normal. Try to detect each type of object (at least each object once) after 2000-4000 iterations, and if all is ok, then you can continue training.

Also note that you should use a version no earlier than 10 Jan 2017, where this bug was fixed: https://github.com/AlexeyAB/darknet/commit/b831db5e58ad37c8f4adf39de7fb5db204209a1c


About nan: nan occurs here if count == 0: https://github.com/AlexeyAB/darknet/blob/2fc5f6d46b089368d967b3e1ad6b2473b6dc970e/src/region_layer.c#L320

This can happen when a subdivision contains only images without any labeled objects, so count stays 0 and the printed averages are divided by zero.
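A simplified sketch of the statistics print at that line (paraphrased from src/region_layer.c, not a verbatim copy):

/* count is the number of ground-truth boxes processed in this sub-batch;
   when it is 0, the averages evaluate to 0.0/0 = -nan. */
printf("Region Avg IOU: %f, ..., Avg Recall: %f, count: %d\n",
       avg_iou/count, recall/count, count);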

@AlexeyAB
Please have a look at #30.
@matakk
Have you trained on 544x544?

As a result:

  1. There is no simple way to train Yolo at a resolution larger than 416x416, but you can use a .weights-file trained at 416x416 to detect at a larger resolution:

    1.1. Yolo automatically resizes any image to 416x416, which makes it impossible to detect small objects.

    1.2. If you want to detect small objects on 832x832 or 1088x1088 images, then simply use the .weights-file trained at 416x416 and change these lines in your .cfg-file: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/cfg/yolo-voc.cfg#L4

From:

subdivisions=8
height=416
width=416

To:

subdivisions=64
height=1088
width=1088


In detail: the network side (output grid size) will automatically increase from 13 to 34. (Changing subdivisions is only necessary to reduce GPU-RAM consumption.)
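The side is simply the network input size divided by YOLOv2's total downsampling stride of 32:

416 / 32 = 13  ->  13x13 grid
1088 / 32 = 34 ->  34x34 grid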

In the image below:

  • left cfg-file (classes=6, num=5, filters=55, subdivisions=8, width=416, height=416) - as you can see, the output layer is 13x13x55
  • right cfg-file (classes=6, num=5, filters=55, subdivisions=64, width=1088, height=1088) - as you can see, the output layer is 34x34x55

[image: 416_to_1088]
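For reference, the 55 output channels come from the region-layer formula filters = num * (classes + 5): with num=5 anchors and classes=6, filters = 5 * (6 + 5) = 55, since each anchor predicts 6 class scores plus x, y, w, h and objectness.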


  2. If you want to increase precision by training at a higher resolution, then you can train Yolo with dynamic resolution 320x320 - 608x608 by setting the flag random=1: https://github.com/AlexeyAB/darknet/blob/76dbdae388a6c269cbf46d28e53fee8ce4ace94d/cfg/yolo-voc.cfg#L244
    This increases mAP by about 1%: https://arxiv.org/pdf/1612.08242v1.pdf
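In the cfg this flag sits at the end of the [region] section, e.g. (abridged):

[region]
...
random=1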


If you still want to train Yolo at a high resolution such as 1088x1088, you can try the following, but there is no guarantee of success:

From:

subdivisions=8
height=416
width=416

To:

subdivisions=64
height=1088
width=1088

  • then train Yolo as usual

You should get a .cfg-file that looks like this if you use 6 classes (objects) and resolution 1088x1088: http://pastebin.com/GY9NPfmc

I have successfully trained YOLO on 544x544; the trick is that the training images should be bigger than this size. It sacrifices speed, though, as the YOLO authors show on the FPS/mAP curve.

I tried to detect a relatively small object with a network trained at 416x416 and 640x480 video input, but the network cannot detect it from far away.

Could the reason be that I have not included images that show the object from far away in the training/validation dataset?

@VanitarNordic You should not change the aspect ratio; for detection, use a network size of 608x608.

Objects in the training-dataset should have the same relative size, in %, as in the detection-dataset.
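For example (illustrative numbers): an object 50 px wide in a 416 px training image occupies about 50/416 = 12% of the image width, so in a 640 px detection frame it should appear roughly 0.12 * 640 = 77 px wide to match that relative size.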

@AlexeyAB

No, I trained the model at 416x416 resolution.

The live video input resolution for testing is 640x480.

By detection-dataset, do you mean the validation images used in the training process, or unseen images used when we decide to test the model?

The detection-dataset is the images or video on which you want to detect objects. Did you change the network size?

What was the average relative size of the object:

  • in the training-dataset?
  • in the detection-dataset?

"Could the reason be that I have not included images that show the object from far away in the training/validation dataset?"

Yes.

Okay, I got it. Thanks.

Yes, I changed the network size to 416x416 to run the speed test.

Yes, in the training dataset the object sizes are normal, not far away, that's correct. In the detection-dataset I sometimes placed the object far from the camera and it was unable to detect it. I think (as you correctly mentioned) that if I want to detect the object at all scales and in all conditions, I should add training/validation images that cover these conditions.

