Darknet: It seems training will not converge to the optimum value.

Created on 30 Nov 2019 · 4 Comments · Source: AlexeyAB/darknet

Hi,

I am training yolov3 with the configuration below on the Berkeley Deep Drive 100k (BDD100K) dataset for 10 classes. Training seems fine to me, but I am suspicious anyway. For training yolov3 I used darknet53.conv.74 as the pretrained weights, because yolo.weights was obtained by training on COCO. I may be wrong in choosing the right pretrained weights for transfer learning on BDD100K.

Anyway, the point is that the loss is decreasing, but slowly. Since I have 10 classes, I followed the instructions of @AlexeyAB and set max_batches to classes*2000, i.e. 20000. I am not sure, but it feels like this training will not converge to around 0.0xxx by 20000 iterations, because it is still around 7 at 11000 iterations. Is this normal, or should I modify other settings? For example, I did not recalculate the anchor boxes for the new dataset, in this case BDD100K. Also, how can I calculate mAP on the BDD100K validation set? Is it okay to use the mAP calculation method for the PASCAL dataset which is mentioned here? All in all, what should I do to obtain the best possible mAP? I can share the weights from my training or any other information.

[net]
# Testing
#batch=64
#subdivisions=1
# Training
batch=64
subdivisions=32
width=608
height=608
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001 
burn_in=1000
max_batches = 20000
policy=steps
steps=5000,12000
scales=.1,.1


All 4 comments

I have used darknet53.conv.74 weights

That's right.

Train for as many iterations as you have training images: if you have 100 000 images, train for 100 000 iterations. Change max_batches and steps accordingly.

What script did you use for converting labels to Yolo format?
Run training with the -show_imgs flag: do you see correct bboxes?

Every dataset uses its own mAP calculation approach.

But you can use ./darknet detector map ... for the mAP calculation if you have a Validation/Test dataset in Yolo format: https://github.com/AlexeyAB/darknet#when-should-i-stop-training

Train for as many iterations as you have training images: if you have 100 000 images, train for 100 000 iterations. Change max_batches and steps accordingly.

Okay, I will increase the iterations. But what about steps and learning rate? Should I add more steps to decrease the learning rate further as training continues? For example:

learning_rate=0.001 
burn_in=1000
max_batches = 100000
policy=steps
steps=20000,40000,60000,80000
scales=.1,.1,.1,.1

What script did you use for converting labels to Yolo format?

I have used this repo with some modifications for converting to Yolo format.
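For anyone checking their own conversion: a Yolo label line is `class x_center y_center width height`, all normalized to [0, 1]. A minimal sketch for a single BDD100K-style box2d given as (x1, y1, x2, y2) in pixels (BDD100K images are 1280x720); `to_yolo` is a hypothetical helper name, not from the repo above:

```python
def to_yolo(box, img_w=1280, img_h=720):
    """Convert a pixel-space (x1, y1, x2, y2) box to a normalized
    Yolo (x_center, y_center, width, height) tuple."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2.0 / img_w
    yc = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return xc, yc, w, h
```

Running training with -show_imgs (or just eyeballing a few converted label files) is a quick sanity check that the normalization is right.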

Run training with the -show_imgs flag: do you see correct bboxes?

I am training on an AWS instance, so I cannot see images with -show_imgs. Nevertheless, I am running the saved weights (e.g. yolov3-bdd100k_6000.weights) and they detect objects fine. There are some problems with traffic lights and signs; I think that's because training is not finished yet.

[Screenshot from 2019-12-01 13-04-01]

By the way, I cannot see the mAP and loss chart on port 8090 when running ./darknet ... -dont_show -mjpeg_port 8090 -map. What could be the reason for that?

I did not see this command in your repo, but thank you: ./darknet detector map ...
I have also converted the validation and test images to Yolo format. Should I use both at the same time, or is one of them enough?

Should I add more steps to decrease the learning rate further as training continues?

No.
https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

change line steps to 80% and 90% of max_batches
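So with max_batches = 100000 that guideline gives steps=80000,90000, and with the original 20000 it gives steps=16000,18000. A trivial sketch (`steps_for` is a hypothetical helper name):

```python
def steps_for(max_batches):
    """Per the repo's guideline, set steps to 80% and 90% of max_batches."""
    return int(max_batches * 0.8), int(max_batches * 0.9)
```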


By the way, I cannot see the mAP and loss chart on port 8090 when running ./darknet ... -dont_show -mjpeg_port 8090 -map. What could be the reason for that?

You should open port 8090 on the Amazon EC2 instance. By default, ports are closed on EC2.

By the way, if you compiled Darknet with OpenCV and run it with the -dont_show flag, then Darknet will create a chart.png file every 100 iterations, with the Loss & mAP.


I have also converted validation and test images to yolo format. Should I use both at the same time or one of them is enough?

One of them is enough.

PS Test images only make sense if they are private: then the customer can check whether you faked the results of your testing (e.g. by training the model on the validation data).

Thanks a lot for your support and guidance.

