I'm using a subset of the COCO dataset (around 35-40k images) to train 8 classes with YOLOv2. I've modified the cfg file as follows:
learning_rate=0.0001 # I've also tried with 0.001
[region]
classes=8
[convolutional]
filters=65 # (8+5)*5
But even after 13,000 iterations, the average loss does not go below 1.
Region Avg IOU: 0.567968, Class: 0.892152, Obj: 0.360092, No Obj: 0.008364, Avg Recall: 0.750000, count: 12
13370: 0.955227, 1.153029 avg, 0.000100 rate, 3.976288 seconds, 855680 images
Resizing
384 x 384
try to allocate workspace = 10616832 * sizeof(float), CUDA allocate done!
Loaded: 0.000045 seconds
Region Avg IOU: 0.554906, Class: 0.751182, Obj: 0.234398, No Obj: 0.009464, Avg Recall: 0.656250, count: 32
Region Avg IOU: 0.618055, Class: 0.982063, Obj: 0.314850, No Obj: 0.009789, Avg Recall: 0.736842, count: 19
Region Avg IOU: 0.629569, Class: 0.921974, Obj: 0.383278, No Obj: 0.009722, Avg Recall: 0.769231, count: 13
Region Avg IOU: 0.701359, Class: 0.939804, Obj: 0.623751, No Obj: 0.008059, Avg Recall: 0.900000, count: 10
Region Avg IOU: 0.565361, Class: 0.939754, Obj: 0.294956, No Obj: 0.011120, Avg Recall: 0.677419, count: 31
Region Avg IOU: 0.670586, Class: 0.980710, Obj: 0.372620, No Obj: 0.006666, Avg Recall: 0.900000, count: 10
Region Avg IOU: 0.704796, Class: 0.967884, Obj: 0.418710, No Obj: 0.007288, Avg Recall: 0.857143, count: 14
Region Avg IOU: 0.546709, Class: 0.989108, Obj: 0.359880, No Obj: 0.009863, Avg Recall: 0.631579, count: 19
13371: 0.613410, 1.099067 avg, 0.000100 rate, 4.825989 seconds, 855744 images
Loaded: 0.000050 seconds
Region Avg IOU: 0.408547, Class: 0.902766, Obj: 0.239194, No Obj: 0.008172, Avg Recall: 0.465116, count: 43
Region Avg IOU: 0.681122, Class: 0.954415, Obj: 0.351681, No Obj: 0.011090, Avg Recall: 0.900000, count: 30
Region Avg IOU: 0.664868, Class: 0.993532, Obj: 0.382928, No Obj: 0.009634, Avg Recall: 0.842105, count: 19
Region Avg IOU: 0.482414, Class: 0.941180, Obj: 0.314420, No Obj: 0.007617, Avg Recall: 0.583333, count: 24
Region Avg IOU: 0.563489, Class: 0.930170, Obj: 0.346102, No Obj: 0.009122, Avg Recall: 0.600000, count: 20
Region Avg IOU: 0.446973, Class: 0.919570, Obj: 0.192612, No Obj: 0.010219, Avg Recall: 0.533333, count: 45
Region Avg IOU: 0.693021, Class: 0.951645, Obj: 0.363812, No Obj: 0.006403, Avg Recall: 0.900000, count: 10
Region Avg IOU: 0.456339, Class: 0.947699, Obj: 0.347462, No Obj: 0.008376, Avg Recall: 0.481481, count: 27
13372: 1.040705, 1.093231 avg, 0.000100 rate, 5.179664 seconds, 855808 images
Loaded: 0.000061 seconds
Is there something wrong here, or is this the expected behaviour? Should I keep training in the hope that the loss will go down?
Your learning rate is too small.
Use this cfg-file without any changes: https://github.com/AlexeyAB/darknet/blob/6390a5a2ab61a0bdf6f1a9a6b4a739c16b36e0d7/cfg/yolov2.cfg
It should be trained for about 500,000 iterations, so you should wait.
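While waiting, it can help to watch the trend of the average loss rather than individual iterations. Below is a minimal Python sketch that pulls the per-iteration summary lines out of a log like the one posted above. The regular expression and the `parse_summaries` helper are illustrative assumptions based only on the line format shown in this thread; they are not part of darknet.

```python
import re

# Matches summary lines such as:
# "13370: 0.955227, 1.153029 avg, 0.000100 rate, 3.976288 seconds, 855680 images"
# (pattern inferred from the log in this thread; an assumption, not a darknet API)
SUMMARY = re.compile(
    r"^\s*(?P<iteration>\d+):\s+(?P<loss>[\d.]+),\s+(?P<avg>[\d.]+) avg,\s+"
    r"(?P<rate>[\d.]+) rate,\s+(?P<seconds>[\d.]+) seconds,\s+(?P<images>\d+) images"
)


def parse_summaries(lines):
    """Return one dict per iteration summary line; other log lines are skipped."""
    rows = []
    for line in lines:
        m = SUMMARY.match(line)
        if m:
            rows.append({
                "iteration": int(m["iteration"]),
                "loss": float(m["loss"]),
                "avg_loss": float(m["avg"]),
                "rate": float(m["rate"]),
                "images": int(m["images"]),
            })
    return rows


# Sample lines copied from the log above.
log = [
    "13370: 0.955227, 1.153029 avg, 0.000100 rate, 3.976288 seconds, 855680 images",
    "Region Avg IOU: 0.554906, Class: 0.751182, Obj: 0.234398, No Obj: 0.009464, Avg Recall: 0.656250, count: 32",
    "13371: 0.613410, 1.099067 avg, 0.000100 rate, 4.825989 seconds, 855744 images",
    "13372: 1.040705, 1.093231 avg, 0.000100 rate, 5.179664 seconds, 855808 images",
]

for row in parse_summaries(log):
    # images // iteration gives the number of images consumed per iteration
    print(row["iteration"], row["avg_loss"], row["images"] // row["iteration"])
```

For the lines above, images divided by iteration comes out to 64 images per iteration. Plotting `avg_loss` against `iteration` over a few thousand iterations gives a clearer picture than the noisy per-iteration loss.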
@AlexeyAB I should change the number of classes and the number of filters accordingly, right?
I have 8 classes, so filters = 65
Yes.
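For reference, the filters value discussed here follows the (classes+5)*5 rule quoted in the cfg comment at the top of the thread. A small Python sketch of that arithmetic is below; the `region_filters` helper is hypothetical, and the defaults of 4 box coordinates and 5 anchors are assumptions that match the stock yolov2.cfg.

```python
def region_filters(classes: int, num_anchors: int = 5, coords: int = 4) -> int:
    """Filters for the last [convolutional] layer before [region].

    Each anchor predicts `coords` box values, 1 objectness score,
    and one score per class, so the total is (classes + coords + 1) * num_anchors.
    The defaults (5 anchors, 4 coords) are assumed from the stock yolov2.cfg.
    """
    return (classes + coords + 1) * num_anchors


print(region_filters(8))   # 8 classes, as in this thread
print(region_filters(80))  # full 80-class COCO
```

With 8 classes this gives (8 + 5) * 5 = 65, which matches the value in the question. If the number of anchors in the [region] section is changed, the filters value has to change with it.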
@AlexeyAB One more thing, since I've already run around 15,000 iterations with learning rate 0.0001, can I use the weights from the last iteration and train it with the cfg file you linked to above?
Or do I need to run it from the beginning?
Thanks
It is better to train from the beginning.