I am training tiny-yolo to detect custom objects following the instructions in the README. However, from the very first iteration the reported learning rate is zero:
Region 16 Avg IOU: 0.391107, Class: 0.504761, Obj: 0.568887, No Obj: 0.499483, .5R: 0.253968, .75R: 0.047619, count: 63
Region 23 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.514100, .5R: nan, .75R: nan, count: 0
Region 16 Avg IOU: 0.328728, Class: 0.514288, Obj: 0.592841, No Obj: 0.498886, .5R: 0.086207, .75R: 0.000000, count: 58
Region 23 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.514350, .5R: nan, .75R: nan, count: 0
1: 412.725830, 412.725830 avg loss, 0.000000 rate, 548.818333 seconds, 128 images
Loaded: 0.001458 seconds
In my cfg file the setup is the following:
batch=128
subdivisions=2
width=480
height=480
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
To train it I have typed in the terminal:
./darknet detector train data/obj.data yolov3-tiny_obj.cfg yolov3-tiny.conv.15
Shouldn't the learning rate start to decay later on?
This is normal.
That is how burn_in=1000 works: https://github.com/AlexeyAB/darknet/blob/026a679dedbc45486a11ecabf56c898372d8cac5/src/network.c#L94
Thank you! Could you share a link that explains why this is needed?
Another question: is there any way of plotting the validation loss (or simply to print it on the terminal) during training?
@AlexeyAB What is burn_in here? And what do policy and steps do? Please explain.
@Eyshika, I found this on another post.
Burn-in lowers the learning rate at the beginning of training (until iteration 1000, or whatever this property specifies). Instead of the configured learning rate (lr), the rate is lr*(batch_num/burn_in)^4, so it starts near zero and rises to the specified lr by the end of the burn-in period.
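To make the schedule concrete, here is a small Python sketch of how the current rate could be computed from the cfg above (learning_rate=0.001, burn_in=1000, policy=steps, steps=400000,450000, scales=.1,.1). This is an illustration based on the formula quoted above, not darknet's actual API; the exponent 4 is darknet's default `power`.

```python
LEARNING_RATE = 0.001
BURN_IN = 1000
POWER = 4                     # darknet's default burn-in exponent
STEPS = [400000, 450000]
SCALES = [0.1, 0.1]

def current_rate(batch_num: int) -> float:
    """Learning rate at a given iteration (batch_num), as a sketch of
    burn-in plus the steps policy."""
    if batch_num < BURN_IN:
        # Burn-in: lr * (batch_num / burn_in)^4 ramps from ~0 up to lr.
        return LEARNING_RATE * (batch_num / BURN_IN) ** POWER
    rate = LEARNING_RATE
    # Steps policy: multiply by each scale once its step is passed.
    for step, scale in zip(STEPS, SCALES):
        if batch_num >= step:
            rate *= scale
    return rate

print(current_rate(1))       # ~1e-15, which the training log rounds to 0.000000
print(current_rate(1000))    # full rate 0.001 once burn-in ends
print(current_rate(400000))  # ~0.0001 after the first step
```

This is why the first log line shows `0.000000 rate`: at iteration 1 the burn-in factor (1/1000)^4 makes the rate vanishingly small, and the log prints it with six decimal places.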
@git-sohib thank you for this answer. Can you also explain what the advantage of doing so is?
@elenina5 refer to this post
@elenina5 @Eyshika @git-sohib
new_weight = existing_weight - learning_rate * gradient = existing_weight - update_weights, from this article: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
For each iteration, update_weights = gradient * learning_rate.
Our goal is to find the closest local minimum as fast as possible, because we assume these are the optimal weights and we want to accelerate training. It's just our assumption.
The advantage of burn_in is that when you start training, the local minimum (optimal weights) for your dataset can be very close to, but slightly different from, the pre-trained weights you use, so gradient != 0. If we immediately used the full learning_rate=0.001, we could change the weights so much that we jump past this local minimum (optimal weights) and lose it.
Therefore we start moving very slowly with learning rate = 0.00000...01, so update_weights = 0.00000...01 != 0, and we increase the learning rate at each step. Once the weights reach the optimal state (local minimum), gradient ~= 0, so update_weights = gradient * learning_rate = 0 * learning_rate ~= 0 for any learning rate. From then on the weights will barely change even when the learning rate has ramped up to 0.001, as long as there are no more optimal weights in the vicinity [-0.001, +0.001].
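The argument above can be illustrated with a toy 1-D example (this is just an illustration, not darknet code): a quadratic loss L(w) = (w - w_opt)^2 with gradient 2*(w - w_opt), where the "pre-trained" weight starts close to the optimum. We ramp the learning rate burn-in style, and verify that once w sits at the minimum, update_weights = gradient * learning_rate is ~0 even at the full rate.

```python
def gradient(w, w_opt=1.0):
    # Gradient of the toy loss L(w) = (w - w_opt)^2
    return 2.0 * (w - w_opt)

w_opt = 1.0
w = 1.01          # "pre-trained" weight, already near the optimum

# Burn-in: start with a tiny learning rate and ramp it toward 0.001.
lr = 1e-8
for step in range(1, 1001):
    update = gradient(w) * lr          # update_weights = gradient * learning_rate
    w = w - update
    lr = 0.001 * (step / 1000) ** 4    # ramp lr up to the full rate

# Continue at the full learning rate; w settles into the minimum.
for _ in range(5000):
    w = w - gradient(w) * 0.001

# Near the minimum, gradient ~= 0, so the update is ~0 for any lr.
print(w, gradient(w) * 0.001)
```

Because the small early steps never overshoot, the weight creeps into the nearby minimum; afterwards the gradient itself is nearly zero, so raising the learning rate no longer moves the weights.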