Darknet: Learning rate at zero from the first iteration

Created on 22 Jun 2018 · 7 comments · Source: AlexeyAB/darknet

I am training tiny-yolo to detect custom objects, following the instructions given in the README. Nevertheless, from the very first iteration the reported learning rate is zero:

Region 16 Avg IOU: 0.391107, Class: 0.504761, Obj: 0.568887, No Obj: 0.499483, .5R: 0.253968, .75R: 0.047619,  count: 63
Region 23 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.514100, .5R: nan, .75R: nan,  count: 0
Region 16 Avg IOU: 0.328728, Class: 0.514288, Obj: 0.592841, No Obj: 0.498886, .5R: 0.086207, .75R: 0.000000,  count: 58
Region 23 Avg IOU: nan, Class: nan, Obj: nan, No Obj: 0.514350, .5R: nan, .75R: nan,  count: 0

 1: 412.725830, 412.725830 avg loss, 0.000000 rate, 548.818333 seconds, 128 images
Loaded: 0.001458 seconds

In my cfg file the setup is the following:

batch=128
subdivisions=2
width=480
height=480
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1
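For context on the policy=steps lines: each value in steps is an iteration at which the base learning rate is multiplied by the matching entry in scales. A minimal sketch of that mapping (illustrative Python, not darknet's source; the function name is made up, and the defaults are taken from the cfg above):

```python
def steps_policy_rate(iteration, base_lr=0.001,
                      steps=(400000, 450000), scales=(0.1, 0.1)):
    """Return the learning rate after applying every step reached so far."""
    lr = base_lr
    for step, scale in zip(steps, scales):
        if iteration >= step:
            lr *= scale
    return lr

print(steps_policy_rate(1))        # 0.001  (no step reached yet)
print(steps_policy_rate(400000))   # ~0.0001
print(steps_policy_rate(450000))   # ~1e-05
```

So with this cfg the decay only kicks in at iterations 400000 and 450000; it does not explain a zero rate at iteration 1.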

To train it I have typed in the terminal:
./darknet detector train data/obj.data yolov3-tiny_obj.cfg yolov3-tiny.conv.15

Shouldn't the learning rate start to decay later on?

All 7 comments

Thank you! But may I ask for a link explaining why it is needed?
Another question: is there any way of plotting the validation loss (or simply printing it to the terminal) during training?

@AlexeyAB What is burn_in here? And the policy and steps? Please explain.

@Eyshika, I found this in another post.
Burn-in lowers the learning rate at the beginning of training (until iteration 1000, or whatever is specified in this property). Instead of the given learning rate (lr), it uses lr * (batch_num / burn_in)^4, so it starts low and increases to the specified lr at the end of the burn-in period.

@git-sohib thank you for this answer. Can you also explain to us what the advantage of doing so is?

@elenina5 refer to this post

@elenina5 @Eyshika @git-sohib

new_weight = existing_weight - learning_rate * gradient = existing_weight - update_weights, from this article: https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10

For each iteration, update_weights = gradient * learning_rate.

Our goal is to find the closest local minimum as fast as possible, because we assume those are the optimal weights and we want to accelerate training. It's just our assumption.

The advantage of burn_in is that, when you start training, the local minimum (optimal weights) for your dataset can be very close to, but slightly different from, the pre-trained weights you use, so gradient != 0. If we use the full learning_rate=0.001 from the start, we can change the weights so much that we move far away from this local minimum (optimal weights) and lose it.

Therefore we start moving very slowly, with learning rate = 0.00000...01, so update_weights = 0.00000...01 != 0, and we increase the learning rate at each step. Once the weights reach the optimal state (local minimum), gradient ~= 0, so update_weights = gradient * learning_rate ~= 0 for any learning_rate. After that the weights will not change even when the learning rate grows up to 0.001: once the local minimum (optimal weights) is reached, the weights stay put even with a high learning rate, as long as there are no more-optimal weights in the vicinity [-0.001, +0.001].
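The update rule above can be illustrated on a toy 1-D loss (a sketch, not darknet code; the quadratic loss and its optimum are made up to stand in for a network starting near pre-trained weights):

```python
def train(w, optimum=3.0, base_lr=0.001, burn_in=1000, iterations=2000):
    """Gradient descent with a burn-in warmup on the loss (w - optimum)**2."""
    for batch_num in range(1, iterations + 1):
        gradient = 2.0 * (w - optimum)                    # d/dw of (w - optimum)**2
        lr = base_lr * min(1.0, (batch_num / burn_in) ** 4)
        w -= lr * gradient                                # update_weights = lr * gradient
    return w

w_final = train(3.1)   # start near the optimum, as with pre-trained weights
print(w_final)         # drifts slowly toward 3.0; as gradient -> 0, updates -> 0
```

Because the warmup keeps early updates tiny, the weight creeps toward the nearby optimum instead of overshooting it, which is the behavior the comment describes.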

