Darkflow: convergence is not decreasing after 3

Created on 14 Sep 2018 · 18 Comments · Source: thtrieu/darkflow

I have a dataset of around 7000 images with 12 classes. During training, the loss decreased to around 3 by 1500 iterations, but from then until 3500 iterations there was no notable change. Can anyone help me with this problem?

hashirali2604

Most helpful comment

@mpky I ran into a similar issue but resolved it by first overfitting on a subset of images (3-5, or the minimum number to get an instance of all classes). After 1000 - 2000 epochs on five images, the net correctly detected the objects with ~0.9 confidence. After getting that confidence, I began training on the entire dataset and it is working like a charm. hope this helps!

rockstardotb on 21 Jun 2019

👍3
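
One way to pick "the minimum number to get an instance of all classes" is a greedy cover over the annotations. The sketch below is only an illustration: the function name and the toy image/class mapping are hypothetical, and greedy selection gives a small subset rather than a provably minimal one.

```python
def small_covering_subset(image_classes):
    """Greedily pick images until every class appears at least once.

    image_classes maps an image name to the set of class names annotated in it.
    Greedy set cover is an approximation: the result is small, not guaranteed minimal.
    """
    remaining = set().union(*image_classes.values()) if image_classes else set()
    chosen = []
    while remaining:
        # Take the image that covers the most classes still missing.
        best = max(image_classes, key=lambda name: len(image_classes[name] & remaining))
        chosen.append(best)
        remaining -= image_classes[best]
    return chosen


# Hypothetical toy annotations (image name -> classes present in that image).
toy = {
    "img_001.jpg": {"car", "person"},
    "img_002.jpg": {"person"},
    "img_003.jpg": {"dog", "bike"},
    "img_004.jpg": {"car", "dog"},
}
subset = small_covering_subset(toy)
print(subset)
```

The selected images can then be copied, together with their annotation files, into separate folders for the overfitting run before switching back to the full dataset.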

All 18 comments

@hashirali2604, you can try the following:

Start with a higher learning rate than usual, e.g. 1e-03 or 1e-04.
Stop the training at the point where the loss curve flattens.
Start the training again from the last checkpoint with a lower learning rate (e.g. 1e-05).

Remember to pass the --summary parameter when starting the training so that logs are written for TensorBoard and you can monitor the loss curve.

In my case, this caused a really nice drop of the loss function.

k-lyda on 14 Sep 2018

👍1
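
The two-phase schedule suggested above could be scripted roughly as follows. This is a sketch under assumptions: the model and weights paths are borrowed from a command posted later in this thread, `--load -1` is assumed to resume from the most recent checkpoint, and the value given to `--summary` is assumed to be a log directory. The helper name `flow_command` is made up for this example.

```python
def flow_command(load, lr, summary_dir="summary/"):
    """Build a darkflow training command as a list of arguments."""
    return [
        "python", "flow",
        "--model", "cfg/tiny-yolo-voc-1c.cfg",   # path taken from a later post, adjust as needed
        "--load", str(load),
        "--train",
        "--annotation", "annotations",
        "--dataset", "images",
        "--lr", str(lr),
        "--summary", summary_dir,                # assumed: directory for TensorBoard logs
    ]


# Phase 1: start from pretrained weights with the higher learning rate.
phase_1 = flow_command(load="bin/tiny-yolo-voc.weights", lr=1e-4)
# Phase 2: once the loss curve flattens, resume from the last checkpoint with a lower rate.
phase_2 = flow_command(load=-1, lr=1e-5)

for cmd in (phase_1, phase_2):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` from the darkflow checkout, or the printed line can simply be pasted into a shell.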

@k-lyda can you tell me exactly where I can change the learning rate? I thought it should be changed in the model's .cfg file, but after changing it there, training still shows 1e-05...

hashirali2604 on 14 Sep 2018

@hashirali2604 there is a parameter lr - you can pass it either via the CLI (e.g. --lr 0.0001) or in Python code. All possible variables are listed here:
https://github.com/thtrieu/darkflow/blob/master/darkflow/defaults.py

k-lyda on 14 Sep 2018
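
For the "in Python code" route, a minimal sketch might look like the following. It assumes darkflow's `TFNet` class accepts an options dictionary whose keys mirror the CLI flags; the paths, batch size, and epoch count are placeholders taken from elsewhere in this thread, not recommended values.

```python
# Options mirror the command-line flags; "lr" is the learning rate discussed above.
options = {
    "model": "cfg/tiny-yolo-voc-1c.cfg",    # placeholder path
    "load": "bin/tiny-yolo-voc.weights",    # placeholder path
    "train": True,
    "annotation": "annotations",
    "dataset": "images",
    "lr": 1e-4,
    "batch": 16,
    "epoch": 300,
}


def train(opts):
    """Build the network and start training (requires darkflow to be installed)."""
    # Imported lazily so the options can be inspected without darkflow present.
    from darkflow.net.build import TFNet
    tfnet = TFNet(opts)
    tfnet.train()


# train(options)  # uncomment on a machine with darkflow and the data in place
```

Setting the rate here (or via --lr) rather than in the .cfg file is consistent with the observation above that editing the .cfg did not change the rate shown during training.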

@k-lyda thanks for the help... <3 I will update you soon

hashirali2604 on 14 Sep 2018

@k-lyda hey man... I have changed my dataset as well as the annotations, and my dataset now has more than 16,000 images with 11 classes. But now, at a learning rate of 1e-03, my loss and ave loss drop to the order of 1e-8 within just 1 epoch, and when I evaluate the model it shows no detections even at a 0.01 threshold... What should I do?
Also, I have now started training at 1e-05, and after 1200 iterations the ave loss is 0.0098.

hashirali2604 on 22 Sep 2018

@hashirali2604, one question: have you modified the cfg file so that the last convolutional layer has the proper number of filters? If you have 11 classes, it should be num*(11+5), where num is set in the last section, [region].

k-lyda on 23 Sep 2018

@k-lyda, yes, I have modified that, and it is 5*(11+5)=80 filters in my case, as num is 5 by default.

hashirali2604 on 23 Sep 2018
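
The filter count from the exchange above, written out as a quick check. The function name is made up for this example; the "+5" in the posts is taken here to be the 4 box coordinates plus 1 objectness score.

```python
def region_filters(classes, num=5, coords=4):
    """Filters needed in the last convolutional layer: num * (classes + coords + 1)."""
    return num * (classes + coords + 1)


print(region_filters(11))  # 11 classes, as in the post above
print(region_filters(1))   # single-class case
```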

Hi,
I am using tiny-yolo-voc.cfg and tiny-yolo-voc.weights. I have 100 images, and each image has 39 small objects (4088 instances). I have created XML annotations for these 100 images, changed the number of classes to 1, as there is only one object class in all these images, and changed the number of filters to 30 in the last [convolutional] layer.
I tried changing the learning rate and batch size, but the loss and the moving average loss are not decreasing even after 1500 epochs.
Can you please suggest something to help me decrease the loss?
Thanks in advance.

The config file:
[net]
batch=64
subdivisions=8
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
max_batches = 40100
policy=steps
steps=-1,100,20000,30000
scales=.1,10,.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

#

[convolutional]
batch_normalize=1
size=3
stride=1
pad=1
filters=1024
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=1
coords=4
num=5
softmax=1
jitter=.2
rescore=1

object_scale=5
noobject_scale=1
class_scale=1
coord_scale=1

absolute=1
thresh = .2
random=1

Command:
python flow --model cfg/tiny-yolo-voc-1c.cfg --load bin/tiny-yolo-voc.weights
--train --annotation annotations --dataset images --gpu 1 --epoch 300

I would really appreciate quick help, because I have tried everything over the last week and it's not working.

BlackCode101 on 26 Nov 2018
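
A config like the one posted above can be sanity-checked with a small script. This is a sketch, not part of darkflow: the parser is a hand-rolled one (section names repeat, so the standard configparser module does not fit), and it assumes the last [convolutional] section is the one feeding [region]. Only the tail of the posted config is embedded as sample input.

```python
def parse_cfg(text):
    """Parse a Darknet-style cfg into a list of (section_name, options) pairs."""
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def check_last_layer(sections):
    """Return (actual_filters, expected_filters) for the layer before [region]."""
    region = next(opts for name, opts in reversed(sections) if name == "region")
    last_conv = next(opts for name, opts in reversed(sections) if name == "convolutional")
    expected = int(region["num"]) * (int(region["classes"]) + int(region["coords"]) + 1)
    return int(last_conv["filters"]), expected


# Tail of the config posted above.
sample = """
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear

[region]
classes=1
coords=4
num=5
"""
actual, expected = check_last_layer(parse_cfg(sample))
print(actual, expected)
```

For the posted config the two numbers agree, so the filter count itself does not look like the cause of the stalled loss.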

@BlackCode101 ever get an answer?

mpky on 16 Mar 2019

@BlackCode101 can you share your solution if you solved it?

DomagojJaksic on 31 Mar 2019

@DomagojJaksic increasing the learning rate and massively increasing the number of epochs got my loss to decrease. However, when I went to use my new weights, they weren't detecting anything, so I am going to attempt a re-run when I have the time. I thought I also saw somewhere that others were having similar issues with the tiny-yolo weights, but alas...

mpky on 31 Mar 2019

rockstardotb on 21 Jun 2019

👍3

@rockstardotb This looks like a very good idea

anantguptadbl on 27 Oct 2019

@mpky I ran into a similar issue but resolved it by first overfitting on a subset of images (3-5, or the minimum number to get an instance of all classes). After 1000 - 2000 epochs on five images, the net correctly detected the objects with ~0.9 confidence. After getting that confidence, I began training on the entire dataset and it is working like a charm. hope this helps!

I am training a ball detector on YOLO v2. I have done the first step of overtraining the model on a small subset, and I am now training on my full dataset of 700 pictures. The problem is that my loss is not going down significantly. It started out at 2.5 and is now fluctuating between 0.3 and 1.5 after about 65 epochs. How many epochs did it take for you to get a decent result?

SW0BBR on 15 Dec 2019

What are your batch size and learning rate? They should be 16 and 1E-4, respectively. If the loss is now fluctuating between 0.3 and 1.5, you may need to stop and restart training from the last checkpoint with a lower learning rate (i.e., 1E-5). It’s been over a year since I worked on this particular project. I trained on a single class and, if I remember correctly, I had a good model somewhere between 800 and 1200 epochs. Note that my dataset was very large, approximately 12000 images. One last question: did you change the .cfg file to make the last layer a single class?

rockstardotb on 15 Dec 2019
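
Posts in this thread quote progress sometimes in iterations and sometimes in epochs, so a rough conversion can help when comparing numbers. The sketch below assumes one training step consumes one batch and that a partial final batch is dropped; the function names are made up for this example.

```python
def steps_per_epoch(num_images, batch_size):
    """Number of full batches in one pass over the dataset."""
    return num_images // batch_size


def total_steps(num_images, batch_size, epochs):
    """Training steps taken after the given number of epochs."""
    return steps_per_epoch(num_images, batch_size) * epochs


# 700 images and about 65 epochs, as in the question above.
print(total_steps(700, 8, 65))    # with a batch size of 8
print(total_steps(700, 16, 65))   # with a batch size of 16
```

With a dataset this small, 65 epochs amounts to only a few thousand steps, far fewer than the 800 to 1200 epochs on roughly 12000 images mentioned above.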

My batch size is 8 and my learning rate is 1e-05. I'll change them and see if that makes any difference. And yes, I have changed my cfg to work for one class. I'll keep you updated, thanks for the advice!

SW0BBR on 15 Dec 2019

I edited the learning rate and batch size. Now I run into the problem that my loss turns into NaN after a couple of steps. Any ideas?

SW0BBR on 15 Dec 2019

Sounds like the learning rate is still too high. I’d try making it smaller. You may try restarting from the checkpoint where you overfitted and use a learning rate of 1E-5 instead of 1E-4. Keep batch size at 16.

rockstardotb on 15 Dec 2019
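
The rule of thumb running through this thread (lower the learning rate tenfold when the loss goes NaN or stops improving) can be written down as a small helper. This is only an illustrative sketch: the function name, the window size, and the 1% improvement threshold are arbitrary assumptions, and the loss values would have to be collected from the training log by hand.

```python
import math


def next_learning_rate(losses, lr, window=50, min_rel_improvement=0.01, factor=0.1):
    """Suggest a learning rate for the next run given the recorded loss values."""
    # Divergence: any NaN or infinite loss means the rate was too high.
    if any(not math.isfinite(value) for value in losses):
        return lr * factor
    # Not enough history to compare two windows yet.
    if len(losses) < 2 * window:
        return lr
    previous = sum(losses[-2 * window:-window]) / window
    recent = sum(losses[-window:]) / window
    # Plateau: the recent average barely improved on the one before it.
    if previous - recent < min_rel_improvement * abs(previous):
        return lr * factor
    return lr


print(next_learning_rate([2.5, 1.9, float("nan")], lr=1e-4))
```

In either case the next run would restart from an earlier checkpoint, for example the one saved after the overfitting stage, rather than from the diverged weights.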
