Hello,
Please look at the picture. What is the reason for this?

@VanitarNordic
Remove your fork of Darknet and re-fork it to get all the latest commits.
There was a training bug:
Commit on Jan 10, 2017 (@AlexeyAB): "fixed bug: rand() for batches of images"
It can also be a bad training dataset or a bad .cfg file.
What are the usual requirements for the training/validation images?
I mean, how should the target objects look in the training images? Should all of them be 100% clear, or is it fine if some target objects are partially covered by other, non-target objects?
@VanitarNordic
Bingo, thanks.
So, by the way, if we decided to detect a banana for example, then a good training image would be a banana in the hand of a person, not a single banana on a white background, wouldn't it?
In my case, with the condition shown above, the situation was solved when I changed the height and width in the .cfg file from 448 x 448 to 416 x 416. A bit strange.
@VanitarNordic
Yes, different positions of the object on different backgrounds are the better solution.
Great, thanks, I got my answer.
Please also reply to my question about the competition results. I have been waiting for your reply for a long time (first open issue from the bottom).
What is the problem with this:
Region Avg IOU: 0.000145, Class: 1.000000, Obj: 0.026601, No Obj: 0.073908, Avg Recall: 0.000000, count: 12
Region Avg IOU: 0.016246, Class: 1.000000, Obj: 0.082520, No Obj: 0.084514, Avg Recall: 0.000000, count: 12
Region Avg IOU: 0.018083, Class: 1.000000, Obj: 0.083319, No Obj: 0.074546, Avg Recall: 0.000000, count: 12
217: 22914277376.000000, 2292594176.000000 avg, 0.005000 rate, 3.729000 seconds, 13888 images
Loaded: 0.000000 seconds
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.025444, Avg Recall: 0.000000, count: 13
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.025746, Avg Recall: 0.000000, count: 13
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.026183, Avg Recall: 0.000000, count: 13
@OPPOA113
Does -1.#IND00 appear sometimes or always? Can you show your .cfg file?
@AlexeyAB
1. It starts at iteration 217 and continues to the end, and the loss is extremely large: 22914277376.000000, 2292594176.000000 avg
2. Only two places were changed for training 1 class:
learning_rate=0.0001
max_batches = 21000
policy=steps
steps=100,1000,1500,10000
scales=10,.1,.1,.1
.......
filters=30
activation=linear
[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=1
coords=4
......
3. Training cmd:
darknet.exe detector train ./data/voc.data yolo-voc.cfg darknet19_448.conv.23 > training1_2.log
What is the general cause of this situation? I did everything according to your instructions, but there are still some problems.
@OPPOA113
It is very strange, because if your initial learning_rate=0.0001 and you set scales=10,.1,.1,.1, then the learning rate can only be 0.001, 0.0001, 0.00001 or 0.000001.
But in your output, rate = 0.005:
217: 22914277376.000000, 2292594176.000000 avg, 0.005000 rate, 3.729000 seconds, 13888 images
Loaded: 0.000000 seconds
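For reference, here is a minimal standalone sketch (an illustration, not code from the repository) of how a "steps" learning-rate policy combines learning_rate, steps and scales: each time the iteration count passes a step, the rate is multiplied by the corresponding scale, cumulatively, so with these settings a rate of 0.005 can never appear.

```
#include <stdio.h>

float current_rate(float learning_rate, const int *steps, const float *scales,
                   int num_steps, int iteration)
{
    float rate = learning_rate;
    int i;
    for (i = 0; i < num_steps; ++i) {
        if (steps[i] > iteration) break;   /* this step not reached yet */
        rate *= scales[i];                 /* cumulative scaling */
    }
    return rate;
}

int main(void)
{
    int   steps[]  = { 100, 1000, 1500, 10000 };
    float scales[] = { 10, .1f, .1f, .1f };
    int   test[]   = { 217, 1200, 2000, 20000 };
    int   i;
    /* with learning_rate=0.0001 this prints 0.001, 0.0001, 1e-05, 1e-06 --
     * 0.005 is impossible with these steps/scales */
    for (i = 0; i < 4; ++i)
        printf("iteration %5d -> rate %g\n", test[i],
               current_rate(0.0001f, steps, scales, 4, test[i]));
    return 0;
}
```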
And yes, the loss is extremely large: 22914277376.000000
The loss (error) for a batch is calculated here, return (float)sum/(n*batch);: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L280
The loss (error) for one image is calculated here, only for the last layer-30, if(net.layers[i].cost){: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L183
Can you show your .cfg-file (you can use http://pastebin.com/index.php)? Also, add these printf calls before the two return statements, train again, and post the output:
    printf("net.n = %d, count = %d, sum = %f \n", net.n, count, sum);
    return sum/count;
}
    printf(" n = %d, batch = %d, sum = %f \n", n, batch, sum);
    return (float)sum/(n*batch);
}
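For context, a minimal standalone illustration (not repository code): the per-image costs are summed and divided by n*batch, so a single NaN cost poisons the whole batch loss, which the MSVC runtime prints as -1.#IND00.

```
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* pretend per-image costs from one batch; one image produced NaN */
    float cost[8] = { 151.2f, 145.1f, 143.0f, NAN, 146.3f, 143.6f, 148.9f, 145.2f };
    float sum = 0;
    int i;
    for (i = 0; i < 8; ++i) sum += cost[i];   /* NaN propagates through the sum */
    printf("batch loss = %f\n", sum / 8);     /* prints nan / -1.#IND00 */
    return 0;
}
```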
@OPPOA113
Also, show both files by using http://pastebin.com/index.php:
./data/voc.data and yolo-voc.cfg
And did you prepare an image dataset that has only 1 class, or more than 1 class?
@AlexeyAB
Thank you very much for your answer, but the problem has not been solved.
It now appears at the 718th iteration in the log file.
log:
Region Avg IOU: 0.123613, Class: 1.000000, Obj: 0.009103, No Obj: 0.017152, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 151.270279
Region Avg IOU: 0.141605, Class: 1.000000, Obj: 0.012890, No Obj: 0.021402, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 145.112289
Region Avg IOU: 0.159398, Class: 1.000000, Obj: 0.019039, No Obj: 0.023729, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 143.040283
Region Avg IOU: 0.134866, Class: 1.000000, Obj: 0.012214, No Obj: 0.021009, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 146.331787
Region Avg IOU: 0.156597, Class: 1.000000, Obj: 0.017798, No Obj: 0.023424, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 143.654922
Region Avg IOU: 0.125920, Class: 1.000000, Obj: 0.009804, No Obj: 0.017751, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 148.915298
Region Avg IOU: 0.136872, Class: 1.000000, Obj: 0.012135, No Obj: 0.020786, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 145.199615
Region Avg IOU: 0.146237, Class: 1.000000, Obj: 0.013695, No Obj: 0.021240, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 144.731476
n = 8, batch = 8, sum = 1168.255859
717: 18.253998, 253.565598 avg, 0.001000 rate, 3.728000 seconds, 45888 images
Loaded: 0.000000 seconds
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011391, No Obj: 0.010609, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011165, No Obj: 0.010613, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011104, No Obj: 0.010470, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011054, No Obj: 0.010635, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011532, No Obj: 0.010739, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011197, No Obj: 0.010464, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.010470, No Obj: 0.010762, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011000, No Obj: 0.010438, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
n = 8, batch = 8, sum = -1.#IND00
718: -1.#IND00, -1.#IND00 avg, 0.001000 rate, 3.885000 seconds, 45952 images
Loaded: 0.000000 seconds
D:\YOLO\ZEM_trainzem_data\Image\0949.jpg
D:\YOLO\ZEM_trainzem_data\Image\1266.jpg
D:\YOLO\ZEM_trainzem_data\Image\1869.jpg
D:\YOLO\ZEM_trainzem_data\Image\0882.jpg
D:\YOLO\ZEM_trainzem_data\Image\1844.jpg
.........
yolo-voc.cfg:
learning_rate=0.0001
max_batches = 41000
policy=steps
steps=100,25000,35000
scales=10,.1,.1,
...
pad=1
filters=30
activation=linear
[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=1
coords=4
voc.data:
classes= 1
train = D:/YOLO/ZEM_train/zem_data/voc_person_train.txt
valid = D:/YOLO/ZEM_train/zem_data/person_val.txt
names = D:/YOLO/ZEM_train/zem_data/pasacal.names
backup = D:/YOLO/ZEM_train/backup/
training cmd:
darknet.exe detector train D:/YOLO/ZEM_train/zem_data/voc.data D:/YOLO/ZEM_train/yolo-voc.cfg D:/YOLO/ZEM_train/darknet19_448.conv.23 > training2.log
testing cmd:
darknet.exe detector test D:/YOLO/ZEM_train/zem_data/voc.data D:/YOLO/ZEM_train/yolo-voc.cfg D:/YOLO/ZEM_train/backup/yolo-voc_600.weights -i 0 -thresh 0.058124
Nothing was detected......
There is a problem with your dataset.
Besides, the threshold should be a value between 0.1 and 0.9 (default 0.25).
Rename the root folder containing your images to obj (as mentioned in the description) and train again.
@VanitarNordic @AlexeyAB
Dataset? Which one do you mean?
Thresholds between 0.1 and 0.9 give the same result: nothing is detected...
My file structure is shown in Figure:



Follow the author's instructions and put your images in a folder with the name he mentioned: change "zem_data" to "obj" and see the difference.
@OPPOA113
Try to update darknet from my fork, last commit fix some bug with rand() on Windows: https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548
Try to open your dataset by using Yolo-mark: https://github.com/AlexeyAB/Yolo_mark
Look at each image by pressing the space bar, and at the end press ESC.
Are all bounding boxes correct?
Try to uncomment this line printf("grp: %s\n", paths[index]);: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L56
for(i = 0; i < n; ++i){
    int index = rand()%m;
    random_paths[i] = paths[index];
    //if(i == 0) printf("%s\n", paths[index]);
    printf("grp: %s\n", paths[index]);
}
Try to train again, and post the output when the error occurs.
Also, you don't need to train all 45000 iterations to make some tests: train for 1000 iterations, interrupt it, and just check whether everything is correct by running some detections, although with lower accuracy.
@AlexeyAB
Finally, it works. Thank you.
I also have a question about small-target detection boxes: there is some offset in the results. What is the main problem, and where should I fix it? Should I add more small-target training images, adjust some parameters, or even change the network? The test results using the official model are shown below.
Please give me some advice. Thank you.

Try to change height=608 and width=608 in your cfg-file: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/cfg/yolo-voc.cfg
And use it with the same trained weights file for detection on the same image. Show the resulting image here.
@AlexeyAB
Much better now, thank you.
@AlexeyAB
What makes the difference between running a training command by executing a .cmd file (for example train.cmd), writing the command in the console, or running train.cmd from inside the console?
When I execute the train.cmd file directly, it causes a -nan problem between iterations 300-400, as in the first picture in this issue, BUT when I start training by writing the command in the console, or by executing train.cmd inside the console, it trains flawlessly to the end without any problem!
This is the command: darknet.exe detector train data/obj.data yolo-obj.cfg darknet19_448.conv.23
I have no idea!
@VanitarNordic There is no difference; they must behave the same. Perhaps the image dataset is wrong and the error occurs randomly.
@AlexeyAB
Not really, everything is correct with the dataset. If it were a wrong dataset, why can I train it for a very long time, to 6000 iterations or more? It only happens when I execute train.cmd directly. When it is about to happen, I see recall=0 in the iterations before the -nan appears. Besides, the dataset is really too small to have unseen errors: it only contains 200 images. Impossible.
@VanitarNordic This is very strange. Can you reproduce this failure many times?
Maybe some dll-files or other .data/.cfg/.weights files are taken from different paths when you click train.cmd.
This is very strange. Can you reproduce this failure many times?
No, I cannot.
Now I tested again and it did not happen. As I said, the dataset is quite small, I annotated it with yolo_mark, and I have tested it at least 20-30 times, but the only time it happened was when I ran train.cmd directly. It is impossible for the dataset to have errors, and even if it did, I should not be able to train it for several thousand iterations.
@VanitarNordic Also, rarely, NaN can occur occasionally even with a well-formed dataset. I have also met such a failure, but re-training without any changes ended well.
I think that since it does not reproduce, it happened by accident, no matter how you started the training.
Did you update to the last commits, especially this fix from 2 days ago? https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548
There was a bug where a single batch loaded the same images 8 times; this worsened the training and increased the likelihood of such a failure.
Did you update to the last commits, especially this fix from 2 days ago?
No, I have not.
There was a bug where a single batch loaded the same images 8 times; this worsened the training and increased the likelihood of such a failure.
Yes, that could be the reason for it.
One question: do these bugs exist inside Darknet Yolo itself, or do they exist because of the modifications in the Windows distribution?
@VanitarNordic It was only in the Windows fork, because the rand() function works differently than on Linux.
@AlexeyAB
Let me ask this question here.
I downloaded the updated repository and tried to build, but I get one error (and also 1615 warnings):
LNK1104 cannot open file 'opencv_core249.lib'
It is strange, because I can compile the older repository without any problem.
@VanitarNordic
Do you use OpenCV 2.4.9 in the directory C:\opencv_2.4.9?
Check all the how-to-compile settings: https://github.com/AlexeyAB/darknet#how-to-compile
1. paths: C:\opencv_2.4.9\opencv\build\include & C:\opencv_2.4.9\opencv\build\x64\vc12\lib or vc14\lib
3.2 (right click on project) -> properties -> Linker -> General -> Additional Library Directories
The problem was there and is now solved: I had to change vc14 to vc12. It seems you had modified it. Would you please share your vc14 version within the repository?
@VanitarNordic Yes, thank you. Correct line 148 in darknet.vcxproj, not line 94.
Fixed: https://github.com/AlexeyAB/darknet/blame/master/build/darknet/darknet.vcxproj#L148
@AlexeyAB
I tried the modified code and trained again (the dataset is exactly the same). These are the results of training after 2000 iterations. It is strange that the validation result got worse:
1) Old repository, before the recent update: training = 0.04 avg, validation = 76%
2) New repository, after the recent update: training = 0.003 avg, validation = 67%
The validation result got worse by around 10%, which is significant!
Strange.
@VanitarNordic detector recall doesn't show error, it shows precision. Do you mean IoU?
The average training error is small, which is good.
@AlexeyAB
Yes, I mean IoU, which has decreased by about 10%, and I see the difference when I test on unseen pictures. 76% is definitely better than 67%.
@AlexeyAB
Identical dataset, 2000 iterations.
1) Old repository:

2) Updated repository:

@VanitarNordic It would be strange if the old bug improved the learning curve :)
What steps, scales, learning_rate and random parameters did you use?
Try to set steps=100,1200,1750 and train again; what will the IoU be? https://github.com/AlexeyAB/darknet/blob/master/cfg/yolo-voc.cfg#L17
Also, did you use cuDNN in both cases? (5.) https://github.com/AlexeyAB/darknet#how-to-compile
By default, CUDNN is disabled in the new repository.
@AlexeyAB
I used exactly the same cfg files, so everything is identical (including steps and so on). Yes, I enabled CUDNN in both cases, compiled, and generated darknet.exe.

I don't know why the training error decreased while the precision also decreased. Our goal should be to increase the IoU and Recall.
@VanitarNordic Yes, IoU and Recall should be increased.
Try to set float thresh = .2; in the validate_detector_recall() function and run detector recall again, without re-training: https://github.com/AlexeyAB/darknet/blob/47409529d0eb935fa7bafbe2b3484431117269f5/src/detector.c#L385
@VanitarNordic I'll test on my datasets, and if I get the same results, then maybe I'll think about leaving that bug in :)
@AlexeyAB
Yes, I had changed another part; I found it, modified it, and compiled again. This is the proof:

But oh my goodness, now the IoU has decreased even further! RPs/Img has changed.

@AlexeyAB
I did some tests again. I copied an old .weights file from the old repository folder and tested it with the binaries compiled from the new repository. The results are as good as the old ones.
Therefore, we can conclude that there is a problem in training with the new repository and the resulting .weights file.
@VanitarNordic Yes. Also, try validating the old weights with float thresh = .2; and show the result: RPs, IoU and Recall.
I'm trying to figure out what's going on.
@AlexeyAB
I did. Old weights, using the new repository with float thresh = .2;

@AlexeyAB
After many, many experiments, I found a solution to this. Do you want me to update the code as a contributor, or share it here?
The problem was your method of seeding srand.
@VanitarNordic
As you prefer: right here, or create a pull request.
What are the details of the error?
@AlexeyAB
There is no error in your code, but I updated the method of seeding srand.
The rand() function in C is not as good as its successors in C++, and we have to seed it with srand(). time(0) is a good seed for that, but it can be improved a bit with a different method.
This article helped me to improve the seeding: http://www.eternallyconfuzzled.com/arts/jsw_art_rand.aspx
Just using srand(time(0)) will work, but using this extra function makes it better:
unsigned time_seed()
{
    time_t now = time(0);
    unsigned char *p = (unsigned char *)&now;   /* cast so the bytes of time_t can be read */
    unsigned seed = 0;
    for (size_t i = 0; i < sizeof now; i++)
    {
        seed = seed * (UCHAR_MAX + 2U) + p[i];
    }
    return seed;
}

char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    srand(time_seed());
    //printf("n = %d \n", n);
    for (i = 0; i < n; ++i) {
        int index = rand() % m;
        random_paths[i] = paths[index];
        //if(i == 0) printf("%s\n", paths[index]);
        //printf("grp: %s\n", paths[index]);
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
@VanitarNordic
But in my commit-fix, I use srand(time(0)) only as the initial seed, and each subsequent time I use srand(mt_seed); https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548#diff-2ceac7e68fdac00b370188285ab286f7R50
This is done because the function char **get_random_paths(char **paths, int n, int m) is called from 8 threads simultaneously: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L780
But time(0) returns the time accurate only to seconds: http://en.cppreference.com/w/c/chrono/time
So if we use srand(time(0)) or any variant of srand(time_seed()), then all 8 threads get the same seed value because they execute in parallel; a batch then uses 8 times fewer distinct images, but each image 8 times.
So your fix would again use srand(time(0)) as before my fix, maybe with a better time function time_seed(). Then again all 8 threads will get the same value, and a batch will use 8 times fewer distinct images, but each image 8 times: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L780
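To illustrate the point, here is a small standalone demo (hypothetical, not part of Darknet): two workers that seed with the same time(0) value draw exactly the same image indices, which is why a batch built from 8 such threads repeats the same images.

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    unsigned seed = (unsigned)time(0);  /* both workers start within the same second */
    const int m = 7587;                 /* pretend dataset size */
    int a[4], b[4], i;

    srand(seed);                                 /* "thread A" */
    for (i = 0; i < 4; ++i) a[i] = rand() % m;

    srand(seed);                                 /* "thread B" reseeds with the same value */
    for (i = 0; i < 4; ++i) b[i] = rand() % m;   /* ...and picks exactly the same indices */

    for (i = 0; i < 4; ++i)
        printf("A: %4d   B: %4d\n", a[i], b[i]); /* the pairs always match */
    return 0;
}
```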
@AlexeyAB
Hmmm, things are getting complex. So because you update the seed before unlocking the mutex, we get more random values?
I think maybe your updated seeds do not cover all images randomly in a good way; that might be the reason for the extremely low training error but the 10% decrease in IoU.
After I trained with this new code, the IoU increased: I got between 75% and 76%. So why? The avg error is around 0.02.
Also, I realized that using rand() * rand(), as you had used before, increases the IoU; I never understood why.
@VanitarNordic
I have not yet tested the distribution of the random values. I'll check it out.
If it turns out not to be uniform, then I will correct it.
If it turns out to be uniform, then there are 2 options:
Try to train with my last code, but increase the learning_rate 10x in the .cfg-file and train only 200 iterations; what will the IoU be?
@AlexeyAB
Let me tell you something: although I get a better IoU, the old .weights file (from before your new update) still detects better, with higher confidence! Interesting, isn't it? Both show almost the same IoU, but one detects better!
Okay, I will do that. Is there any way to store a .weights file every 100 iterations after we go above 1000?
(for example a weight at 1100 .. 1200 .. 1300)
@VanitarNordic But what do you mean by "Both show almost the same IoU"?
You said that the old weights show IoU 75.85% but the new weights only 66.81%: https://github.com/AlexeyAB/darknet/issues/19#issuecomment-285631285
Yes, try with my commit, but with a 10x higher learning_rate, and only 200 iterations.
@AlexeyAB
I mean:
Old weights IoU (old repository, before your recent update): 75%
New weights IoU (after your recent update): 66%
New weights IoU (with my own modifications, as mentioned above): 75%
But the detection of the older one is better.
@AlexeyAB
I tried your suggestion now. A few iterations after 100, the avg errors went up, and finally it reached the -nan situation.
@AlexeyAB
Also, I had previously trained for more iterations, to 3000 and 4000, after your new update, but the IoU did not rise above 66%.
@AlexeyAB
Alright, I tested another solution, and now it works as well as before, or even better. The trick is that I used the GetTickCount() function, which has millisecond resolution:
char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    srand(GetTickCount());
    //printf("n = %d \n", n);
    for (i = 0; i < n; ++i) {
        int index = rand() % m;
        random_paths[i] = paths[index];
        //if(i == 0) printf("%s\n", paths[index]);
        //printf("grp: %s\n", paths[index]);
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
Results:

@AlexeyAB
I finally reached a solution by using the rand_s function. rand_s is a Windows-compatible function that has been available in Visual Studio since VS 2005; it generates random numbers over the whole unsigned int range. Because rand() % m generates random numbers between 0 and m-1, we have to wrap rand_s so that it also produces numbers between 0 and m-1.
Here is the code. You can verify the random generation by yourself.
// Note: rand_s() requires #define _CRT_RAND_S before including <stdlib.h>
inline int randomGen(int nMin, int nMax) {
    unsigned int Num = 0;
    rand_s(&Num);
    unsigned int n_s = nMin + Num % (nMax + 1 - nMin);
    return n_s;
}

char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    for (i = 0; i < n; ++i) {
        int index = randomGen(0, m-1);
        //printf("index = %d \n", index);
        random_paths[i] = paths[index];
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
@VanitarNordic Thank you! This looks correct. What avg_loss, IoU & Recall did you achieve with your last randomGen() function?
@AlexeyAB
You're welcome. I have learned many things from you; this was the smallest thing I could add. I experimented a lot; during training, it is also obvious from the count parameter that the numbers are correctly randomized.
Here are the results. During experiments with this new code, IoU was around 74% to 76% and Recall was around 88% to 90%.
The detection is also good.

@AlexeyAB
Also, if there is another critical area inside the code that we could replace with this new random function, just let me know, so I can modify it and make another test. Actually, I have now replaced all rand() calls inside the data.c file, and the result is okay.
@VanitarNordic
I fixed all rand() calls to random_gen(), which is implemented using rand_s(), in the last commit: https://github.com/AlexeyAB/darknet/commit/a71bdd7a83e33f28d91b88551b291627728ee3e7
I tested 3 cases and got the distributions of the random values:
Your proposal (3) with rand_s() is the closest to the normal distribution and has the lowest standard deviation from the average value:

rand_s() has only 4 unused images out of 7587 images after 1000 iterations, which is good:
rand_s() uses different values in different threads, which is good:
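For reference, a standalone sketch of the kind of check described above (an assumption about the method, not the actual test script): draw N indices, count how often each of the m images is picked, then report the unused images and the standard deviation of the per-image counts. It needs MSVC because of rand_s().

```
#define _CRT_RAND_S               /* must come before <stdlib.h> for rand_s() */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static unsigned int random_gen(void)
{
    unsigned int num = 0;
    rand_s(&num);                 /* swap in rand() here to compare generators */
    return num;
}

int main(void)
{
    const int  m = 7587;          /* dataset size mentioned above */
    const long N = 1000L * 64;    /* roughly 1000 iterations with batch=64 */
    int *count = calloc(m, sizeof(int));
    long i;
    int unused = 0;
    double mean = (double)N / m, var = 0;

    for (i = 0; i < N; ++i) count[random_gen() % m]++;

    for (i = 0; i < m; ++i) {
        if (count[i] == 0) unused++;
        var += (count[i] - mean) * (count[i] - mean);
    }
    printf("unused images: %d, mean count: %.2f, std dev: %.2f\n",
           unused, mean, sqrt(var / m));
    free(count);
    return 0;
}
```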
@AlexeyAB
I really love your scientific analysis. I didn't know about these calculations.
rand() is also used in some other areas, I mean in other files rather than just data.c. Do you think it is a good idea to replace those as well and make another test?
@VanitarNordic
Yes, you can try to replace other rand() calls with random_gen() in other files, for example in detector.c, detection_layer.c, network.c, utils.c, crop_layer.c, gemm.c and matrix.c, and re-train Yolo.
Perhaps the changes in these files, especially in detection_layer.c, network.c and crop_layer.c, can improve training.
@AlexeyAB
I did, but it makes darknet.exe crash when I start training. I think we should also be careful about the range: rand_s provides random numbers over the whole unsigned int range. Also, I just replaced them all by find-and-replace; maybe for some parts of the code this alone will not work.
@VanitarNordic
Did you use exactly this implementation of unsigned int random_gen(), or a different one?
inline unsigned int random_gen()
{
    unsigned int Num = 0;
    rand_s(&Num);
    return Num;
}
@AlexeyAB
I used your modified code. I downloaded the new repository and tried to replace them all. You can try it yourself.
@AlexeyAB
Alright, I investigated the code and found where this crash comes from. I replaced rand() with random_gen() in the files one by one, and the problem finally turned out to come from a function inside utils.c:
float rand_uniform(float min, float max)
{
    if(max < min){
        float swap = min;
        min = max;
        max = swap;
    }
    return ((float)rand()/RAND_MAX * (max - min)) + min;
}
If we replace this rand() with random_gen(), then darknet.exe crashes. What do you think?
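A possible explanation and fix (my assumption, not code from the repository): rand() never exceeds RAND_MAX (32767 on MSVC), but random_gen() returns values over the whole unsigned int range, so dividing by RAND_MAX yields values far outside [min, max], and the augmentation code then receives absurd jitter/scale parameters. Normalizing by UINT_MAX instead keeps the result in range:

```
#include <limits.h>

unsigned int random_gen(void);   /* the rand_s() wrapper shown earlier */

float rand_uniform_fixed(float min, float max)
{
    if (max < min) {
        float swap = min;
        min = max;
        max = swap;
    }
    /* divide by the actual upper bound of random_gen(), not RAND_MAX */
    return ((float)random_gen() / (float)UINT_MAX) * (max - min) + min;
}
```

The same range consideration would apply to any other place that divides the generator's output by RAND_MAX.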
@AlexeyAB
I did not touch that function, which was causing the crash, modified the rest, and re-trained the model (2000 iterations). The results got better: Recall also improved by about 4% to 6%.
I also noticed that the detection speed got slightly faster, by around 2 FPS or more, and that the detection of small objects has significantly improved, even at the 416*416 network size. The input test video resolution was 1024*768. If you solve that issue, just let me know so I can change it and re-test.

I got the same problem here while training on my own data on the CPU. I trained at least seven times, and all runs ended in failure, with a lot of nan, -nan(ind).
I followed all the steps in the README, and I used the last committed version.
At first I thought there was a "divide by zero" error. I added some code like if(count==0)count=1; before avg_iou/count, but this didn't help.
Debugging is extremely slow; it can take a very long time to reach this error.
After several failures, I found a pattern where the first nan usually occurs when the value of count suddenly becomes larger. So I deleted some images with many annotations and overlapping annotations, but that still didn't help.
My images are from the INRIA person database. I used Yolo_mark to create the annotations.
My config:
**person.data**
```
classes= 1
train = data/train.txt
valid = data/test.txt
names = data/person.names
backup = backup/
```
**person.names**
```
person
```
**yolo-person.cfg**
```
[net]
batch=1
subdivisions=1
...
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear
[region]
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
bias_match=1
classes=1
coords=4
num=5
softmax=1
jitter=.3
rescore=1
...
```
**train_person.cmd**
```
darknet_no_gpu.exe detector train data/person.data yolo-person.cfg darknet19_448.conv.23
```
This problem drives me crazy. Please help me out.
@chasonlee Hi,
How many iterations did you run at most?
There can be bugs when training Yolo on the CPU. I think very few people try to train on the CPU, because it would take many years.
What batch and subdivisions values did you set? https://github.com/AlexeyAB/darknet/blob/85d1416ff06846f11ad943e86d42b1f16bf36518/cfg/yolo-voc.cfg#L6
Try to set batch=64 and subdivisions=4
Oh man, forget about the CPU, unless you plan to spend your lifetime on it!
@AlexeyAB Hi, thanks for the quick response.
The CPU is too slow; I will find a computer with a GPU and try again.
@VanitarNordic That's true. I totally agree...