Hello,
Please look at the picture. What is the reason for this?

@VanitarNordic
Remove your fork of Darknet and re-fork it to get all the latest commits.
There was a training bug:
Commit on Jan 10, 2017 (@AlexeyAB): "fixed bug: rand() for batches of images"
It can also be a bad training dataset or a bad .cfg file.
What are the usual requirements for the training/validation images?
I mean, how should the target objects look in the training images? Should all of them be 100% clear, or is it fine if some target objects are partially covered by other, non-target objects?
@VanitarNordic
Bingo, thanks.
So, by the way, if we decided to detect a banana for example, then a good training image would be a banana in the hand of a person, not a single banana on a white background, wouldn't it?
In my case, with the condition shown above, the situation was solved when I changed the height and width in the .cfg file from 448 x 448 to 416 x 416. A bit strange.
@VanitarNordic
Yes, different positions of the object on different backgrounds are the better solution.
Great, thanks, I got my answer.
Please also reply to my question about the competition results. I have been waiting for your reply for a long time (first open issue from the bottom).
What is the problem with this:
Region Avg IOU: 0.000145, Class: 1.000000, Obj: 0.026601, No Obj: 0.073908, Avg Recall: 0.000000, count: 12
Region Avg IOU: 0.016246, Class: 1.000000, Obj: 0.082520, No Obj: 0.084514, Avg Recall: 0.000000, count: 12
Region Avg IOU: 0.018083, Class: 1.000000, Obj: 0.083319, No Obj: 0.074546, Avg Recall: 0.000000, count: 12
217: 22914277376.000000, 2292594176.000000 avg, 0.005000 rate, 3.729000 seconds, 13888 images
Loaded: 0.000000 seconds
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.025444, Avg Recall: 0.000000, count: 13
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.025746, Avg Recall: 0.000000, count: 13
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.000000, No Obj: 0.026183, Avg Recall: 0.000000, count: 13
@OPPOA113
Does -1.#IND00 appear sometimes or always? Can you show your .cfg file?
@AlexeyAB
1. It starts at iteration 217 and continues to the end, and the loss is extremely large: 22914277376.000000, 2292594176.000000 avg
2. Only two places were changed for training 1 class:
learning_rate=0.0001
max_batches = 21000
policy=steps
steps=100,1000,1500,10000
scales=10,.1,.1,.1
.......
filters=30
activation=linear
[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=1
coords=4
......
3. Training cmd:
darknet.exe detector train ./data/voc.data yolo-voc.cfg darknet19_448.conv.23 > training1_2.log
What is the general cause of this situation? I did everything according to your instructions, but there are still some problems.
@OPPOA113
It is very strange, because if your initial learning_rate=0.0001 and you set scales=10,.1,.1,.1, then the learning rate can only be 0.001, 0.0001, 0.00001 or 0.000001.
But in your output, rate = 0.005:
217: 22914277376.000000, 2292594176.000000 avg, 0.005000 rate, 3.729000 seconds, 13888 images
Loaded: 0.000000 seconds
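For reference, here is a minimal standalone sketch (an illustration, not code from the repository) of how a "steps" learning-rate policy combines learning_rate, steps and scales: each time the iteration count passes a step, the rate is multiplied by the corresponding scale, cumulatively, so with these settings a rate of 0.005 can never appear.

```
#include <stdio.h>

float current_rate(float learning_rate, const int *steps, const float *scales,
                   int num_steps, int iteration)
{
    float rate = learning_rate;
    int i;
    for (i = 0; i < num_steps; ++i) {
        if (steps[i] > iteration) break;   /* this step not reached yet */
        rate *= scales[i];                 /* cumulative scaling */
    }
    return rate;
}

int main(void)
{
    int   steps[]  = { 100, 1000, 1500, 10000 };
    float scales[] = { 10, .1f, .1f, .1f };
    int   test[]   = { 217, 1200, 2000, 20000 };
    int   i;
    /* with learning_rate=0.0001 this prints 0.001, 0.0001, 1e-05, 1e-06 --
     * 0.005 is impossible with these steps/scales */
    for (i = 0; i < 4; ++i)
        printf("iteration %5d -> rate %g\n", test[i],
               current_rate(0.0001f, steps, scales, 4, test[i]));
    return 0;
}
```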
And yes, the loss is extremely large: 22914277376.000000
The loss (error) for a batch is calculated here, return (float)sum/(n*batch);: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L280
The loss (error) for one image is calculated here, only for the last layer-30, if(net.layers[i].cost){: https://github.com/AlexeyAB/darknet/blob/b3a3e92e8a482cf1c49322561077e8f3be54d619/src/network.c#L183
Can you show your .cfg-file (you can use http://pastebin.com/index.php)? Also, add these printf calls before the two return statements, train again, and post the output:
    printf("net.n = %d, count = %d, sum = %f \n", net.n, count, sum);
    return sum/count;
}
    printf(" n = %d, batch = %d, sum = %f \n", n, batch, sum);
    return (float)sum/(n*batch);
}
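For context, a minimal standalone illustration (not repository code): the per-image costs are summed and divided by n*batch, so a single NaN cost poisons the whole batch loss, which the MSVC runtime prints as -1.#IND00.

```
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* pretend per-image costs from one batch; one image produced NaN */
    float cost[8] = { 151.2f, 145.1f, 143.0f, NAN, 146.3f, 143.6f, 148.9f, 145.2f };
    float sum = 0;
    int i;
    for (i = 0; i < 8; ++i) sum += cost[i];   /* NaN propagates through the sum */
    printf("batch loss = %f\n", sum / 8);     /* prints nan / -1.#IND00 */
    return 0;
}
```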
@OPPOA113
Also, show both files by using http://pastebin.com/index.php:
./data/voc.data and yolo-voc.cfg
And did you prepare an image dataset that has only 1 class, or more than 1 class?
@AlexeyAB
Thank you very much for your answer, but the problem has not been solved.
It now appears at the 718th iteration in the log file.
log:
Region Avg IOU: 0.123613, Class: 1.000000, Obj: 0.009103, No Obj: 0.017152, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 151.270279
Region Avg IOU: 0.141605, Class: 1.000000, Obj: 0.012890, No Obj: 0.021402, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 145.112289
Region Avg IOU: 0.159398, Class: 1.000000, Obj: 0.019039, No Obj: 0.023729, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 143.040283
Region Avg IOU: 0.134866, Class: 1.000000, Obj: 0.012214, No Obj: 0.021009, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 146.331787
Region Avg IOU: 0.156597, Class: 1.000000, Obj: 0.017798, No Obj: 0.023424, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 143.654922
Region Avg IOU: 0.125920, Class: 1.000000, Obj: 0.009804, No Obj: 0.017751, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 148.915298
Region Avg IOU: 0.136872, Class: 1.000000, Obj: 0.012135, No Obj: 0.020786, Avg Recall: 0.050000, count: 20
net.n = 31, count = 1, sum = 145.199615
Region Avg IOU: 0.146237, Class: 1.000000, Obj: 0.013695, No Obj: 0.021240, Avg Recall: 0.100000, count: 20
net.n = 31, count = 1, sum = 144.731476
n = 8, batch = 8, sum = 1168.255859
717: 18.253998, 253.565598 avg, 0.001000 rate, 3.728000 seconds, 45888 images
Loaded: 0.000000 seconds
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011391, No Obj: 0.010609, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011165, No Obj: 0.010613, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011104, No Obj: 0.010470, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011054, No Obj: 0.010635, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011532, No Obj: 0.010739, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011197, No Obj: 0.010464, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.010470, No Obj: 0.010762, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
Region Avg IOU: -1.#IND00, Class: 1.000000, Obj: 0.011000, No Obj: 0.010438, Avg Recall: 0.000000, count: 29
net.n = 31, count = 1, sum = -1.#IND00
n = 8, batch = 8, sum = -1.#IND00
718: -1.#IND00, -1.#IND00 avg, 0.001000 rate, 3.885000 seconds, 45952 images
Loaded: 0.000000 seconds
D:\YOLO\ZEM_trainzem_data\Image\0949.jpg
D:\YOLO\ZEM_trainzem_data\Image\1266.jpg
D:\YOLO\ZEM_trainzem_data\Image\1869.jpg
D:\YOLO\ZEM_trainzem_data\Image\0882.jpg
D:\YOLO\ZEM_trainzem_data\Image\1844.jpg
.........
yolo-voc.cfg:
learning_rate=0.0001
max_batches = 41000
policy=steps
steps=100,25000,35000
scales=10,.1,.1,
...
pad=1
filters=30
activation=linear
[region]
anchors = 1.08,1.19, 3.42,4.41, 6.63,11.38, 9.42,5.11, 16.62,10.52
bias_match=1
classes=1
coords=4
voc.data:
classes= 1
train = D:/YOLO/ZEM_train/zem_data/voc_person_train.txt
valid = D:/YOLO/ZEM_train/zem_data/person_val.txt
names = D:/YOLO/ZEM_train/zem_data/pasacal.names
backup = D:/YOLO/ZEM_train/backup/
training cmd:
darknet.exe detector train D:/YOLO/ZEM_train/zem_data/voc.data D:/YOLO/ZEM_train/yolo-voc.cfg D:/YOLO/ZEM_train/darknet19_448.conv.23 > training2.log
testing cmd:
darknet.exe detector test D:/YOLO/ZEM_train/zem_data/voc.data D:/YOLO/ZEM_train/yolo-voc.cfg D:/YOLO/ZEM_train/backup/yolo-voc_600.weights -i 0 -thresh 0.058124
Nothing was detected......
There is a problem with your dataset.
Besides, the threshold should be a value between 0.1 and 0.9 (default 0.25).
Rename the root folder containing your images to obj (as mentioned in the description) and train again.
@VanitarNordic @AlexeyAB
Dataset? Which one do you mean?
Thresholds between 0.1 and 0.9 give the same result: nothing is detected...
My file structure is shown in Figure:



Follow the author's instructions and put your images in a folder with the name he mentioned: change "zem_data" to "obj" and see the difference.
@OPPOA113
Try to update darknet from my fork, last commit fix some bug with rand() on Windows: https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548
Try to open your dataset by using Yolo-mark: https://github.com/AlexeyAB/Yolo_mark
Look at each image by pressing the space bar, and at the end press ESC.
Are all bounding boxes correct?
Try to uncomment this line printf("grp: %s\n", paths[index]);: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L56
for(i = 0; i < n; ++i){
    int index = rand()%m;
    random_paths[i] = paths[index];
    //if(i == 0) printf("%s\n", paths[index]);
    printf("grp: %s\n", paths[index]);
}
Try to train again, and post the output when the error occurs.
Also, you don't need to train all 45000 iterations to make some tests: train for 1000 iterations, interrupt it, and just check whether everything is correct by running some detections, although with lower accuracy.
@AlexeyAB
Finally, it works. Thank you.
I also have a question about small-target detection boxes: there is some offset in the results. What is the main problem, and where should I fix it? Should I add more small-target training images, adjust some parameters, or even change the network? The test results using the official model are shown below.
Please give me some advice. Thank you.

Try to change height=608 and width=608 in your cfg-file: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/cfg/yolo-voc.cfg
And use it with the same trained weights file for detection on the same image. Show the resulting image here.
@AlexeyAB
Much better now, thank you.
@AlexeyAB
What makes the difference between running a training command by executing a .cmd file (for example train.cmd), writing the command in the console, or running train.cmd from inside the console?
When I execute the train.cmd file directly, it causes a -nan problem between iterations 300-400, as in the first picture in this issue, BUT when I start training by writing the command in the console, or by executing train.cmd inside the console, it trains flawlessly to the end without any problem!
This is the command: darknet.exe detector train data/obj.data yolo-obj.cfg darknet19_448.conv.23
I have no idea!
@VanitarNordic There is no difference; they must behave the same. Perhaps the image dataset is wrong and the error occurs randomly.
@AlexeyAB
Not really, everything is correct with the dataset. If it were a wrong dataset, why can I train it for a very long time, to 6000 iterations or more? It only happens when I execute train.cmd directly. When it is about to happen, I see recall=0 in the iterations before the -nan appears. Besides, the dataset is really too small to have unseen errors: it only contains 200 images. Impossible.
@VanitarNordic This is very strange. Can you reproduce this failure many times?
Maybe some dll-files or other .data/.cfg/.weights files are taken from different paths when you click train.cmd.
This is very strange. Can you reproduce this failure many times?
No, I cannot.
Now I tested again and it did not happen. As I said, the dataset is quite small, I annotated it with yolo_mark, and I have tested it at least 20-30 times, but the only time it happened was when I ran train.cmd directly. It is impossible for the dataset to have errors, and even if it did, I should not be able to train it for several thousand iterations.
@VanitarNordic Also, rarely, NaN can occur occasionally even with a well-formed dataset. I have also met such a failure, but re-training without any changes ended well.
I think that since it does not reproduce, it happened by accident, no matter how you started the training.
Did you update to the last commits, especially this fix from 2 days ago? https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548
There was a bug where a single batch loaded the same images 8 times; this worsened the training and increased the likelihood of such a failure.
Did you update to the last commits, especially this fix from 2 days ago?
No, I have not.
There was a bug where a single batch loaded the same images 8 times; this worsened the training and increased the likelihood of such a failure.
Yes, that could be the reason for it.
One question: do these bugs exist inside Darknet Yolo itself, or do they exist because of the modifications in the Windows distribution?
@VanitarNordic It was only in the Windows fork, because the rand() function works differently than on Linux.
@AlexeyAB
Let me ask this question here.
I downloaded the updated repository and tried to build, but I get one error (and also 1615 warnings):
LNK1104 cannot open file 'opencv_core249.lib'
It is strange, because I can compile the older repository without any problem.
@VanitarNordic
Do you use OpenCV 2.4.9 in the directory C:\opencv_2.4.9?
Check all the how-to-compile settings: https://github.com/AlexeyAB/darknet#how-to-compile
1. paths: C:\opencv_2.4.9\opencv\build\include & C:\opencv_2.4.9\opencv\build\x64\vc12\lib or vc14\lib
3.2 (right click on project) -> properties -> Linker -> General -> Additional Library Directories
The problem was there and is now solved: I had to change vc14 to vc12. It seems you had modified it. Would you please share your vc14 version within the repository?
@VanitarNordic Yes, thank you. Correct line 148 in darknet.vcxproj, not line 94.
Fixed: https://github.com/AlexeyAB/darknet/blame/master/build/darknet/darknet.vcxproj#L148
@AlexeyAB
I tried the modified code and trained again (the dataset is exactly the same). These are the results of training after 2000 iterations. It is strange that the validation result got worse:
1) Old repository, before the recent update: training = 0.04 avg, validation = 76%
2) New repository, after the recent update: training = 0.003 avg, validation = 67%
The validation result got worse by around 10%, which is significant!
Strange.
@VanitarNordic detector recall doesn't show error, it shows precision. Do you mean IoU?
The average training error is small, which is good.
@AlexeyAB
Yes, I mean IoU, which has decreased by about 10%, and I see the difference when I test on unseen pictures. 76% is definitely better than 67%.
@AlexeyAB
Identical dataset, 2000 iterations.
1) Old repository:

2) Updated repository:

@VanitarNordic It would be strange if the old bug improved the learning curve :)
What steps, scales, learning_rate and random parameters did you use?
Try to set steps=100,1200,1750 and train again; what will the IoU be? https://github.com/AlexeyAB/darknet/blob/master/cfg/yolo-voc.cfg#L17
Also, did you use cuDNN in both cases? (5.) https://github.com/AlexeyAB/darknet#how-to-compile
By default, CUDNN is disabled in the new repository.
@AlexeyAB
I used exactly the same cfg files, so everything is identical (including steps and so on). Yes, I enabled CUDNN in both cases, compiled, and generated darknet.exe.

I don't know why the training error decreased while the precision also decreased. Our goal should be to increase the IoU and Recall.
@VanitarNordic Yes, IoU and Recall should be increased.
Try to set float thresh = .2; in the validate_detector_recall() function and run detector recall again, without re-training: https://github.com/AlexeyAB/darknet/blob/47409529d0eb935fa7bafbe2b3484431117269f5/src/detector.c#L385
@VanitarNordic I'll test on my datasets, and if I get the same results, then maybe I'll think about leaving that bug in :)
@AlexeyAB
Yes, I had changed another part; I found it, modified it, and compiled again. This is the proof:

But oh my goodness, now the IoU has decreased even further! RPs/Img has changed.

@AlexeyAB
I did some tests again. I copied an old .weights file from the old repository folder and tested it with the binaries compiled from the new repository. The results are as good as the old ones.
Therefore, we can conclude that there is a problem in training with the new repository and the resulting .weights file.
@VanitarNordic Yes. Also, try validating the old weights with float thresh = .2; and show the result: RPs, IoU and Recall.
I'm trying to figure out what's going on.
@AlexeyAB
I did. Old weights, using the new repository with float thresh = .2;

@AlexeyAB
After many, many experiments, I found a solution to this. Do you want me to update the code as a contributor, or share it here?
The problem was your method of seeding srand.
@VanitarNordic
As you prefer: right here, or create a pull request.
What are the details of the error?
@AlexeyAB
There is no error in your code, but I updated the method of seeding srand.
The rand() function in C is not as good as its successors in C++, and we have to seed it with srand(). time(0) is a good seed for that, but it can be improved a bit with a different method.
This article helped me to improve the seeding: http://www.eternallyconfuzzled.com/arts/jsw_art_rand.aspx
Just using srand(time(0)) will work, but using this extra function makes it better:
unsigned time_seed()
{
    time_t now = time(0);
    unsigned char *p = (unsigned char *)&now;   /* cast so the bytes of time_t can be read */
    unsigned seed = 0;
    for (size_t i = 0; i < sizeof now; i++)
    {
        seed = seed * (UCHAR_MAX + 2U) + p[i];
    }
    return seed;
}

char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    srand(time_seed());
    //printf("n = %d \n", n);
    for (i = 0; i < n; ++i) {
        int index = rand() % m;
        random_paths[i] = paths[index];
        //if(i == 0) printf("%s\n", paths[index]);
        //printf("grp: %s\n", paths[index]);
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
@VanitarNordic
But in my commit-fix, I use srand(time(0)) only as the initial seed, and each subsequent time I use srand(mt_seed); https://github.com/AlexeyAB/darknet/commit/4422399e298e40629db70642e781ddd76f460548#diff-2ceac7e68fdac00b370188285ab286f7R50
This is done because the function char **get_random_paths(char **paths, int n, int m) is called from 8 threads simultaneously: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L780
But time(0) returns the time accurate only to seconds: http://en.cppreference.com/w/c/chrono/time
So if we use srand(time(0)) or any variant of srand(time_seed()), then all 8 threads get the same seed value because they execute in parallel; a batch then uses 8 times fewer distinct images, but each image 8 times.
So your fix would again use srand(time(0)) as before my fix, maybe with a better time function time_seed(). Then again all 8 threads will get the same value, and a batch will use 8 times fewer distinct images, but each image 8 times: https://github.com/AlexeyAB/darknet/blob/4422399e298e40629db70642e781ddd76f460548/src/data.c#L780
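To illustrate the point, here is a small standalone demo (hypothetical, not part of Darknet): two workers that seed with the same time(0) value draw exactly the same image indices, which is why a batch built from 8 such threads repeats the same images.

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    unsigned seed = (unsigned)time(0);  /* both workers start within the same second */
    const int m = 7587;                 /* pretend dataset size */
    int a[4], b[4], i;

    srand(seed);                                 /* "thread A" */
    for (i = 0; i < 4; ++i) a[i] = rand() % m;

    srand(seed);                                 /* "thread B" reseeds with the same value */
    for (i = 0; i < 4; ++i) b[i] = rand() % m;   /* ...and picks exactly the same indices */

    for (i = 0; i < 4; ++i)
        printf("A: %4d   B: %4d\n", a[i], b[i]); /* the pairs always match */
    return 0;
}
```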
@AlexeyAB
Hmmm, things are getting complex. So because you update the seed before unlocking the mutex, we get more random values?
I think maybe your updated seeds do not cover all images randomly in a good way; that might be the reason for the extremely low training error but the 10% decrease in IoU.
After I trained with this new code, the IoU increased: I got between 75% and 76%. So why? The avg error is around 0.02.
Also, I realized that using rand() * rand(), as you had used before, increases the IoU; I never understood why.
@VanitarNordic
I have not yet tested the distribution of the random values. I'll check it out.
If it turns out not to be uniform, then I will correct it.
If it turns out to be uniform, then there are 2 options:
Try to train with my last code, but increase the learning_rate 10x in the .cfg-file and train only 200 iterations; what will the IoU be?
@AlexeyAB
Let me tell you something: although I get a better IoU, the old .weights file (from before your new update) still detects better, with higher confidence! Interesting, isn't it? Both show almost the same IoU, but one detects better!
Okay, I will do that. Is there any way to store a .weights file every 100 iterations after we go above 1000?
(for example a weight at 1100 .. 1200 .. 1300)
@VanitarNordic But what do you mean by "Both show almost the same IoU"?
You said that the old weights show IoU 75.85% but the new weights only 66.81%: https://github.com/AlexeyAB/darknet/issues/19#issuecomment-285631285
Yes, try with my commit, but with a 10x higher learning_rate, and only 200 iterations.
@AlexeyAB
I mean:
Old weights IoU (old repository, before your recent update): 75%
New weights IoU (after your recent update): 66%
New weights IoU (with my own modifications, as mentioned above): 75%
But the detection of the older one is better.
@AlexeyAB
I tried your suggestion now. A few iterations after 100, the avg errors went up, and finally it reached the -nan situation.
@AlexeyAB
Also, I had previously trained for more iterations, to 3000 and 4000, after your new update, but the IoU did not rise above 66%.
@AlexeyAB
Alright, I tested another solution, and now it works as well as before, or even better. The trick is that I used the GetTickCount() function, which has millisecond resolution:
char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    srand(GetTickCount());
    //printf("n = %d \n", n);
    for (i = 0; i < n; ++i) {
        int index = rand() % m;
        random_paths[i] = paths[index];
        //if(i == 0) printf("%s\n", paths[index]);
        //printf("grp: %s\n", paths[index]);
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
Results:

@AlexeyAB
I finally reached a solution by using the rand_s function. rand_s is a Windows-compatible function that has been available in Visual Studio since VS 2005; it generates random numbers over the whole unsigned int range. Because rand() % m generates random numbers between 0 and m-1, we have to wrap rand_s so that it also produces numbers between 0 and m-1.
Here is the code. You can verify the random generation by yourself.
// Note: rand_s() requires #define _CRT_RAND_S before including <stdlib.h>
inline int randomGen(int nMin, int nMax) {
    unsigned int Num = 0;
    rand_s(&Num);
    unsigned int n_s = nMin + Num % (nMax + 1 - nMin);
    return n_s;
}

char **get_random_paths(char **paths, int n, int m)
{
    char **random_paths = calloc(n, sizeof(char*));
    int i;
    pthread_mutex_lock(&mutex);
    for (i = 0; i < n; ++i) {
        int index = randomGen(0, m-1);
        //printf("index = %d \n", index);
        random_paths[i] = paths[index];
    }
    pthread_mutex_unlock(&mutex);
    return random_paths;
}
@VanitarNordic Thank you! This looks correct. What avg_loss, IoU & Recall did you achieve with your last randomGen() function?
@AlexeyAB
You're welcome. I have learned many things from you; this was the smallest thing I could add. I experimented a lot; during training, it is also obvious from the count parameter that the numbers are correctly randomized.
Here are the results. During experiments with this new code, IoU was around 74% to 76% and Recall was around 88% to 90%.
The detection is also good.

@AlexeyAB
Also, if there is another critical area inside the code that we could replace with this new random function, just let me know, so I can modify it and make another test. Actually, I have now replaced all rand() calls inside the data.c file, and the result is okay.
@VanitarNordic
I fixed all rand() calls to random_gen(), which is implemented using rand_s(), in the last commit: https://github.com/AlexeyAB/darknet/commit/a71bdd7a83e33f28d91b88551b291627728ee3e7
I tested 3 cases and got the distributions of the random values:
Your proposal (3) with rand_s() is the closest to the normal distribution and has the lowest standard deviation from the average value:

rand_s() has only 4 unused images out of 7587 images after 1000 iterations, which is good:
rand_s() uses different values in different threads, which is good:
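For reference, a standalone sketch of the kind of check described above (an assumption about the method, not the actual test script): draw N indices, count how often each of the m images is picked, then report the unused images and the standard deviation of the per-image counts. It needs MSVC because of rand_s().

```
#define _CRT_RAND_S               /* must come before <stdlib.h> for rand_s() */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static unsigned int random_gen(void)
{
    unsigned int num = 0;
    rand_s(&num);                 /* swap in rand() here to compare generators */
    return num;
}

int main(void)
{
    const int  m = 7587;          /* dataset size mentioned above */
    const long N = 1000L * 64;    /* roughly 1000 iterations with batch=64 */
    int *count = calloc(m, sizeof(int));
    long i;
    int unused = 0;
    double mean = (double)N / m, var = 0;

    for (i = 0; i < N; ++i) count[random_gen() % m]++;

    for (i = 0; i < m; ++i) {
        if (count[i] == 0) unused++;
        var += (count[i] - mean) * (count[i] - mean);
    }
    printf("unused images: %d, mean count: %.2f, std dev: %.2f\n",
           unused, mean, sqrt(var / m));
    free(count);
    return 0;
}
```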
@AlexeyAB
I really love your scientific analysis. I didn't know about these calculations.
rand() is also used in some other areas, I mean in other files rather than just data.c. Do you think it is a good idea to replace those as well and make another test?
@VanitarNordic
Yes, you can try to replace other rand() calls with random_gen() in other files, for example in detector.c, detection_layer.c, network.c, utils.c, crop_layer.c, gemm.c and matrix.c, and re-train Yolo.
Perhaps the changes in these files, especially in detection_layer.c, network.c and crop_layer.c, can improve training.
@AlexeyAB
I did, but it makes darknet.exe crash when I start training. I think we should also be careful about the range: rand_s provides random numbers over the whole unsigned int range. Also, I just replaced them all by find-and-replace; maybe for some parts of the code this alone will not work.
@VanitarNordic
Did you use exactly this implementation of unsigned int random_gen(), or a different one?
inline unsigned int random_gen()
{
    unsigned int Num = 0;
    rand_s(&Num);
    return Num;
}
@AlexeyAB
I used your modified code. I downloaded the new repository and tried to replace them all. You can try it yourself.
@AlexeyAB
Alright, I investigated the code and found where this crash comes from. I replaced rand() with random_gen() in the files one by one, and the problem finally turned out to come from a function inside utils.c:
float rand_uniform(float min, float max)
{
    if(max < min){
        float swap = min;
        min = max;
        max = swap;
    }
    return ((float)rand()/RAND_MAX * (max - min)) + min;
}
If we replace this rand() with random_gen(), then darknet.exe crashes. What do you think?
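A possible explanation and fix (my assumption, not code from the repository): rand() never exceeds RAND_MAX (32767 on MSVC), but random_gen() returns values over the whole unsigned int range, so dividing by RAND_MAX yields values far outside [min, max], and the augmentation code then receives absurd jitter/scale parameters. Normalizing by UINT_MAX instead keeps the result in range:

```
#include <limits.h>

unsigned int random_gen(void);   /* the rand_s() wrapper shown earlier */

float rand_uniform_fixed(float min, float max)
{
    if (max < min) {
        float swap = min;
        min = max;
        max = swap;
    }
    /* divide by the actual upper bound of random_gen(), not RAND_MAX */
    return ((float)random_gen() / (float)UINT_MAX) * (max - min) + min;
}
```

The same range consideration would apply to any other place that divides the generator's output by RAND_MAX.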
@AlexeyAB
I did not touch that function, which was causing the crash, modified the rest, and re-trained the model (2000 iterations). The results got better: Recall also improved by about 4% to 6%.
I also noticed that the detection speed got slightly faster, by around 2 FPS or more, and that the detection of small objects has significantly improved, even at the 416*416 network size. The input test video resolution was 1024*768. If you solve that issue, just let me know so I can change it and re-test.

I got the same problem here while training on my own data on the CPU. I trained at least seven times, and all runs ended in failure, with a lot of nan, -nan(ind).
I followed all the steps in the README, and I used the last committed version.
At first I thought there was a "divide by zero" error. I added some code like if(count==0)count=1; before avg_iou/count, but this didn't help.
Debugging is extremely slow; it can take a very long time to reach this error.
After several failures, I found a pattern where the first nan usually occurs when the value of count suddenly becomes larger. So I deleted some images with many annotations and overlapping annotations, but that still didn't help.
My images are from the INRIA person database. I used Yolo_mark to create the annotations.
My config:
**person.data**
```
classes= 1
train = data/train.txt
valid = data/test.txt
names = data/person.names
backup = backup/
```
**person.names**
```
person
```
**yolo-person.cfg**
```
[net]
batch=1
subdivisions=1
...
[convolutional]
size=1
stride=1
pad=1
filters=30
activation=linear
[region]
anchors = 1.3221, 1.73145, 3.19275, 4.00944, 5.05587, 8.09892, 9.47112, 4.84053, 11.2364, 10.0071
bias_match=1
classes=1
coords=4
num=5
softmax=1
jitter=.3
rescore=1
...
```
**train_person.cmd**
```
darknet_no_gpu.exe detector train data/person.data yolo-person.cfg darknet19_448.conv.23
```
This problem drives me crazy. Please help me out.
@chasonlee Hi,
How many iterations did you run at most?
There can be bugs when training Yolo on the CPU. I think very few people try to train on the CPU, because it would take many years.
What batch and subdivisions values did you set? https://github.com/AlexeyAB/darknet/blob/85d1416ff06846f11ad943e86d42b1f16bf36518/cfg/yolo-voc.cfg#L6
Try to set batch=64 and subdivisions=4
Oh man, forget about the CPU, unless you plan to spend your lifetime on it!
@AlexeyAB Hi, thanks for the quick response.
The CPU is too slow; I will find a computer with a GPU and try again.
@VanitarNordic That's true. I totally agree...