Darknet: I get all NaN after 5000 iterations. Did something go wrong?

Created on 20 Apr 2018 · 18 Comments · Source: pjreddie/darknet

My log looks like this:

Region Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.004019, Avg Recall: 0.000000,  count: 1
931: 86426484736.000000, 8648297472.000000 avg, 0.000751 rate, 0.089617 seconds, 931 images
Loaded: 0.000031 seconds
Region Avg IOU: 0.000000, Class: 0.000000, Obj: 0.800000, No Obj: 0.784000, Avg Recall: 0.000000,  count: 5
932: 488075076341487108096.000000, 48807505874930106368.000000 avg, 0.000755 rate, 0.272508 seconds, 932 images
Loaded: 0.000040 seconds
Region Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 1.000000, Avg Recall: -nan,  count: 0
933: inf, inf avg, 0.000758 rate, 0.270606 seconds, 933 images
........

After that, everything is NaN. Has anyone run into this error?

Most helpful comment

@khiemntu I ran into the same issue, but I found I had made a mistake in the cfg:

[net]
# Testing
batch=1
subdivisions=1
# Training
 batch=64
 subdivisions=16

I changed it to:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
 batch=64
 subdivisions=16

Then it was fine, haha.
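The mistake above is that batch and subdivisions are each set twice in the same [net] section. A minimal Python sketch for spotting that kind of duplicate is shown below. The function name find_duplicates is made up for illustration, and the parsing is a simplified reading of the cfg layout shown in this thread, not Darknet's own parser. It only flags duplicates; which of the two values Darknet actually uses depends on its parser and is not decided here.

```python
def find_duplicates(cfg_text):
    """Return (section, key, values) for every key set more than once in one section."""
    seen = {}              # (section index, section name, key) -> list of values
    index, name = -1, None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        if line.startswith("[") and line.endswith("]"):
            # Sections such as [convolutional] repeat, so number them.
            index, name = index + 1, line[1:-1]
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            seen.setdefault((index, name, key), []).append(value)
    return [(n, k, v) for (_, n, k), v in seen.items() if len(v) > 1]


if __name__ == "__main__":
    cfg = """[net]
# Testing
batch=1
subdivisions=1
# Training
 batch=64
 subdivisions=16
"""
    print(find_duplicates(cfg))
    # [('net', 'batch', ['1', '64']), ('net', 'subdivisions', ['1', '16'])]
```

With the testing lines commented out, the same call returns an empty list.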

All 18 comments

I think it is always caused by a bad sample in that iteration...

What is a bad sample? Can you explain in more detail?

YOLOv3 uses independent logistic classifiers with binary cross-entropy loss. If any training sample drives the sigmoid output to 0, the loss has to evaluate log(0) and ends up as nan.
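To make the log(0) argument concrete, here is a small Python sketch of a textbook binary cross-entropy for a single prediction. It is a generic illustration, not Darknet's actual loss code; the function name bce and the eps clamp are assumptions added for the example. Strictly, log(0) gives an infinite loss first, and nan shows up once infinities are combined (for example inf - inf).

```python
import math


def bce(p, y, eps=0.0):
    """Binary cross-entropy for one prediction p in [0, 1] and label y in {0, 1}.

    eps > 0 clamps p into [eps, 1 - eps]; eps = 0 leaves p unclamped.
    """
    p = min(max(p, eps), 1.0 - eps)

    def safe_log(x):
        # math.log(0) raises ValueError, so return the limit value instead.
        return math.log(x) if x > 0.0 else -math.inf

    return -(y * safe_log(p) + (1 - y) * safe_log(1.0 - p))


if __name__ == "__main__":
    loss = bce(0.0, 1)                 # sigmoid output 0 for a positive label
    print(math.isinf(loss))            # the loss itself is infinite
    print(math.isnan(loss - loss))     # combining infinities produces nan
    print(math.isfinite(bce(0.0, 1, eps=1e-7)))  # clamping keeps it finite
```

Clamping the probability is one common way to keep such a loss finite; whether it applies to a given Darknet build is not something this thread establishes.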

How can I fix it? I cannot train for more iterations while this error persists. But when I set random=0 in the cfg file, the error doesn't appear. I don't know why.

I am not sure of the real reason. If it is caused by what I described, the bad training sample has to be found and removed. I ran into this situation, but it didn't have a large impact on the following iterations, so I just ignored it... @khiemntu
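The thread never pins down what a "bad sample" is, so the following Python sketch rests on an assumption: that it means a malformed annotation. Darknet label files hold one box per line as `<class> <x_center> <y_center> <width> <height>`, with coordinates normalized to the image size. The function names check_label_line and scan_labels are hypothetical helpers, not part of Darknet.

```python
import math
from pathlib import Path


def check_label_line(line, num_classes):
    """Return a list of problems found in one Darknet-style label line."""
    parts = line.split()
    if len(parts) != 5:
        return [f"expected 5 fields, got {len(parts)}"]
    try:
        cls = int(parts[0])
        x, y, w, h = (float(v) for v in parts[1:])
    except ValueError:
        return ["unparseable field"]
    problems = []
    if not 0 <= cls < num_classes:
        problems.append(f"class id {cls} out of range")
    if not all(math.isfinite(v) for v in (x, y, w, h)):
        problems.append("nan or inf coordinate")
        return problems
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        problems.append("box centre outside [0, 1]")
    if not (0.0 < w <= 1.0 and 0.0 < h <= 1.0):
        problems.append("box width/height outside (0, 1]")
    return problems


def scan_labels(label_dir, num_classes):
    """Map (file, line number) to the problems found, for every .txt file in label_dir."""
    bad = {}
    for path in sorted(Path(label_dir).glob("*.txt")):
        for number, line in enumerate(path.read_text().splitlines(), 1):
            if line.strip():
                issues = check_label_line(line, num_classes)
                if issues:
                    bad[(str(path), number)] = issues
    return bad
```

For Pascal VOC, num_classes would be 20. An empty result from scan_labels only rules out these specific defects; it does not prove the dataset is the cause or not.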

@YefeiGao I just followed the instructions on the website and got a similar error. I don't think the dataset is the problem: I used exactly the Pascal VOC 2007 dataset, as instructed on the official darknet website.

Hi @smajida, I hope #715 answers the question.

@YefeiGao Thanks, that is a help indeed. It resolved my concern.

@smajida How did you solve this problem?

@khiemntu Actually there is no solution for that, as suggested by https://github.com/pjreddie/darknet/issues/715
If the average loss stays nan for a while, training went wrong, but in my case it was no longer nan after 3000 iterations.
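One way to tell a short nan blip from a run that "stays nan for a while" is to scan the training log for consecutive iterations with a non-finite average loss. The Python sketch below assumes the log line layout shown in the original post (`<iter>: <loss>, <avg> avg, ...`); the name nonfinite_runs is made up for the example.

```python
import math
import re

# Matches lines such as "933: inf, inf avg, 0.000758 rate, ..."
LOSS_LINE = re.compile(r"^(\d+): (\S+), (\S+) avg,")


def nonfinite_runs(lines):
    """Return (first_iter, last_iter) for each consecutive run of non-finite avg loss."""
    runs = []
    start = last = None
    for line in lines:
        match = LOSS_LINE.match(line.strip())
        if match is None:
            continue  # "Region Avg IOU" and "Loaded" lines are skipped
        iteration = int(match.group(1))
        try:
            avg = float(match.group(3))  # float() accepts "inf", "nan", "-nan"
        except ValueError:
            continue
        if not math.isfinite(avg):
            if start is None:
                start = iteration
            last = iteration
        elif start is not None:
            runs.append((start, last))
            start = None
    if start is not None:
        runs.append((start, last))
    return runs
```

A run that closes after a few iterations matches the "it recovered" experience described above; a run that extends to the end of the log matches the original report.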

Hi,

From your log, I think you set batch=1 (since each iteration only has one image). That might be the key to the problem. If so, set batch=64, subdivisions=8 in the yolo.cfg file; that should help solve the problem.
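The inference from the log can be written down as a quick check: if the trailing image counter is cumulative, dividing it by the iteration number gives the images consumed per iteration. The Python sketch below assumes that, and assumes the log format from the original post; images_per_iteration is a hypothetical helper name.

```python
import re

# Matches "<iter>: <loss>, <avg> avg, <rate> rate, <secs> seconds, <n> images"
LOG_LINE = re.compile(
    r"^(\d+): (\S+), (\S+) avg, (\S+) rate, (\S+) seconds, (\d+) images$"
)


def images_per_iteration(line):
    """Return images / iteration for one summary line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    iteration = int(match.group(1))
    images = int(match.group(6))
    if iteration == 0:
        return None
    return images / iteration


if __name__ == "__main__":
    line = ("931: 86426484736.000000, 8648297472.000000 avg, "
            "0.000751 rate, 0.089617 seconds, 931 images")
    print(images_per_iteration(line))  # 1.0
```

A value of 1.0, as in the posted log, is what one would expect if training ran with batch=1; a value of 64 would correspond to batch=64.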

Hi @Pattorio
I have set batch=64 and subdivisions=8, but it still happens.

Hi @khiemntu, I am trying to solve this problem with my model built on darkflow. Have you come up with any solutions since your last post?

Hi @kribby, I don't have any solution; it still happens. Do you have any ideas?

@khiemntu I'm afraid I don't - still trying to work it out

I am also getting the same error. Can anybody advise how to resolve this?

I had this happen when I tried to set Channels = 1.
