Darknet: All values become NaN after 5000 iterations. Did something go wrong?

Created on 20 Apr 2018 · 18 Comments · Source: pjreddie/darknet

My training log looks like this:

Region Avg IOU: 0.000000, Class: 0.000000, Obj: 0.000000, No Obj: 0.004019, Avg Recall: 0.000000,  count: 1
931: 86426484736.000000, 8648297472.000000 avg, 0.000751 rate, 0.089617 seconds, 931 images
Loaded: 0.000031 seconds
Region Avg IOU: 0.000000, Class: 0.000000, Obj: 0.800000, No Obj: 0.784000, Avg Recall: 0.000000,  count: 5
932: 488075076341487108096.000000, 48807505874930106368.000000 avg, 0.000755 rate, 0.272508 seconds, 932 images
Loaded: 0.000040 seconds
Region Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 1.000000, Avg Recall: -nan,  count: 0
933: inf, inf avg, 0.000758 rate, 0.270606 seconds, 933 images
........

After that, every value is NaN. Has anyone else run into this error?
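
For anyone trying to pin down where a run like this first diverges, here is a minimal Python sketch (not part of Darknet) that scans a saved log for the first iteration whose loss or average loss is inf or NaN. It assumes the summary lines have the shape shown in the log above, `<iteration>: <loss>, <avg> avg, ...`; the names `first_non_finite` and `SAMPLE_LOG` are made up for illustration.

```python
import math
import re

# Matches summary lines such as
# "933: inf, inf avg, 0.000758 rate, 0.270606 seconds, 933 images".
# "Region Avg IOU: ..." lines do not start with a number, so they are skipped.
SUMMARY_RE = re.compile(r"^\s*(\d+):\s*([^,\s]+),\s*([^,\s]+) avg,")

# Summary lines copied from the log in this issue.
SAMPLE_LOG = """\
931: 86426484736.000000, 8648297472.000000 avg, 0.000751 rate, 0.089617 seconds, 931 images
932: 488075076341487108096.000000, 48807505874930106368.000000 avg, 0.000755 rate, 0.272508 seconds, 932 images
933: inf, inf avg, 0.000758 rate, 0.270606 seconds, 933 images
"""


def first_non_finite(log_lines):
    """Return (iteration, loss) for the first inf/NaN summary line, else None."""
    for line in log_lines:
        match = SUMMARY_RE.match(line)
        if match is None:
            continue
        iteration = int(match.group(1))
        # float() accepts "inf", "nan" and "-nan" as written in the log.
        loss = float(match.group(2))
        avg = float(match.group(3))
        if not (math.isfinite(loss) and math.isfinite(avg)):
            return iteration, loss
    return None


if __name__ == "__main__":
    print(first_non_finite(SAMPLE_LOG.splitlines()))
```

On the lines above it reports iteration 933, one step after the loss had already jumped to roughly 4.9e20, so the blow-up starts before the first inf is printed.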

khiemntu

All 18 comments

I think it is always caused by a bad sample in that iteration...

YefeiGao on 20 Apr 2018

What is a bad sample? Can you explain in more detail?

khiemntu on 20 Apr 2018

YOLOv3 uses independent logistic classifiers with a binary cross-entropy loss. If any training sample drives the sigmoid output to 0, the loss becomes NaN because of the log(0) term.
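
To illustrate the log(0) point, here is a small standalone Python sketch. It is not Darknet's loss code, and whether Darknet actually hits this path is the hypothesis of this comment, not something the sketch proves. `ieee_log` is a helper invented here to mimic C's `log`, which returns -inf at 0 instead of raising as Python's `math.log` does, and the `eps` clamp is one common workaround, not a fix confirmed in this thread.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # underflows to exactly 0.0 for very negative x
    return e / (1.0 + e)


def ieee_log(v):
    """log() that returns -inf at 0, like C, instead of raising (illustrative helper)."""
    return math.log(v) if v > 0.0 else -math.inf


def naive_bce(p, y):
    """Binary cross-entropy with no protection against p == 0 or p == 1."""
    return -(y * ieee_log(p) + (1 - y) * ieee_log(1.0 - p))


def clamped_bce(p, y, eps=1e-7):
    """Same loss, but with p clamped into [eps, 1 - eps] so log() stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


if __name__ == "__main__":
    p = sigmoid(-800.0)          # saturates to exactly 0.0 in double precision
    print(naive_bce(p, 1))       # log(0) term makes the loss infinite
    print(naive_bce(p, 0))       # 0 * -inf makes the loss NaN
    print(clamped_bce(p, 1))     # finite after clamping
```

Once a single inf or NaN loss reaches the weight update, every later value can turn into NaN, which would match the log in the question. In single precision the sigmoid saturates at far less extreme inputs than the -800 used here.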

YefeiGao on 20 Apr 2018

👍2

How can I fix it? I cannot train for more iterations while this error persists. But when I set random=0 in the cfg file, the error doesn't appear, and I don't know why.

khiemntu on 21 Apr 2018

@khiemntu I am not sure of the real reason. If it is caused by what I described, the bad training sample has to be found and removed. I ran into this situation too, but it didn't have a large impact on the following iterations, so I just ignored it...

YefeiGao on 21 Apr 2018

@YefeiGao I just followed the instructions on the website and got a similar error. I don't think the dataset is the problem, since I used exactly the Pascal VOC 2007 dataset as instructed by the official Darknet website.

arasharchor on 23 Apr 2018

👍2

Hi @smajida, I hope this answers the question: #715

YefeiGao on 24 Apr 2018

👍1

@YefeiGao Thanks, that is indeed helpful. It resolved my concern.

arasharchor on 25 Apr 2018

🎉1 👍1

@smajida How did you solve this problem?

khiemntu on 25 Apr 2018

@khiemntu Actually, there is no solution for it, as suggested by https://github.com/pjreddie/darknet/issues/715
If the average loss stays NaN for a while, training has gone wrong, but in my case it was no longer NaN after 3000 iterations.

arasharchor on 25 Apr 2018

Hi,

From your log, I think you set batch=1 (since each iteration only has one image). That might be the key to the problem. If so, set batch=64 and subdivisions=8 in the yolo.cfg file; that should help solve it.
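
The reasoning above can be sketched in Python: if the trailing "N images" figure is the cumulative number of images processed (an assumption, but consistent with the log in the question, where iteration 931 reports 931 images), then dividing it by the iteration number gives the number of images per iteration. The function name `images_per_iteration` is made up for illustration, and this is not part of Darknet.

```python
import re

# Captures the iteration number and the trailing image count from a summary line,
# e.g. "931: ..., 0.089617 seconds, 931 images".
SUMMARY_RE = re.compile(r"^\s*(\d+):.*,\s*(\d+) images\s*$")


def images_per_iteration(summary_line):
    """Return cumulative images divided by iteration, or None if not applicable."""
    match = SUMMARY_RE.match(summary_line)
    if match is None:
        return None
    iteration = int(match.group(1))
    images = int(match.group(2))
    if iteration == 0:
        return None  # avoid dividing by zero on a hypothetical iteration 0
    return images / iteration


if __name__ == "__main__":
    line = ("931: 86426484736.000000, 8648297472.000000 avg, "
            "0.000751 rate, 0.089617 seconds, 931 images")
    print(images_per_iteration(line))
```

A result of 1.0, as for the line from the question, is what suggests the run was using batch=1; a run with batch=64 would be expected to show 64 images per iteration under the same assumption.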

Pattorio on 7 May 2018

👍1

Hi @Pattorio,
I have set batch=64 and subdivisions=8, but it still happens.

khiemntu on 7 May 2018

Hi @khiemntu, I am trying to solve this problem with my model built on darkflow. Have you come up with any solutions since your last post?

kribby on 12 May 2018

Hi @kribby, I don't have a solution; it still happens. Do you have any ideas?

khiemntu on 12 May 2018

@khiemntu I'm afraid I don't - still trying to work it out

kribby on 12 May 2018

@khiemntu I ran into the same issue, but I found that I had made a mistake in the cfg:

[net]
# Testing
batch=1
subdivisions=1
# Training
 batch=64
 subdivisions=16

I changed it to:

[net]
# Testing
# batch=1
# subdivisions=1
# Training
 batch=64
 subdivisions=16

After that it works fine, haha.
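
A mistake like this is easy to miss by eye, so here is a minimal Python sketch (not part of Darknet) that flags keys set more than once inside the same cfg block. It assumes lines starting with `#` or `;` are comments and that each `[name]` line opens a new block; the thread does not say which of two conflicting values Darknet ends up using, so the sketch only reports the duplicates. The name `find_duplicate_keys` is made up for illustration.

```python
from collections import defaultdict


def find_duplicate_keys(cfg_text):
    """Return {(block_index, section, key): [values]} for keys set more than once.

    Blocks are numbered in file order because Darknet cfg files repeat
    section names such as [convolutional] many times.
    """
    seen = defaultdict(list)
    block_index = -1
    section = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":
            continue  # blank line or commented-out setting
        if line.startswith("[") and line.endswith("]"):
            block_index += 1
            section = line[1:-1]
            continue
        if "=" in line and section is not None:
            key, value = line.split("=", 1)
            seen[(block_index, section, key.strip())].append(value.strip())
    return {k: v for k, v in seen.items() if len(v) > 1}


if __name__ == "__main__":
    before = """[net]
# Testing
batch=1
subdivisions=1
# Training
 batch=64
 subdivisions=16
"""
    for (index, section, key), values in find_duplicate_keys(before).items():
        print(f"block {index} [{section}]: {key} is set {len(values)} times: {values}")
```

Run against the first cfg above, it reports both `batch` and `subdivisions` as set twice in `[net]`; against the corrected cfg it reports nothing.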

XiaXuehai on 31 May 2018

👍6

I am also getting the same error. Can anybody advise how to resolve it?

Richard-Bebin on 30 Aug 2019

I had this happen when I tried to set Channels = 1.

unnamed7 on 27 May 2020
