Darknet: yolov3: "Avg IOU: -nan, Class: -nan, Obj: -nan" while Retraining on voc data to detect custom class along with existing 20 classes

Created on 2 Apr 2018  ·  20Comments  ·  Source: pjreddie/darknet

I am retraining yolov3 to detect custom class along with 20 existing classes using VOc 2007 and 2012 dataset. I followed instructions on

"https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects"

Training is happening on GPU,

but intermittently I am getting few IOU, Class, Object as -nan.

Snapshot is as below

Region 82 Avg IOU: 0.236180, Class: 0.205503, Obj: 0.004179, No Obj: 0.009506, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: 0.150655, Class: 0.315263, Obj: 0.005658, No Obj: 0.004314, .5R: 0.000000, .75R: 0.000000, count: 4
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.001968, .5R: -nan, .75R: -nan, count: 0
279: 33.403381, 14.190192 avg, 0.000006 rate, 29.993717 seconds, 279 images
Loaded: 0.000050 seconds

I already done changes to obj.names obj.data as mentioned. My doubt is whether its okay to get -nan in training or I am missing out something.

Also how should I debug the root cause if -nan are not welcome

Most helpful comment

I am also training VOC dataset and have same problem, it's fine for first 855 pictures.
anomaly

All 20 comments

Just do the training for atleat 5k iterations , these nans will decrease . They generally come when subdivision batch has zero annotation (see count=0) . It can be because the anchors are big or object dimensions are too small.

When I have finished retraining, I only get below results
**

Loading weights from yolo-obj_800.weights ...Done!
data/dog.jpg: Predicted in 21.787142 seconds.

**

But not like below, with detection probability and also Prediction Boxes are missing too.

> gaurav@Desktop-192:~/yolov3/darknet$
> Loading weights from yolov3.weights...Done!
> data/dog.jpg: Predicted in 21.787142 seconds.
> dog: 99%
> truck: 92%
> bicycle: 99%

gaurav@Desktop-192:~/yolov3/darknet$

change batch and subdivisions to 1 in the cfg and try
./darknet detector test your_data.data your_cfg.cfg your_trained_model.weights data/dog.jpg

It didnt help.Attaching snapshot

105 conv 78 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 78 0.108 BFLOPs
106 detection
Loading weights from backup/yolo-obj_800.weights...Done!
data/dog.jpg: Predicted in 21.481505 seconds.
gaurav@Desktop-192:~/yolov3/darknet$

It doesnt add bounding box in prediction.png

Also Training keeps showing NAN but initial iteration does had valid non NAN values

1968: nan, -nan avg, 0.001000 rate, 51.805366 seconds, 1968 images
Loaded: 0.000051 seconds
Region 82 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 94 Avg IOU: -nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: nan, .5R: -nan, .75R: -nan, count: 0

./darknet detector train cfg/voc.data cfg/yolov3-voc.cfg darknet53.conv.74
5642: -nan, nan avg, 0.001000 rate, 0.117762 seconds, 5642 images
Loaded: 0.000021 seconds
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0

default config(batch=1, subdivisions=1) is just for test, uncomment Training batch=64 subdivisions=8 to train. correct me if i'm wrong

I with setting mentioned above only I am unable to detect the object. it just shows till
105 conv 78 1 x 1 / 1 52 x 52 x 256 -> 52 x 52 x 78 0.108 BFLOPs
106 detection
Loading weights from backup/yolo-obj_800.weights...Done!

there is no bounding box or below kinda
dog: 99%
truck: 92%
bicycle: 99%

results shown

if you are not getting annotated image at the end , the problem is with the opencv version. Which OPencv version do you have . Try to install 3.4.1 and re-test

@ztilottama I got the same question. do you fix it and how?

Same problem. I'm training on VOC 2012 like it described here: https://pjreddie.com/darknet/yolo/
At first I'm getting something that shows that data and labels are loaded correctly and network is training.
Region 82 Avg IOU: 0.104098, Class: 0.433959, Obj: 0.271554, No Obj: 0.476763, .5R: 0.000000, .75R: 0.000000, count: 2
Then, after ~1000 of iterations (with batch of 1 image) I'm getting only
Region 82 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 5
Notice that count is non-zero (obviously IOU is nan if count=0).
Could that be that training is very unstable with such small batch size? I can't test with bigger batch, cause that's all that my video card can load.

Tested on different PC - yes, that was the source. I think issue can be closed: just use bigger batch.
But I have another problem now, also not answered, here: https://github.com/pjreddie/darknet/issues/665

@justanotheruser : What is the root cause, did you figure it out? Can you provide me the solution which solved this problem. What was the batch size you have used

@ztilottama : I used batch=64 and subdivisions=16, but it can be specific to your task. With single image SGD was diverging.

I am also training VOC dataset and have same problem, it's fine for first 855 pictures.
anomaly

@axzxc1236 it's seems that your loss is incredible high

Note: If during training you see nan values for avg (loss) field - then training goes wrong, but if nan is in some other lines - then training goes well.
https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

Same problem. I have changed batch and subdivisions to 1 in the cfg and try
./darknet detector test cfg/coco.data cfg/yolov3.cfg yolov3.weights data/00001.jpg

When I have finished retraining, I only get below results:
*
Loading weights from yolov3.weights...Done!
data/00001.jpg: Predicted in 0.035034 seconds.
*

No detection probability and also Prediction Boxes are missing too

I got the same question. do you fix ??

https://medium.com/@manivannan_data/how-to-train-yolov3-to-detect-custom-objects-ccbcafeb13d2
I had the same problem, and this post solved it. we need to change some more filters and batch size to 24.

Line 3: set batch=24, this means we will be using 24 images for every training step
    Line 4: set subdivisions=8, the batch will be divided by 8 to decrease GPU VRAM requirements.
    Line 603: set filters=(classes + 5)*3 in our case filters=21
    Line 610: set classes=2, the number of categories we want to detect
    Line 689: set filters=(classes + 5)*3 in our case filters=21
    Line 696: set classes=2, the number of categories we want to detect
    Line 776: set filters=(classes + 5)*3 in our case filters=21
    Line 783: set classes=2, the number of categories we want to detect

like this (when classes =2)

Just do the training for atleat 5k iterations , these nans will decrease . They generally come when subdivision batch has zero annotation (see count=0) . It can be because the anchors are big or object dimensions are too small.

Hi, can you please specify what "anchors" refer to? thnx!

Was this page helpful?
0 / 5 - 0 ratings

Related issues

ghost picture ghost  ·  4Comments

cadip92 picture cadip92  ·  3Comments

gpsmit picture gpsmit  ·  3Comments

HoracceFeng picture HoracceFeng  ·  3Comments

MaverickLoneshark picture MaverickLoneshark  ·  3Comments