Hi @AlexeyAB
I am training a model on my own dataset using your repository. My dataset has 2 classes, with 165 images per class.
I labeled the dataset with the OpenLabeling tool.
I use batch=16 and subdivisions=8 because of my graphics card; I have a laptop with a GTX 1060.
I am seeing -nan in some lines of the training output.
Why do these -nan values appear, and what should I do? I followed every step.
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013198, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006256, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.355292, Class: 0.411376, Obj: 0.029451, No Obj: 0.029367, .5R: 0.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013234, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006058, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.511116, Class: 0.377528, Obj: 0.046407, No Obj: 0.029295, .5R: 0.333333, .75R: 0.333333, count: 3
Region 94 Avg IOU: 0.281963, Class: 0.534199, Obj: 0.008859, No Obj: 0.013083, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006088, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.370150, Class: 0.434698, Obj: 0.026954, No Obj: 0.028650, .5R: 0.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013133, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006082, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.406275, Class: 0.392887, Obj: 0.033580, No Obj: 0.029484, .5R: 0.333333, .75R: 0.000000, count: 3
Region 94 Avg IOU: 0.285474, Class: 0.520197, Obj: 0.012111, No Obj: 0.013386, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006105, .5R: -nan, .75R: -nan, count: 0
231: 2.865283, 2.211494 avg loss, 0.000003 rate, 2.294709 seconds, 3696 images
Loaded: 0.000018 seconds
Region 82 Avg IOU: 0.613693, Class: 0.401125, Obj: 0.032530, No Obj: 0.028400, .5R: 1.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.012739, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.005937, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.157492, Class: 0.394678, Obj: 0.029851, No Obj: 0.029231, .5R: 0.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.012971, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.005962, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.521762, Class: 0.290676, Obj: 0.046083, No Obj: 0.028440, .5R: 0.500000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.012896, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006019, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.591077, Class: 0.390669, Obj: 0.040413, No Obj: 0.029186, .5R: 0.500000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013340, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006082, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.511808, Class: 0.591287, Obj: 0.026345, No Obj: 0.028927, .5R: 0.500000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013040, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006123, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.562955, Class: 0.375633, Obj: 0.035671, No Obj: 0.028555, .5R: 0.500000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013058, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.005932, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.541707, Class: 0.595982, Obj: 0.035991, No Obj: 0.028347, .5R: 0.500000, .75R: 0.000000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.012863, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.006033, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: 0.561158, Class: 0.478616, Obj: 0.032940, No Obj: 0.028051, .5R: 0.500000, .75R: 0.500000, count: 2
Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.012853, .5R: -nan, .75R: -nan, count: 0
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.005970, .5R: -nan, .75R: -nan, count: 0
Lines with count: 0 will show nan. That is normal.
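To illustrate why count: 0 leads to nan, here is a minimal sketch. It assumes each per-region statistic is printed as an accumulated sum divided by count, using IEEE 754 floating-point division as C does. The names `ieee_div` and `region_average` and the sum 0.710584 are hypothetical, chosen only for illustration. Python raises ZeroDivisionError on float division by zero, so the C behavior is emulated here.

```python
import math


def ieee_div(numerator: float, denominator: float) -> float:
    """Emulate IEEE 754 float division as done in C (assumes a positive-zero divisor)."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        if numerator == 0 or math.isnan(numerator):
            return math.nan  # 0/0 is undefined, so the result is NaN
        return math.copysign(math.inf, numerator)  # x/0 gives +inf or -inf


def region_average(stat_sum: float, count: int) -> float:
    """Average of a statistic over the objects assigned to one detection layer.

    Assumed formula: sum / count. With count == 0 the sum is also 0.
    """
    return ieee_div(stat_sum, float(count))


# No ground-truth objects were assigned to this layer: 0/0 -> nan
print(region_average(0.0, 0))
# Two objects assigned: an ordinary average
print(region_average(0.710584, 2))
```

Under this assumption, a nan on a count: 0 line only means that the layer had nothing to average in that batch. It says nothing about the loss.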
I am training on the COCO dataset, but after a while I get nan everywhere, even in lines where count is not 0, and it stays nan from then on, so I have to cancel the run.
Does anybody know why this happens? Is it because I trained for too many iterations?
Tensor Cores are used.
121592: -nan, -nan avg loss, 0.001000 rate, 0.401613 seconds, 972736 images
Loaded: 0.000026 seconds
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 2
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 4
Region 106 Avg IOU: nan, Class: 0.000000, Obj: 0.000000, No Obj: 0.000000, .5R: 0.000000, .75R: 0.000000, count: 3
Region 82 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 3
Region 94 Avg IOU: nan, Class: nan, Obj: nan, No Obj: nan, .5R: 0.000000, .75R: 0.000000, count: 1
Region 106 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.000000, .5R: -nan, .75R: -nan, count: 0

@PROGRAMMINGENGINEER-NIKI Hi,
231: 2.865283, 2.211494 avg loss, 0.000003 rate, 2.294709 seconds, 3696 images
Loaded: 0.000018 seconds
All is good. https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects
Note: if you see nan values in the avg (loss) field during training, then training has gone wrong; if nan appears only in other lines, training is going well.
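The rule above can be checked automatically. The following is a rough sketch, not part of darknet: the helper names `parse_status` and `training_diverged` are hypothetical, and the regular expression assumes status lines have exactly the shape of the ones quoted in this thread.

```python
import math
import re
from typing import NamedTuple, Optional

# Matches status lines such as:
# "231: 2.865283, 2.211494 avg loss, 0.000003 rate, 2.294709 seconds, 3696 images"
STATUS_RE = re.compile(
    r"^\s*(\d+):\s*([^,\s]+),\s*([^,\s]+) avg loss,\s*([^,\s]+) rate,"
    r"\s*([^,\s]+) seconds,\s*(\d+) images"
)


class Status(NamedTuple):
    iteration: int
    loss: float
    avg_loss: float
    images: int


def parse_status(line: str) -> Optional[Status]:
    """Parse one status line; return None for Region lines and anything else."""
    m = STATUS_RE.match(line)
    if m is None:
        return None
    # float() accepts both "nan" and "-nan"
    return Status(int(m.group(1)), float(m.group(2)), float(m.group(3)), int(m.group(6)))


def training_diverged(line: str) -> bool:
    """True only when the avg loss field of a status line is nan."""
    status = parse_status(line)
    return status is not None and math.isnan(status.avg_loss)


good = "231: 2.865283, 2.211494 avg loss, 0.000003 rate, 2.294709 seconds, 3696 images"
bad = "121592: -nan, -nan avg loss, 0.001000 rate, 0.401613 seconds, 972736 images"
region = "Region 94 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.013198, count: 0"

for line in (good, bad, region):
    print(training_diverged(line))
```

Only the second line is flagged. The Region line contains nan but is ignored, which matches the note above. As a side observation, 3696 images at iteration 231 works out to 16 images per iteration, which matches batch=16 in the first log.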
@i-chaochen
Try disabling Tensor Cores: set CUDNN_HALF=0 in the Makefile and run make again.
Tensor Cores are used.
121592: -nan, -nan avg loss, 0.001000 rate, 0.401613 seconds, 972736 images
Loaded: 0.000026 seconds