Describe the bug
The training loss is not expected
I don't know what happened, may my command is not ture.
Please give me some suggestion
0/272 18/7328 nan nan nan nan nan 116 0.273
0/272 19/7328 nan nan nan nan nan 119 0.273
0/272 20/7328 nan nan nan nan nan 87 0.273
0/272 21/7328 nan nan nan nan nan 101 0.273
0/272 22/7328 nan nan nan nan nan 96 0.273
0/272 23/7328 nan nan nan nan nan 124 0.273
0/272 24/7328 nan nan nan nan nan 89 0.273
0/272 25/7328 nan nan nan nan nan 152 0.273
To Reproduce
python train --cfg cfg/yolov3.cfg
Expected behavior
Simple training process, the loss is int maybe
Desktop (please complete the following information):
Additional context
Thank you for the suggestion
When I increase the batch_size, the same output is coming.
I try the yolov3-tiny. the output loss is normal
@JohnpaulCheng train.py is currently operating as expected. If you have environment issues try the GCP Quickstart:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart
sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3
python3 train.py
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)
Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
...
Epoch Batch xy wh conf cls total nTargets time
0/272 0/7328 2.04 1.34 130 13.1 146 75 5.23
0/272 1/7328 2.16 1.35 130 13.2 146 132 0.246
0/272 2/7328 2.1 1.32 130 13.2 146 130 0.238
0/272 3/7328 2.13 1.35 130 13.2 146 85 0.241
0/272 4/7328 2.15 1.34 130 13.3 147 101 0.243
0/272 5/7328 2.14 1.32 130 13.2 147 141 0.243
0/272 6/7328 2.12 1.31 130 13.3 146 100 0.248
0/272 7/7328 2.13 1.33 130 13.2 147 84 0.242
0/272 8/7328 2.12 1.31 130 13.2 146 88 0.24
0/272 9/7328 2.11 1.3 130 13.2 146 117 0.246
0/272 10/7328 2.12 1.3 130 13.2 146 106 0.241
0/272 11/7328 2.12 1.31 130 13.2 146 99 0.238
0/272 12/7328 2.13 1.31 130 13.3 146 108 0.237
0/272 13/7328 2.13 1.3 130 13.3 146 69 0.237
0/272 14/7328 2.12 1.29 130 13.3 146 139 0.237
0/272 15/7328 2.12 1.3 130 13.3 146 116 0.237
0/272 16/7328 2.12 1.3 130 13.3 146 135 0.237
0/272 17/7328 2.12 1.29 130 13.2 146 116 2.01
0/272 18/7328 2.13 1.3 130 13.2 146 139 0.314
0/272 19/7328 2.14 1.31 130 13.2 147 95 0.251
0/272 20/7328 2.14 1.3 130 13.2 146 150 0.242
0/272 21/7328 2.13 1.3 130 13.2 146 96 0.241
0/272 22/7328 2.13 1.3 130 13.2 146 150 0.241
0/272 23/7328 2.13 1.29 130 13.2 146 186 0.241
0/272 24/7328 2.13 1.29 130 13.2 146 187 0.241
0/272 25/7328 2.12 1.29 130 13.2 146 97 0.241
0/272 26/7328 2.12 1.29 130 13.2 146 130 0.243
0/272 27/7328 2.12 1.29 130 13.2 146 91 0.242
0/272 28/7328 2.12 1.29 130 13.2 146 89 0.245
0/272 29/7328 2.11 1.29 130 13.2 146 156 0.242
0/272 30/7328 2.11 1.29 130 13.2 146 113 0.241
0/272 31/7328 2.12 1.3 130 13.2 146 175 0.247
0/272 32/7328 2.11 1.3 130 13.2 146 125 0.242
0/272 33/7328 2.11 1.3 130 13.2 146 136 0.244
...
Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):
Another note, some RTX 2080 have problems with micron memory and overheating, it can be caused by that (see https://github.com/AlexeyAB/darknet/issues/2309 and https://github.com/AlexeyAB/darknet/issues/2789)
Thanks for the first reply @glenn-jocher. I try it again but the problem is still happen on my own PC, Here is the detail of the setting. Maybe my setting or parameter is wrong. I find the output of model is 'inf' which is the input variable p of function compute_loss when I step into debug. So the loss is nan. Is it normal output when using the pre-trained weight darknet53.conv.74 ?
python train.py --cfg cfg/yolov3.cfg
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)
Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10981MB)
device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10989MB)
layer name gradient parameters shape mu sigma
0 module.0.conv_0.weight True 864 [32, 3, 3, 3] -0.00339 0.0648
1 module.0.batch_norm_0.weight True 32 [32] 0.987 1.07
2 module.0.batch_norm_0.bias True 32 [32] -0.698 2.07
...
...
221 module.105.conv_105.bias True 255 [255] -0.00154 0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients
Epoch Batch xy wh conf cls total nTargets time
Gtk-Message: 20:13:14.545: Failed to load module "atk-bridge"
Gtk-Message: 20:13:14.547: Failed to load module "canberra-gtk-module"
0/272 0/7328 9.76 inf nan nan nan 119 3.61
0/272 1/7328 10.6 nan nan nan nan 99 0.385
0/272 2/7328 10.6 nan nan nan nan 93 0.324
0/272 3/7328 10.5 nan nan nan nan 122 0.323
0/272 4/7328 10.8 nan nan nan nan 73 0.332
0/272 5/7328 10.7 nan nan nan nan 85 0.327
0/272 6/7328 10.8 nan nan nan nan 99 0.331
0/272 7/7328 10.8 nan nan nan nan 99 0.317
0/272 8/7328 10.9 nan nan nan nan 106 0.389
0/272 9/7328 11 nan nan nan nan 64 0.321
0/272 10/7328 11.1 nan nan nan nan 116 0.329
0/272 11/7328 11.1 nan nan nan nan 133 0.323
0/272 12/7328 11.1 nan nan nan nan 117 0.377
0/272 13/7328 11.1 nan nan nan nan 143 0.32
0/272 14/7328 11.2 nan nan nan nan 140 0.33
0/272 15/7328 11.1 nan nan nan nan 121 0.385
0/272 16/7328 11 nan nan nan nan 118 0.323
0/272 17/7328 11 nan nan nan nan 117 0.396
0/272 18/7328 11 nan nan nan nan 76 0.322
0/272 19/7328 11 nan nan nan nan 115 0.328
0/272 20/7328 11 nan nan nan nan 96 0.32
0/272 21/7328 10.9 nan nan nan nan 99 0.327
0/272 22/7328 10.9 nan nan nan nan 201 0.322
0/272 23/7328 10.9 nan nan nan nan 95 0.357
0/272 24/7328 11 nan nan nan nan 79 0.328
0/272 25/7328 11 nan nan nan nan 85 0.392
0/272 26/7328 11 nan nan nan nan 203 0.328
@JohnpaulCheng no its not normal, I've never seen that. Your problem is either:
Try to git clone a clean copy of the repo as in my previous post. If this doesn't work the problem is in your environment, and I can't help you with that obviously. Also, I wouldn't bother with yolov3.cfg, yolov3-spp.cfg works better (which is why we set it as the default choice).
I just tried multi-GPU again to verify since I saw you used it. Everything works perfectly if the environment and repo are set up correctly:
sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3
cd yolov3
python3 train.py
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)
Using CUDA device0 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
device1 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
--2019-04-16 12:47:25-- https://pjreddie.com/media/files/darknet53.conv.74
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 162482580 (155M) [application/octet-stream]
Saving to: ‘weights/darknet53.conv.74’
weights/darknet53.conv.74 100%[=============================================>] 154.96M 64.9MB/s in 2.4s
2019-04-16 12:47:27 (64.9 MB/s) - ‘weights/darknet53.conv.74’ saved [162482580/162482580]
layer name gradient parameters shape mu sigma
0 module.0.conv_0.weight True 864 [32, 3, 3, 3] -0.00339 0.0648
1 module.0.batch_norm_0.weight True 32 [32] 0.987 1.07
2 module.0.batch_norm_0.bias True 32 [32] -0.698 2.07
3 module.1.conv_1.weight True 18432 [64, 32, 3, 3] 0.000298 0.0177
...
224 module.112.conv_112.bias True 255 [255] -0.000773 0.0356
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients
Epoch Batch xy wh conf cls total nTargets time
0/272 0/7328 2.9 1.82 245 38.3 288 119 9.32
0/272 1/7328 2.95 1.89 245 38.2 288 96 0.671
0/272 2/7328 2.94 1.89 245 38.1 288 93 0.779
0/272 3/7328 2.99 1.94 245 38.1 288 123 0.644
0/272 4/7328 2.99 1.92 245 38 288 73 0.803
0/272 5/7328 2.97 1.94 245 38 288 83 0.696
0/272 6/7328 2.92 1.95 245 38 288 97 0.6
0/272 7/7328 2.89 1.94 245 38 288 98 0.629
0/272 8/7328 2.88 1.94 245 38 288 106 0.784
0/272 9/7328 2.89 1.92 245 38 288 69 0.778
0/272 10/7328 2.88 1.9 245 37.9 288 116 0.779
0/272 11/7328 2.88 1.88 245 37.9 287 134 0.705
0/272 12/7328 2.87 1.87 245 37.9 287 116 0.686
0/272 13/7328 2.88 1.85 245 37.9 287 139 0.679
0/272 14/7328 2.86 1.86 245 37.9 287 138 0.66
0/272 15/7328 2.87 1.87 245 37.9 287 114 0.707
0/272 16/7328 2.88 1.85 245 37.9 287 111 0.781
0/272 17/7328 2.86 1.84 245 37.9 287 114 0.772
0/272 18/7328 2.88 1.85 245 37.9 287 79 0.571
0/272 19/7328 2.87 1.84 245 37.9 287 114 0.776
0/272 20/7328 2.86 1.84 245 37.9 287 96 0.642
0/272 21/7328 2.87 1.84 245 37.9 287 99 0.636
0/272 22/7328 2.86 1.84 245 37.9 287 201 0.692
0/272 23/7328 2.86 1.84 245 37.9 287 96 0.621
0/272 24/7328 2.85 1.84 245 37.9 287 80 0.6
...
@glenn-jocher Thank you for your reply. I will try it again.
@glenn-jocher @drapado By chance I comment the pre-trained model loading. The training loss becomes int. I will train it for hours and confirm whether it converge or not. Thank you for your help!
Epoch Batch xy wh conf cls total nTargets time
0/272 0/7328 3.14 2.13 260 38.3 304 72 3.81
0/272 1/7328 2.94 2.04 260 38.5 303 130 0.276
0/272 2/7328 2.89 2.02 260 38.2 303 130 0.271
0/272 3/7328 2.91 2.06 260 38.2 303 83 0.271
0/272 4/7328 2.89 2.05 260 38.2 303 100 0.271
Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):
- You have small objects in your dataset, maybe you labeled by error a 1x1 zone.
- Can be related to mixed precision training, again if you have small objects in your dataset. Do you use mixed_precision=True in train.py?
Try with yolov3-5l or yolov3-spp. Also try to set mixed_prectision=False if you are using it and see if the error disappear. See AlexeyAB/darknet#2783Another note, some RTX 2080 have problems with micron memory and overheating, it can be caused by that (see AlexeyAB/darknet#2309 and AlexeyAB/darknet#2789)
@drapado,Can you tell me how small a object is going to cause nan?
Is it only objects with 1×1 size will cause the nan error?
Recently,I got a nan error.
According to your advice,I deleted the images having 1×1 object already.
I will try training tomorrow to see the impact.
Should I detele more images which have small objects?
Also,@glenn-jocher ,do you know the relationship between tiny objects and the nan error?
@chouxianyu @drapado labels below 5 pixels at --img-size in either dimension are rejected from training automatically:
https://github.com/ultralytics/yolov3/blob/84371f68117cae975eabfa78cdf8a2aa1b78e4ba/utils/datasets.py#L696
and boxes <=2 pixels are rejected during NMS as well:
https://github.com/ultralytics/yolov3/blob/84371f68117cae975eabfa78cdf8a2aa1b78e4ba/utils/utils.py#L513-L515
@glenn-jocher OK,Thanks for your reply!
Most helpful comment
Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):
Try with yolov3-5l or yolov3-spp. Also try to set mixed_prectision=False if you are using it and see if the error disappear. See https://github.com/AlexeyAB/darknet/issues/2783
Another note, some RTX 2080 have problems with micron memory and overheating, it can be caused by that (see https://github.com/AlexeyAB/darknet/issues/2309 and https://github.com/AlexeyAB/darknet/issues/2789)