Yolov3: 'nan' Train Problem

Created on 16 Apr 2019 · 10 Comments · Source: ultralytics/yolov3

Describe the bug
The training loss is not what I expected.
I don't know what happened; maybe my command is not correct.
Please give me some suggestions.

   0/272     18/7328       nan       nan       nan       nan       nan       116     0.273
   0/272     19/7328       nan       nan       nan       nan       nan       119     0.273
   0/272     20/7328       nan       nan       nan       nan       nan        87     0.273
   0/272     21/7328       nan       nan       nan       nan       nan       101     0.273
   0/272     22/7328       nan       nan       nan       nan       nan        96     0.273
   0/272     23/7328       nan       nan       nan       nan       nan       124     0.273
   0/272     24/7328       nan       nan       nan       nan       nan        89     0.273
   0/272     25/7328       nan       nan       nan       nan       nan       152     0.273

To Reproduce
python train.py --cfg cfg/yolov3.cfg

Expected behavior
A normal training process in which the loss values are finite numbers rather than nan.

Desktop (please complete the following information):

  • Python 3.6
  • PyTorch 1.0
  • RTX 2080 Ti

Additional context
Thank you for the suggestion

bug

Most helpful comment

Hi @JohnpaulCheng, in my experience this can be caused by different factors, especially if you have small objects in your dataset (smaller than 16x16 after resizing):

  • You have small objects in your dataset; maybe you labeled a 1x1 zone by mistake.
  • It can be related to mixed precision training, again if you have small objects in your dataset. Do you use mixed_precision=True in train.py?
    Try yolov3-5l or yolov3-spp. Also try setting mixed_precision=False if you are using it and see if the error disappears. See https://github.com/AlexeyAB/darknet/issues/2783

Another note: some RTX 2080 cards have problems with Micron memory and overheating, which could also be the cause (see https://github.com/AlexeyAB/darknet/issues/2309 and https://github.com/AlexeyAB/darknet/issues/2789)
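One general reason mixed precision can produce inf or nan is the narrow range of 16-bit floats. The sketch below illustrates that range with the standard library only. The helper name to_half is made up for this example, and nothing in this thread establishes that this is what happened in the reported run.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE 754 half precision and back.

    struct's 'e' format raises OverflowError for values beyond the half
    range; a float16 array would hold inf there, so inf is returned.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.inf if x > 0 else -math.inf


largest = to_half(65504.0)   # largest finite half-precision value
too_big = to_half(70000.0)   # beyond the half range
too_small = to_half(1e-8)    # below the smallest half subnormal

print(largest, too_big, too_small)
print(too_big - too_big)     # inf - inf is nan
```

Once a single intermediate value overflows to inf, later arithmetic such as inf - inf turns it into nan, which then spreads to every loss term that depends on it.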

All 10 comments

When I increase the batch_size, I get the same output.
I also tried yolov3-tiny, and there the loss output is normal.

@JohnpaulCheng train.py is currently operating as expected. If you have environment issues try the GCP Quickstart:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart

sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3
cd yolov3
python3 train.py
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
...
  Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328      2.04      1.34       130      13.1       146        75      5.23
   0/272      1/7328      2.16      1.35       130      13.2       146       132     0.246
   0/272      2/7328       2.1      1.32       130      13.2       146       130     0.238
   0/272      3/7328      2.13      1.35       130      13.2       146        85     0.241
   0/272      4/7328      2.15      1.34       130      13.3       147       101     0.243
   0/272      5/7328      2.14      1.32       130      13.2       147       141     0.243
   0/272      6/7328      2.12      1.31       130      13.3       146       100     0.248
   0/272      7/7328      2.13      1.33       130      13.2       147        84     0.242
   0/272      8/7328      2.12      1.31       130      13.2       146        88      0.24
   0/272      9/7328      2.11       1.3       130      13.2       146       117     0.246
   0/272     10/7328      2.12       1.3       130      13.2       146       106     0.241
   0/272     11/7328      2.12      1.31       130      13.2       146        99     0.238
   0/272     12/7328      2.13      1.31       130      13.3       146       108     0.237
   0/272     13/7328      2.13       1.3       130      13.3       146        69     0.237
   0/272     14/7328      2.12      1.29       130      13.3       146       139     0.237
   0/272     15/7328      2.12       1.3       130      13.3       146       116     0.237
   0/272     16/7328      2.12       1.3       130      13.3       146       135     0.237
   0/272     17/7328      2.12      1.29       130      13.2       146       116      2.01
   0/272     18/7328      2.13       1.3       130      13.2       146       139     0.314
   0/272     19/7328      2.14      1.31       130      13.2       147        95     0.251
   0/272     20/7328      2.14       1.3       130      13.2       146       150     0.242
   0/272     21/7328      2.13       1.3       130      13.2       146        96     0.241
   0/272     22/7328      2.13       1.3       130      13.2       146       150     0.241
   0/272     23/7328      2.13      1.29       130      13.2       146       186     0.241
   0/272     24/7328      2.13      1.29       130      13.2       146       187     0.241
   0/272     25/7328      2.12      1.29       130      13.2       146        97     0.241
   0/272     26/7328      2.12      1.29       130      13.2       146       130     0.243
   0/272     27/7328      2.12      1.29       130      13.2       146        91     0.242
   0/272     28/7328      2.12      1.29       130      13.2       146        89     0.245
   0/272     29/7328      2.11      1.29       130      13.2       146       156     0.242
   0/272     30/7328      2.11      1.29       130      13.2       146       113     0.241
   0/272     31/7328      2.12       1.3       130      13.2       146       175     0.247
   0/272     32/7328      2.11       1.3       130      13.2       146       125     0.242
   0/272     33/7328      2.11       1.3       130      13.2       146       136     0.244
...

Thanks for the first reply @glenn-jocher. I tried again, but the problem still happens on my own PC. Here are the details of my setup; maybe one of my settings or parameters is wrong. When I stepped through in the debugger, I found that the model output (the input variable p of the function compute_loss) contains 'inf', so the loss is nan. Is this output normal when using the pre-trained weights darknet53.conv.74?

python train.py --cfg cfg/yolov3.cfg                                               
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10981MB)
           device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10989MB)

layer                                     name  gradient   parameters                shape         mu      sigma
    0                   module.0.conv_0.weight      True          864        [32, 3, 3, 3]   -0.00339     0.0648
    1             module.0.batch_norm_0.weight      True           32                 [32]      0.987       1.07
    2               module.0.batch_norm_0.bias      True           32                 [32]     -0.698       2.07
   ...
   ...
  221                 module.105.conv_105.bias      True          255                [255]   -0.00154      0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
Gtk-Message: 20:13:14.545: Failed to load module "atk-bridge"
Gtk-Message: 20:13:14.547: Failed to load module "canberra-gtk-module"
   0/272      0/7328      9.76       inf       nan       nan       nan       119      3.61
   0/272      1/7328      10.6       nan       nan       nan       nan        99     0.385
   0/272      2/7328      10.6       nan       nan       nan       nan        93     0.324
   0/272      3/7328      10.5       nan       nan       nan       nan       122     0.323
   0/272      4/7328      10.8       nan       nan       nan       nan        73     0.332
   0/272      5/7328      10.7       nan       nan       nan       nan        85     0.327
   0/272      6/7328      10.8       nan       nan       nan       nan        99     0.331
   0/272      7/7328      10.8       nan       nan       nan       nan        99     0.317
   0/272      8/7328      10.9       nan       nan       nan       nan       106     0.389
   0/272      9/7328        11       nan       nan       nan       nan        64     0.321
   0/272     10/7328      11.1       nan       nan       nan       nan       116     0.329
   0/272     11/7328      11.1       nan       nan       nan       nan       133     0.323
   0/272     12/7328      11.1       nan       nan       nan       nan       117     0.377
   0/272     13/7328      11.1       nan       nan       nan       nan       143      0.32
   0/272     14/7328      11.2       nan       nan       nan       nan       140      0.33
   0/272     15/7328      11.1       nan       nan       nan       nan       121     0.385
   0/272     16/7328        11       nan       nan       nan       nan       118     0.323
   0/272     17/7328        11       nan       nan       nan       nan       117     0.396
   0/272     18/7328        11       nan       nan       nan       nan        76     0.322
   0/272     19/7328        11       nan       nan       nan       nan       115     0.328
   0/272     20/7328        11       nan       nan       nan       nan        96      0.32
   0/272     21/7328      10.9       nan       nan       nan       nan        99     0.327
   0/272     22/7328      10.9       nan       nan       nan       nan       201     0.322
   0/272     23/7328      10.9       nan       nan       nan       nan        95     0.357
   0/272     24/7328        11       nan       nan       nan       nan        79     0.328
   0/272     25/7328        11       nan       nan       nan       nan        85     0.392
   0/272     26/7328        11       nan       nan       nan       nan       203     0.328
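To see where the loss first goes wrong in a log like the one above, a small parser can report the first batch and column holding a non-finite value. This is a minimal sketch that assumes the nine-column progress rows shown in this thread; the function name first_nonfinite is made up for the example.

```python
import math

# Loss columns in the order they appear after the Epoch and Batch fields.
COLUMNS = ["xy", "wh", "conf", "cls", "total"]


def first_nonfinite(log_lines):
    """Return (batch, column) for the first nan/inf loss value, or None."""
    for line in log_lines:
        parts = line.split()
        # Progress rows: epoch/total batch/total xy wh conf cls total nTargets time
        if len(parts) != 9 or "/" not in parts[0] or "/" not in parts[1]:
            continue  # header rows, Gtk messages and other noise
        for name, text in zip(COLUMNS, parts[2:7]):
            try:
                value = float(text)  # float() accepts "nan" and "inf"
            except ValueError:
                break
            if not math.isfinite(value):
                return parts[1], name
    return None


log = [
    "   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time",
    'Gtk-Message: 20:13:14.545: Failed to load module "atk-bridge"',
    "   0/272      0/7328      9.76       inf       nan       nan       nan       119      3.61",
    "   0/272      1/7328      10.6       nan       nan       nan       nan        99     0.385",
]
print(first_nonfinite(log))
```

On the rows copied from the log above, the first non-finite value is in the wh column at batch 0/7328, which suggests the problem is present from the very first batch rather than developing during training.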

@JohnpaulCheng no, it's not normal; I've never seen that. Your problem is one of the following:

  • you inadvertently introduced a bug when modifying the default repository (happens often)
  • your coco download was corrupted (unlikely)
  • your custom data (happens often, but irrelevant here since you are on coco)
  • your environment (likely)

Try to git clone a clean copy of the repo as in my previous post. If this doesn't work the problem is in your environment, and I can't help you with that obviously. Also, I wouldn't bother with yolov3.cfg, yolov3-spp.cfg works better (which is why we set it as the default choice).

I just tried multi-GPU again to verify since I saw you used it. Everything works perfectly if the environment and repo are set up correctly:

sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3 
cd yolov3
python3 train.py

Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
           device1 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)

--2019-04-16 12:47:25--  https://pjreddie.com/media/files/darknet53.conv.74
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 162482580 (155M) [application/octet-stream]
Saving to: ‘weights/darknet53.conv.74’
weights/darknet53.conv.74    100%[=============================================>] 154.96M  64.9MB/s    in 2.4s    
2019-04-16 12:47:27 (64.9 MB/s) - ‘weights/darknet53.conv.74’ saved [162482580/162482580]

layer                                     name  gradient   parameters                shape         mu      sigma
    0                   module.0.conv_0.weight      True          864        [32, 3, 3, 3]   -0.00339     0.0648
    1             module.0.batch_norm_0.weight      True           32                 [32]      0.987       1.07
    2               module.0.batch_norm_0.bias      True           32                 [32]     -0.698       2.07
    3                   module.1.conv_1.weight      True        18432       [64, 32, 3, 3]   0.000298     0.0177
...
  224                 module.112.conv_112.bias      True          255                [255]  -0.000773     0.0356
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328       2.9      1.82       245      38.3       288       119      9.32
   0/272      1/7328      2.95      1.89       245      38.2       288        96     0.671
   0/272      2/7328      2.94      1.89       245      38.1       288        93     0.779
   0/272      3/7328      2.99      1.94       245      38.1       288       123     0.644
   0/272      4/7328      2.99      1.92       245        38       288        73     0.803
   0/272      5/7328      2.97      1.94       245        38       288        83     0.696
   0/272      6/7328      2.92      1.95       245        38       288        97       0.6
   0/272      7/7328      2.89      1.94       245        38       288        98     0.629
   0/272      8/7328      2.88      1.94       245        38       288       106     0.784
   0/272      9/7328      2.89      1.92       245        38       288        69     0.778
   0/272     10/7328      2.88       1.9       245      37.9       288       116     0.779
   0/272     11/7328      2.88      1.88       245      37.9       287       134     0.705
   0/272     12/7328      2.87      1.87       245      37.9       287       116     0.686
   0/272     13/7328      2.88      1.85       245      37.9       287       139     0.679
   0/272     14/7328      2.86      1.86       245      37.9       287       138      0.66
   0/272     15/7328      2.87      1.87       245      37.9       287       114     0.707
   0/272     16/7328      2.88      1.85       245      37.9       287       111     0.781
   0/272     17/7328      2.86      1.84       245      37.9       287       114     0.772
   0/272     18/7328      2.88      1.85       245      37.9       287        79     0.571
   0/272     19/7328      2.87      1.84       245      37.9       287       114     0.776
   0/272     20/7328      2.86      1.84       245      37.9       287        96     0.642
   0/272     21/7328      2.87      1.84       245      37.9       287        99     0.636
   0/272     22/7328      2.86      1.84       245      37.9       287       201     0.692
   0/272     23/7328      2.86      1.84       245      37.9       287        96     0.621
   0/272     24/7328      2.85      1.84       245      37.9       287        80       0.6
...

@glenn-jocher Thank you for your reply. I will try it again.

@glenn-jocher @drapado By chance I commented out the pre-trained model loading, and the training loss became finite. I will train for a few hours to confirm whether it converges. Thank you for your help!

Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328      3.14      2.13       260      38.3       304        72      3.81
   0/272      1/7328      2.94      2.04       260      38.5       303       130     0.276
   0/272      2/7328      2.89      2.02       260      38.2       303       130     0.271
   0/272      3/7328      2.91      2.06       260      38.2       303        83     0.271
   0/272      4/7328      2.89      2.05       260      38.2       303       100     0.271
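Since commenting out the pre-trained weight loading removed the nan values, one thing worth ruling out is a damaged weights/darknet53.conv.74 file. The sketch below compares the file size with the 162482580 bytes that wget reported earlier in this thread and counts non-finite float32 values. The function name inspect_weights and the 20-byte header (three int32 version fields plus one int64 counter) are assumptions, and this is not the repository's loader.

```python
import math
import os
import struct

EXPECTED_BYTES = 162482580  # size reported by wget earlier in this thread


def inspect_weights(path, header_bytes=20, chunk_floats=1 << 16):
    """Return (file_size, n_values, n_nonfinite) for a darknet-style file.

    header_bytes=20 is an assumption; adjust it if your file differs.
    Values are read as little-endian float32.
    """
    size = os.path.getsize(path)
    n_values = 0
    n_nonfinite = 0
    with open(path, "rb") as f:
        f.seek(header_bytes)
        while True:
            buf = f.read(4 * chunk_floats)
            usable = len(buf) - len(buf) % 4  # ignore a trailing partial float
            if usable == 0:
                break
            values = struct.unpack("<%df" % (usable // 4), buf[:usable])
            n_values += len(values)
            n_nonfinite += sum(1 for v in values if not math.isfinite(v))
    return size, n_values, n_nonfinite


if __name__ == "__main__" and os.path.exists("weights/darknet53.conv.74"):
    size, n_values, n_bad = inspect_weights("weights/darknet53.conv.74")
    print("size matches wget log:", size == EXPECTED_BYTES)
    print("values:", n_values, "non-finite:", n_bad)
```

A size mismatch or any non-finite value would point at the download rather than the training code. A clean result does not prove the file is correct, only that these two simple checks pass.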

@drapado, can you tell me how small an object has to be to cause nan?
Do only objects of size 1×1 cause the nan error?
I recently got a nan error.
Following your advice, I have already deleted the images that contain 1×1 objects.
I will try training tomorrow to see the impact.
Should I delete more images that contain small objects?
Also, @glenn-jocher, do you know the relationship between tiny objects and the nan error?

@chouxianyu @drapado labels below 5 pixels at --img-size in either dimension are rejected from training automatically:
https://github.com/ultralytics/yolov3/blob/84371f68117cae975eabfa78cdf8a2aa1b78e4ba/utils/datasets.py#L696

and boxes <=2 pixels are rejected during NMS as well:
https://github.com/ultralytics/yolov3/blob/84371f68117cae975eabfa78cdf8a2aa1b78e4ba/utils/utils.py#L513-L515
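To get a rough idea of which labels fall under such a pixel threshold before training, the label text can be scanned directly. This is a minimal sketch, not the repository's code: it assumes one 'class x_center y_center width height' row per line with coordinates normalized to [0, 1], it ignores any padding applied during resizing, and the name tiny_labels and its defaults are chosen for this example.

```python
def tiny_labels(label_text, img_size=416, min_pixels=5):
    """Yield (line_number, width_px, height_px) for boxes under min_pixels.

    Box sizes are scaled by img_size, which approximates the size the
    box would have at the training resolution.
    """
    for number, line in enumerate(label_text.splitlines(), start=1):
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed rows
        width_px = float(fields[3]) * img_size
        height_px = float(fields[4]) * img_size
        if width_px < min_pixels or height_px < min_pixels:
            yield number, width_px, height_px


sample = "0 0.5 0.5 0.2 0.3\n1 0.1 0.1 0.002 0.002\n"
for number, w, h in tiny_labels(sample):
    print("line %d: %.2f x %.2f px" % (number, w, h))
```

In the sample, only the second row is reported, since 0.002 of a 416-pixel image is less than one pixel. Running this over each label file lists the boxes to inspect by hand.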

@glenn-jocher OK, thanks for your reply!
