Yolov3: 'nan' Train Problem

Created on 16 Apr 2019 · 10Comments · Source: ultralytics/yolov3

Describe the bug
The training loss is not expected
I don't know what happened, may my command is not ture.
Please give me some suggestion

   0/272     18/7328       nan       nan       nan       nan       nan       116     0.273
   0/272     19/7328       nan       nan       nan       nan       nan       119     0.273
   0/272     20/7328       nan       nan       nan       nan       nan        87     0.273
   0/272     21/7328       nan       nan       nan       nan       nan       101     0.273
   0/272     22/7328       nan       nan       nan       nan       nan        96     0.273
   0/272     23/7328       nan       nan       nan       nan       nan       124     0.273
   0/272     24/7328       nan       nan       nan       nan       nan        89     0.273
   0/272     25/7328       nan       nan       nan       nan       nan       152     0.273

To Reproduce
python train --cfg cfg/yolov3.cfg

Expected behavior
Simple training process, the loss is int maybe

Desktop (please complete the following information):

Python 3.6
- Pytroch 1.0
- RTX 2080ti

Additional context
Thank you for the suggestion

bug

Source

JohnpaulC

Most helpful comment

Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):

You have small objects in your dataset, maybe you labeled by error a 1x1 zone.
Can be related to mixed precision training, again if you have small objects in your dataset. Do you use mixed_precision=True in train.py?
Try with yolov3-5l or yolov3-spp. Also try to set mixed_prectision=False if you are using it and see if the error disappear. See https://github.com/AlexeyAB/darknet/issues/2783

Another note, some RTX 2080 have problems with micron memory and overheating, it can be caused by that (see https://github.com/AlexeyAB/darknet/issues/2309 and https://github.com/AlexeyAB/darknet/issues/2789)

drapado on 16 Apr 2019

👍2

All 10 comments

When I increase the batch_size, the same output is coming.
I try the yolov3-tiny. the output loss is normal

JohnpaulC on 16 Apr 2019

@JohnpaulCheng train.py is currently operating as expected. If you have environment issues try the GCP Quickstart:
https://github.com/ultralytics/yolov3/wiki/GCP-Quickstart

sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3 
python3 train.py
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='Tesla V100-SXM2-16GB', total_memory=16130MB)
...
  Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328      2.04      1.34       130      13.1       146        75      5.23
   0/272      1/7328      2.16      1.35       130      13.2       146       132     0.246
   0/272      2/7328       2.1      1.32       130      13.2       146       130     0.238
   0/272      3/7328      2.13      1.35       130      13.2       146        85     0.241
   0/272      4/7328      2.15      1.34       130      13.3       147       101     0.243
   0/272      5/7328      2.14      1.32       130      13.2       147       141     0.243
   0/272      6/7328      2.12      1.31       130      13.3       146       100     0.248
   0/272      7/7328      2.13      1.33       130      13.2       147        84     0.242
   0/272      8/7328      2.12      1.31       130      13.2       146        88      0.24
   0/272      9/7328      2.11       1.3       130      13.2       146       117     0.246
   0/272     10/7328      2.12       1.3       130      13.2       146       106     0.241
   0/272     11/7328      2.12      1.31       130      13.2       146        99     0.238
   0/272     12/7328      2.13      1.31       130      13.3       146       108     0.237
   0/272     13/7328      2.13       1.3       130      13.3       146        69     0.237
   0/272     14/7328      2.12      1.29       130      13.3       146       139     0.237
   0/272     15/7328      2.12       1.3       130      13.3       146       116     0.237
   0/272     16/7328      2.12       1.3       130      13.3       146       135     0.237
   0/272     17/7328      2.12      1.29       130      13.2       146       116      2.01
   0/272     18/7328      2.13       1.3       130      13.2       146       139     0.314
   0/272     19/7328      2.14      1.31       130      13.2       147        95     0.251
   0/272     20/7328      2.14       1.3       130      13.2       146       150     0.242
   0/272     21/7328      2.13       1.3       130      13.2       146        96     0.241
   0/272     22/7328      2.13       1.3       130      13.2       146       150     0.241
   0/272     23/7328      2.13      1.29       130      13.2       146       186     0.241
   0/272     24/7328      2.13      1.29       130      13.2       146       187     0.241
   0/272     25/7328      2.12      1.29       130      13.2       146        97     0.241
   0/272     26/7328      2.12      1.29       130      13.2       146       130     0.243
   0/272     27/7328      2.12      1.29       130      13.2       146        91     0.242
   0/272     28/7328      2.12      1.29       130      13.2       146        89     0.245
   0/272     29/7328      2.11      1.29       130      13.2       146       156     0.242
   0/272     30/7328      2.11      1.29       130      13.2       146       113     0.241
   0/272     31/7328      2.12       1.3       130      13.2       146       175     0.247
   0/272     32/7328      2.11       1.3       130      13.2       146       125     0.242
   0/272     33/7328      2.11       1.3       130      13.2       146       136     0.244
...

glenn-jocher on 16 Apr 2019

👎1 👍1

Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):

You have small objects in your dataset, maybe you labeled by error a 1x1 zone.
Can be related to mixed precision training, again if you have small objects in your dataset. Do you use mixed_precision=True in train.py?
Try with yolov3-5l or yolov3-spp. Also try to set mixed_prectision=False if you are using it and see if the error disappear. See https://github.com/AlexeyAB/darknet/issues/2783

drapado on 16 Apr 2019

👍2

Thanks for the first reply @glenn-jocher. I try it again but the problem is still happen on my own PC, Here is the detail of the setting. Maybe my setting or parameter is wrong. I find the output of model is 'inf' which is the input variable p of function compute_loss when I step into debug. So the loss is nan. Is it normal output when using the pre-trained weight darknet53.conv.74 ?

python train.py --cfg cfg/yolov3.cfg                                               
Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10981MB)
           device1 _CudaDeviceProperties(name='GeForce RTX 2080 Ti', total_memory=10989MB)

layer                                     name  gradient   parameters                shape         mu      sigma
    0                   module.0.conv_0.weight      True          864        [32, 3, 3, 3]   -0.00339     0.0648
    1             module.0.batch_norm_0.weight      True           32                 [32]      0.987       1.07
    2               module.0.batch_norm_0.bias      True           32                 [32]     -0.698       2.07
   ...
   ...
  221                 module.105.conv_105.bias      True          255                [255]   -0.00154      0.036
Model Summary: 222 layers, 6.19491e+07 parameters, 6.19491e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
Gtk-Message: 20:13:14.545: Failed to load module "atk-bridge"
Gtk-Message: 20:13:14.547: Failed to load module "canberra-gtk-module"
   0/272      0/7328      9.76       inf       nan       nan       nan       119      3.61
   0/272      1/7328      10.6       nan       nan       nan       nan        99     0.385
   0/272      2/7328      10.6       nan       nan       nan       nan        93     0.324
   0/272      3/7328      10.5       nan       nan       nan       nan       122     0.323
   0/272      4/7328      10.8       nan       nan       nan       nan        73     0.332
   0/272      5/7328      10.7       nan       nan       nan       nan        85     0.327
   0/272      6/7328      10.8       nan       nan       nan       nan        99     0.331
   0/272      7/7328      10.8       nan       nan       nan       nan        99     0.317
   0/272      8/7328      10.9       nan       nan       nan       nan       106     0.389
   0/272      9/7328        11       nan       nan       nan       nan        64     0.321
   0/272     10/7328      11.1       nan       nan       nan       nan       116     0.329
   0/272     11/7328      11.1       nan       nan       nan       nan       133     0.323
   0/272     12/7328      11.1       nan       nan       nan       nan       117     0.377
   0/272     13/7328      11.1       nan       nan       nan       nan       143      0.32
   0/272     14/7328      11.2       nan       nan       nan       nan       140      0.33
   0/272     15/7328      11.1       nan       nan       nan       nan       121     0.385
   0/272     16/7328        11       nan       nan       nan       nan       118     0.323
   0/272     17/7328        11       nan       nan       nan       nan       117     0.396
   0/272     18/7328        11       nan       nan       nan       nan        76     0.322
   0/272     19/7328        11       nan       nan       nan       nan       115     0.328
   0/272     20/7328        11       nan       nan       nan       nan        96      0.32
   0/272     21/7328      10.9       nan       nan       nan       nan        99     0.327
   0/272     22/7328      10.9       nan       nan       nan       nan       201     0.322
   0/272     23/7328      10.9       nan       nan       nan       nan        95     0.357
   0/272     24/7328        11       nan       nan       nan       nan        79     0.328
   0/272     25/7328        11       nan       nan       nan       nan        85     0.392
   0/272     26/7328        11       nan       nan       nan       nan       203     0.328

JohnpaulC on 16 Apr 2019

@JohnpaulCheng no its not normal, I've never seen that. Your problem is either:

you inadvertently introduced a bug when modifying the default repository (happens often)
your coco download was corrupted (unlikely)
your custom data (happens often, but irrelevant here since you are on coco)
your environment (likely)

Try to git clone a clean copy of the repo as in my previous post. If this doesn't work the problem is in your environment, and I can't help you with that obviously. Also, I wouldn't bother with yolov3.cfg, yolov3-spp.cfg works better (which is why we set it as the default choice).

I just tried multi-GPU again to verify since I saw you used it. Everything works perfectly if the environment and repo are set up correctly:

sudo rm -rf yolov3
git clone https://github.com/ultralytics/yolov3 
cd yolov3
python3 train.py

Namespace(accumulate=1, backend='nccl', batch_size=16, cfg='cfg/yolov3-spp.cfg', data_cfg='data/coco.data', dist_url='tcp://127.0.0.1:9999', epochs=273, img_size=416, multi_scale=False, nosave=False, num_workers=4, rank=0, resume=False, transfer=False, world_size=1)

Using CUDA device0 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)
           device1 _CudaDeviceProperties(name='Tesla P4', total_memory=7611MB)

--2019-04-16 12:47:25--  https://pjreddie.com/media/files/darknet53.conv.74
Resolving pjreddie.com (pjreddie.com)... 128.208.4.108
Connecting to pjreddie.com (pjreddie.com)|128.208.4.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 162482580 (155M) [application/octet-stream]
Saving to: ‘weights/darknet53.conv.74’
weights/darknet53.conv.74    100%[=============================================>] 154.96M  64.9MB/s    in 2.4s    
2019-04-16 12:47:27 (64.9 MB/s) - ‘weights/darknet53.conv.74’ saved [162482580/162482580]

layer                                     name  gradient   parameters                shape         mu      sigma
    0                   module.0.conv_0.weight      True          864        [32, 3, 3, 3]   -0.00339     0.0648
    1             module.0.batch_norm_0.weight      True           32                 [32]      0.987       1.07
    2               module.0.batch_norm_0.bias      True           32                 [32]     -0.698       2.07
    3                   module.1.conv_1.weight      True        18432       [64, 32, 3, 3]   0.000298     0.0177
...
  224                 module.112.conv_112.bias      True          255                [255]  -0.000773     0.0356
Model Summary: 225 layers, 6.29987e+07 parameters, 6.29987e+07 gradients

   Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328       2.9      1.82       245      38.3       288       119      9.32
   0/272      1/7328      2.95      1.89       245      38.2       288        96     0.671
   0/272      2/7328      2.94      1.89       245      38.1       288        93     0.779
   0/272      3/7328      2.99      1.94       245      38.1       288       123     0.644
   0/272      4/7328      2.99      1.92       245        38       288        73     0.803
   0/272      5/7328      2.97      1.94       245        38       288        83     0.696
   0/272      6/7328      2.92      1.95       245        38       288        97       0.6
   0/272      7/7328      2.89      1.94       245        38       288        98     0.629
   0/272      8/7328      2.88      1.94       245        38       288       106     0.784
   0/272      9/7328      2.89      1.92       245        38       288        69     0.778
   0/272     10/7328      2.88       1.9       245      37.9       288       116     0.779
   0/272     11/7328      2.88      1.88       245      37.9       287       134     0.705
   0/272     12/7328      2.87      1.87       245      37.9       287       116     0.686
   0/272     13/7328      2.88      1.85       245      37.9       287       139     0.679
   0/272     14/7328      2.86      1.86       245      37.9       287       138      0.66
   0/272     15/7328      2.87      1.87       245      37.9       287       114     0.707
   0/272     16/7328      2.88      1.85       245      37.9       287       111     0.781
   0/272     17/7328      2.86      1.84       245      37.9       287       114     0.772
   0/272     18/7328      2.88      1.85       245      37.9       287        79     0.571
   0/272     19/7328      2.87      1.84       245      37.9       287       114     0.776
   0/272     20/7328      2.86      1.84       245      37.9       287        96     0.642
   0/272     21/7328      2.87      1.84       245      37.9       287        99     0.636
   0/272     22/7328      2.86      1.84       245      37.9       287       201     0.692
   0/272     23/7328      2.86      1.84       245      37.9       287        96     0.621
   0/272     24/7328      2.85      1.84       245      37.9       287        80       0.6
...

glenn-jocher on 16 Apr 2019

👎1 👍1

@glenn-jocher Thank you for your reply. I will try it again.

JohnpaulC on 16 Apr 2019

@glenn-jocher @drapado By chance I comment the pre-trained model loading. The training loss becomes int. I will train it for hours and confirm whether it converge or not. Thank you for your help!

Epoch       Batch        xy        wh      conf       cls     total  nTargets      time
   0/272      0/7328      3.14      2.13       260      38.3       304        72      3.81
   0/272      1/7328      2.94      2.04       260      38.5       303       130     0.276
   0/272      2/7328      2.89      2.02       260      38.2       303       130     0.271
   0/272      3/7328      2.91      2.06       260      38.2       303        83     0.271
   0/272      4/7328      2.89      2.05       260      38.2       303       100     0.271

JohnpaulC on 17 Apr 2019

Hi @JohnpaulCheng from my experience this can be caused by differenct factors, specially if you have small objects in your dataset (smaller than 16x16 when resizing):

You have small objects in your dataset, maybe you labeled by error a 1x1 zone.

Can be related to mixed precision training, again if you have small objects in your dataset. Do you use mixed_precision=True in train.py?
Try with yolov3-5l or yolov3-spp. Also try to set mixed_prectision=False if you are using it and see if the error disappear. See AlexeyAB/darknet#2783

Another note, some RTX 2080 have problems with micron memory and overheating, it can be caused by that (see AlexeyAB/darknet#2309 and AlexeyAB/darknet#2789)

@drapado,Can you tell me how small a object is going to cause nan?
Is it only objects with 1×1 size will cause the nan error？
Recently,I got a nan error.
According to your advice，I deleted the images having 1×1 object already.
I will try training tomorrow to see the impact.
Should I detele more images which have small objects?
Also,@glenn-jocher ,do you know the relationship between tiny objects and the nan error?