Hi, I ran into something weird. When I train in DistributedDataParallel mode on multiple GPUs, the GIoU loss and obj loss decrease at first and then suddenly become nan, but when I train the model on a single GPU the loss decreases the whole time. The batch sizes are the same.
What may be the reason?
Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:
Your code. If your issue is not reproducible with a git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov5  # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5 # clone latest
python detect.py # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
Your custom data. If your issue is not reproducible with COCO or COCO128 data we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
I will try. Thanks for your response.
Hi, I found that the nan value comes from the computation of the CIoU loss:
https://github.com/ultralytics/yolov5/blob/master/utils/general.py#L380
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
During my training there are cases where h1 equals 0, but I'm not sure why this happens.
@bxhandhxb h1 is the target box height (from your labels). Zero-height labels are filtered out first during label caching, and again during training after the images and labels are augmented.
In any case, I just tested torch.atan() with a divide by zero and the output is pi/2, so it is not responsible for your nan:
import torch
torch.atan(torch.tensor(1.) / torch.tensor([0.]))
tensor([1.5708])
Sorry for the late reply. I read your code carefully.
I think w1 and h1 are the predicted box width and height, and when w1 and h1 are both zero, the nan occurs.

@bxhandhxb ah I see: 0 / 0 = nan and 1 / 0 = inf in torch. w1 and h1 come from box1, which is the predicted box. So we want to add 1E-16 to all box1 widths and heights for protected division.
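For example (plain float32 tensors):
import torch
torch.tensor(0.) / torch.tensor(0.)  # tensor(nan), 0 / 0
torch.tensor(1.) / torch.tensor(0.)  # tensor(inf), 1 / 0
torch.atan(torch.tensor(0.) / torch.tensor(0.))  # tensor(nan), the nan propagates through atan()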
@bxhandhxb I've added an eps term to the IoU function in https://github.com/ultralytics/yolov5/commit/5e0b90de8f7782b3803fa2886bb824c2336358d0, which adds 1e-12 to each box's x2, y2. This should ensure that neither box1 nor box2 ever has zero width or height.
I believe this should make the function much more robust. Please git pull or clone a new copy and try again.
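In spirit the change looks something like this (a simplified sketch, not the exact commit diff; protected_wh is a hypothetical helper, boxes in (x1, y1, x2, y2) format):
import torch

def protected_wh(box, eps=1e-12):
    # hypothetical helper: nudging x2, y2 by eps keeps w = x2 - x1 and h = y2 - y1 nonzero
    x1, y1, x2, y2 = box
    return (x2 + eps) - x1, (y2 + eps) - y1

w1, h1 = protected_wh(torch.tensor([0., 0., 0., 0.]))  # 1e-12 each instead of 0, so 0 / 0 can't occur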
@glenn-jocher thank you ~
@bxhandhxb you're welcome! Try your same training with a new git clone and see if the error is resolved.
@glenn-jocher unfortunately, something weird still occurs: the loss becomes nan again.

Maybe because of the fp16 training?
I used the following training script:
python -m torch.distributed.launch --nproc_per_node 6 train.py --img-size 1920 --batch-size 48 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 1,2,3,5,6,7
Maybe I should pull your Docker image and retry...
@bxhandhxb is your loss computation done in fp16 or fp32?
It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.
It seems that even adding a 0.1 eps value to 1000 has no effect in fp16; at that magnitude the spacing between adjacent representable fp16 values is 0.5, so the 0.1 rounds away entirely:
x = torch.zeros(1) + 1000.
(x.half() + 0.1) - x.half()
Out[29]: tensor([0.], dtype=torch.float16)
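By contrast, a small eps survives fp16 rounding near zero, which is exactly where a denominator-side eps needs to act:
y = torch.zeros(1).half()
y + 0.001  # tensor([0.0010], dtype=torch.float16), preserved near zero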
I just ran your code and didn't modify anything. I don't know how torch.cuda.amp works...
BTW, my environment is Python 3.7.7 + torch 1.6.0, because there is no Python 3.8 docker image at https://hub.docker.com/r/pytorch/pytorch/tags.
Oh, fp32 only guarantees about 6 decimal digits of precision, and fp16 only about 3, so the above calculation is correct. I set eps to 1e-6 and it works.
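For reference, torch exposes the machine epsilon of each dtype, which lines up with those digit counts:
import torch
torch.finfo(torch.float32).eps  # ~1.19e-07, about 6-7 significant decimal digits
torch.finfo(torch.float16).eps  # ~9.77e-04, about 3 significant decimal digits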
I got it, thanks!
@bxhandhxb oh, great, it works!
Still, I think I should move the eps into the fraction denominators, because there, if the denominator is 0, we don't have to worry about eps losing precision. The way I have it set up now adds eps to the x2, y2 of each box, but if those values are already large, say 10 or 100, then eps will 'disappear' and have no effect, especially in fp16 ops. Does this make sense?
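Concretely, the idea is something like this (a sketch only; aspect_ratio_term is a hypothetical stand-in for the relevant part of the IoU function, and the eps value is illustrative):
import math
import torch

def aspect_ratio_term(w1, h1, w2, h2, eps=1e-7):
    # CIoU aspect-ratio penalty with eps placed in the denominators, so the
    # protection holds no matter how large the box dimensions are
    return (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps)), 2)

z = torch.tensor(0.)
aspect_ratio_term(z, z, z, z)  # tensor(0.) instead of nan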
TODO: Move eps into fraction denominators for IoU calculations.
@bxhandhxb pushed https://github.com/ultralytics/yolov5/commit/5a7d79fbe667c3162d7eacf3f65ab5ff7ef9576f to resolve the remaining nan issue in training. Please git pull and try again, and let me know if you see any more nans appear during training.
Removing TODO, assuming resolved.
@glenn-jocher hi,
I trained from scratch 3 times. Each time I trained for about 400 iterations, and the total loss decreased from approximately 0.18 to 0.145 with no nan loss. The following is my training command. I think this problem has been solved.
python -m torch.distributed.launch --nproc_per_node 4 train.py --img-size 1920 --batch-size 32 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 0,4,5,7
But I find the training speed is very slow, so I will open a new issue to describe it in detail. 😂
Thanks for your help.
@bxhandhxb oh great, nans have been successfully banished :)