Hi, I ran into something weird. When I train in DistributedDataParallel mode on multiple GPUs, the GIoU loss and obj loss decrease at first and then suddenly become nan, but when I train the model on a single GPU the loss decreases the whole time. The batch sizes are the same.
What may be the reason?
Hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:
Your code. If your issue is not reproducible with a git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov5  # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5 # clone latest
python detect.py # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
Your custom data. If your issue is not reproducible with COCO or COCO128 data we can not debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
Your environment. If your issue is not reproducible in one of the verified environments below we can not debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.
If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!
Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:
$ pip install -r requirements.txt
YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):
If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu.
I will try. Thanks for your response.
Hi, I found that the nan value comes from the computation of the CIoU loss:
https://github.com/ultralytics/yolov5/blob/master/utils/general.py#L380
v = (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / h2) - torch.atan(w1 / h1), 2)
During my training there are cases where h1 equals 0, but I'm not sure why this happens.
@bxhandhxb h1 is the target box height (from your labels). Zero-height labels are filtered out first during label caching, and again during training after the images and labels are augmented.
In any case, I just tested torch.atan() with a divide by zero and the output is pi/2, so it is not responsible for your nan:
import torch
torch.atan(torch.tensor(1.) / torch.tensor([0.]))
tensor([1.5708])
Sorry for the late reply. I read your code carefully.
I think w1 and h1 are the predicted box width and height, and when w1 and h1 are both zero, the nan occurs.

@bxhandhxb ah I see: 0 / 0 = nan and 1 / 0 = inf in torch. w1 and h1 come from box1, which is the predicted box. So we want to add 1E-16 to all box1 widths and heights for protected division.
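For example (plain float32 tensors):
import torch
torch.tensor(0.) / torch.tensor(0.)  # tensor(nan), 0 / 0
torch.tensor(1.) / torch.tensor(0.)  # tensor(inf), 1 / 0
torch.atan(torch.tensor(0.) / torch.tensor(0.))  # tensor(nan), the nan propagates through atan()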
@bxhandhxb I've added an eps term to the IoU function in https://github.com/ultralytics/yolov5/commit/5e0b90de8f7782b3803fa2886bb824c2336358d0, which adds 1e-12 to each box's x2, y2. This should ensure that neither box1 nor box2 ever has zero width or height.
I believe this should make the function much more robust. Please git pull or clone a new copy and try again.
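In spirit the change looks something like this (a simplified sketch, not the exact commit diff; protected_wh is a hypothetical helper, boxes in (x1, y1, x2, y2) format):
import torch

def protected_wh(box, eps=1e-12):
    # hypothetical helper: nudging x2, y2 by eps keeps w = x2 - x1 and h = y2 - y1 nonzero
    x1, y1, x2, y2 = box
    return (x2 + eps) - x1, (y2 + eps) - y1

w1, h1 = protected_wh(torch.tensor([0., 0., 0., 0.]))  # 1e-12 each instead of 0, so 0 / 0 can't occur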
@glenn-jocher thank you ~
@bxhandhxb you're welcome! Try your same training with a new git clone and see if the error is resolved.
@glenn-jocher unfortunately, something weird still occurs: the loss becomes nan again.

Maybe because of the fp16 training?
I used the following training script:
python -m torch.distributed.launch --nproc_per_node 6 train.py --img-size 1920 --batch-size 48 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 1,2,3,5,6,7
Maybe I should pull your Docker image and retry...
@bxhandhxb is your loss computation done in fp16 or fp32?
It's possible eps may need to be set larger, or perhaps the eps values should be moved directly into the division denominators so they don't lose precision when added to larger numbers.
It seems that even adding a 0.1 eps value to 1000 has no effect in fp16; at that magnitude the spacing between adjacent representable fp16 values is 0.5, so the 0.1 rounds away entirely:
x = torch.zeros(1) + 1000.
(x.half() + 0.1) - x.half()
Out[29]: tensor([0.], dtype=torch.float16)
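By contrast, a small eps survives fp16 rounding near zero, which is exactly where a denominator-side eps needs to act:
y = torch.zeros(1).half()
y + 0.001  # tensor([0.0010], dtype=torch.float16), preserved near zero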
I just ran your code and didn't modify anything. I don't know how torch.cuda.amp works...
BTW, my environment is Python 3.7.7 + torch 1.6.0, because there is no Python 3.8 docker image at https://hub.docker.com/r/pytorch/pytorch/tags.
Oh, fp32 only guarantees about 6 decimal digits of precision, and fp16 only about 3, so the above calculation is correct. I set eps to 1e-6 and it works.
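For reference, torch exposes the machine epsilon of each dtype, which lines up with those digit counts:
import torch
torch.finfo(torch.float32).eps  # ~1.19e-07, about 6-7 significant decimal digits
torch.finfo(torch.float16).eps  # ~9.77e-04, about 3 significant decimal digits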
I got it, thanks!
@bxhandhxb oh, great, it works!
Still, I think I should move the eps into the fraction denominators, because there, if the denominator is 0, we don't have to worry about eps losing precision. The way I have it set up now adds eps to the x2, y2 of each box, but if those values are already large, say 10 or 100, then eps will 'disappear' and have no effect, especially in fp16 ops. Does this make sense?
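Concretely, the idea is something like this (a sketch only; aspect_ratio_term is a hypothetical stand-in for the relevant part of the IoU function, and the eps value is illustrative):
import math
import torch

def aspect_ratio_term(w1, h1, w2, h2, eps=1e-7):
    # CIoU aspect-ratio penalty with eps placed in the denominators, so the
    # protection holds no matter how large the box dimensions are
    return (4 / math.pi ** 2) * torch.pow(torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps)), 2)

z = torch.tensor(0.)
aspect_ratio_term(z, z, z, z)  # tensor(0.) instead of nan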
TODO: Move eps into fraction denominators for IoU calculations.
@bxhandhxb pushed https://github.com/ultralytics/yolov5/commit/5a7d79fbe667c3162d7eacf3f65ab5ff7ef9576f to resolve the remaining nan issue in training. Please git pull and try again, and let me know if you see any more nans appear during training.
Removing TODO, assuming resolved.
@glenn-jocher hi,
I trained from scratch 3 times. Each time I trained for about 400 iterations, and the total loss decreased from approximately 0.18 to 0.145 with no nan loss. The following is my training command. I think this problem has been solved.
python -m torch.distributed.launch --nproc_per_node 4 train.py --img-size 1920 --batch-size 32 --data ./data/mydata.yaml --cfg ./models/yolov5s.yaml --weights '' --device 0,4,5,7
But I find the training speed is very slow, so I will open a new issue to describe it in detail. 😂
Thanks for your help.
@bxhandhxb oh great, nans have been successfully banished :)