Yolov3: WARNING: non-finite loss, ending training tensor([ nan, 0.53250, 1.26820, nan], device='cuda:0') on custom data set

Created on 8 Oct 2019 · 9Comments · Source: ultralytics/yolov3

Describe the bug
WARNING: non-finite loss, ending training tensor([ nan, 0.53250, 1.26820, nan], device='cuda:0') ...error on EPOCH 2/272 while attempting to train my own custom dataset (with 2 classes) using ultralytics and the yolo3.cfg.

Screen shot:
https://gyazo.com/6e7828ef7f0473fd0b955682bc4421b4

I'm using only 2 custom classes in my roulette.names, which are:

Ball
Zero

roulette.data contains paths:

classes = 2
train = data/train.txt
valid = data/test.txt
names = data/roulette.names
backup = backup/roulette3

To Reproduce
Steps to reproduce the behavior:

I took the yolov3.cfg and renamed it to roulette3.cfg
I changed:
batches = 8 (using anything > 8 batches gave me a memory error on my GTX1060 6GB for some reason)
subdivisions to 16
max_batches to 4000
steps=3200,3600
classes=2
filters=21
Added 697 images to the data/images folder and created a data/labels folder where I placed the corresponding label .txts.
The train.txt and test.txt are in the data directory
roulette.names (Ball, Zero) and roulette.data are in the data directory.
I used this command in the Windows 10 x64 command prompt to attempt to train my data before receiving the error:
python train.py --data data/roulette.data --cfg cfg/roulette3.cfg --batch-size 8

Expected behavior
I expected to be able to train my data.

Screenshots
Screen shot:
https://gyazo.com/6e7828ef7f0473fd0b955682bc4421b4

Desktop (please complete the following information):

OS: Windows 10 Pro 64 bit
Version 10..0.18362 Build 18362
- GPU: Gigabyte GTX1060 6GB

Smartphone (please complete the following information):
N/A

Additional context
I'm just trying to train custom data of a roulette wheel and ball, I want to be able to detect the ball and roulette wheel ONLY and I've already annotated my own custom data with those 2 classes. Am I missing something here?

bug

Source

alpizano

Most helpful comment

Thanks for the bug report. Yes this can happen if the learning rate is too high, the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?

You were spot on with the GIOU weight. I reduced LR to 0.0001 but it made no difference, but editing GIOU weight of train.py (line 19) was the fix and now I'm able to complete all epochs of training with no "non-finite loss" etc... error. (Reduced GIOU from 1.582 to 1)

Hopefully this helps anyone else out if there were having a similar error.

Thank you very much sir ❤

alpizano on 20 Oct 2019

👍12 🎉4

All 9 comments

Thanks for the bug report. Yes this can happen if the learning rate is too high, the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?

glenn-jocher on 16 Oct 2019

👍3 🎉2 🚀1

Thanks for the bug report. Yes this can happen if the learning rate is too high, the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?

Hopefully this helps anyone else out if there were having a similar error.

Thank you very much sir ❤

alpizano on 20 Oct 2019

👍12 🎉4

@alpizano hmmm awesome!! It would be nice to somehow make the GIoU loss more robust to this but I can't figure out an easy way to do it. Anyway, good to know lowering the weight fixed it.

glenn-jocher on 25 Oct 2019

Thanks for the bug report. Yes this can happen if the learning rate is too high, the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?

You were spot on with the GIOU weight. I reduced LR to 0.0001 but it made no difference, but editing GIOU weight of train.py (line 19) was the fix and now I'm able to complete all epochs of training with no "non-finite loss" etc... error. (Reduced GIOU from 1.582 to 1)

Hopefully this helps anyone else out if there were having a similar error.

Thank you very much sir ❤

I have encountered the same problem. When I turn the GIoU down to 1, the problem is still there. Then I lowered the learning rate, the problem was fixed. I don't know why, maybe the hyperparameters' setting are related to the training datasets. Thank you for your help.

zhuxiang on 2 Nov 2019

Thanks for the bug report. Yes this can happen if the learning rate is too high, the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?

I lowered LR and GIOU, then the problem was fixed. Thank you.

zhuxiang on 2 Nov 2019

I had a similar issue, spotted it while experimenting with the focal loss. I had a nan for the objectness loss. It was caused by setting the targets for the objectness measure equal to the giou, however the giou can be between -1 and +1 and not between 0 and +1. If inside the BCE you put a negative target, there be dragons!

You can fix this in the compute_loss function by not only computing the giou (used in the box loss), but also the iou (to use as an objectness target). Something like this:

giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou computation
iou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=False)
lbox += (1.0 - giou).sum() if red == 'sum' else (1.0 - giou).mean()  # giou loss
tobj[b, a, gj, gi] = iou.detach().type(tobj.dtype) if giou_flag else 1.0 # target Pobj is either the IoU, or simply 1, depending on the flag