Describe the bug
Training stops with "WARNING: non-finite loss, ending training tensor([ nan, 0.53250, 1.26820, nan], device='cuda:0')" at epoch 2/272 while I am training my own custom dataset (2 classes) using ultralytics and the yolo3.cfg.
Screen shot:
https://gyazo.com/6e7828ef7f0473fd0b955682bc4421b4
I'm using only 2 custom classes in my roulette.names, which are:
Ball
Zero
roulette.data contains paths:
classes = 2
train = data/train.txt
valid = data/test.txt
names = data/roulette.names
backup = backup/roulette3
To Reproduce
Steps to reproduce the behavior:
data/images folder and created a data/labels folder where I placed the corresponding label .txts.
data directory
data directory.
python train.py --data data/roulette.data --cfg cfg/roulette3.cfg --batch-size 8
Expected behavior
I expected to be able to train my data.
Desktop (please complete the following information):
Smartphone (please complete the following information):
N/A
Additional context
I'm just trying to train on custom data of a roulette wheel and ball. I want to detect the ball and roulette wheel ONLY, and I've already annotated my own custom data with those 2 classes. Am I missing something here?
Thanks for the bug report. Yes, this can happen if the learning rate is too high: the losses diverge to inf/nan. Can you try a reduced LR or a reduced GIOU weight?
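To illustrate how a learning rate that is too high ends in inf/nan, here is a toy sketch. It is not the repository's training loop, just plain gradient descent on loss(w) = w * w; the function name `first_non_finite_step` and the two learning rates are made up for illustration.

```python
import math

def first_non_finite_step(lr, steps=2000, w=1.0):
    """Run gradient descent on loss(w) = w * w.

    Returns the first step at which the loss is inf/nan, or None if the
    loss stays finite for all `steps`.
    """
    for step in range(steps):
        loss = w * w                  # float overflow yields inf once |w| is huge
        if not math.isfinite(loss):
            return step
        w -= lr * 2.0 * w             # gradient of w * w is 2 * w
    return None

print(first_non_finite_step(lr=0.1))  # small step: w shrinks, loss stays finite
print(first_non_finite_step(lr=1.5))  # large step: |w| doubles every update until the loss overflows
```

With the small rate every update shrinks w; with the large one every update overshoots the minimum by more than it corrects, so the loss grows until it is no longer representable. A real network is far less tidy, but the failure mode is the same.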
You were spot on with the GIOU weight. Reducing the LR to 0.0001 made no difference, but lowering the GIOU weight in train.py (line 19) from 1.582 to 1 was the fix, and now I'm able to complete all epochs of training without the "non-finite loss" error.
Hopefully this helps anyone else who is having a similar error.
Thank you very much sir ❤
@alpizano hmmm awesome!! It would be nice to somehow make the GIoU loss more robust to this but I can't figure out an easy way to do it. Anyway, good to know lowering the weight fixed it.
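For anyone wondering why the weight matters: a loss gain just multiplies one term before the terms are summed, so a larger GIoU gain means proportionally larger gradients from the box term. The sketch below is illustrative only. `weighted_loss`, `non_finite_terms`, and the term names are hypothetical, not the repository's code, and the numbers are placeholders. It also includes a per-term finiteness check, which can help identify which component went nan.

```python
import math

def weighted_loss(terms, gains):
    """Sum loss terms after scaling each by its gain (default gain 1.0)."""
    return sum(gains.get(name, 1.0) * value for name, value in terms.items())

def non_finite_terms(terms):
    """Return the names of loss terms that are inf or nan."""
    return [name for name, value in terms.items() if not math.isfinite(value)]

# Hypothetical per-batch loss terms; names and values are placeholders.
terms = {"box": 0.8, "obj": 0.5325, "cls": 1.2682}
print(weighted_loss(terms, {"box": 1.582}))  # box term scaled up
print(weighted_loss(terms, {"box": 1.0}))    # reduced gain, smaller total

bad = {"box": float("nan"), "obj": 0.5325, "cls": 1.2682}
print(non_finite_terms(bad))                 # names of the terms that diverged
```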
I encountered the same problem. Turning the GIoU weight down to 1 did not fix it, but lowering the learning rate did. I don't know why; maybe the right hyperparameter settings depend on the training dataset. Thank you for your help.
I lowered LR and GIOU, then the problem was fixed. Thank you.
I had a similar issue, which I spotted while experimenting with the focal loss: I got a nan for the objectness loss. It was caused by setting the targets for the objectness measure equal to the giou; however, the giou ranges from -1 to +1, not from 0 to +1. If you put a negative target inside the BCE, there be dragons!
You can fix this in the compute_loss function by not only computing the giou (used in the box loss), but also the iou (to use as an objectness target). Something like this:
giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou, in [-1, 1], used for the box loss
iou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=False)  # plain iou, in [0, 1], used as the objectness target
lbox += (1.0 - giou).sum() if red == 'sum' else (1.0 - giou).mean()  # giou loss
tobj[b, a, gj, gi] = iou.detach().type(tobj.dtype) if giou_flag else 1.0  # target Pobj is either the IoU or simply 1, depending on the flag
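To make the range argument concrete, here is a small stdlib-only sketch. It is independent of the repository: `iou_and_giou` and `bce` are illustrative helpers written for this example, using corner-format boxes (x1, y1, x2, y2). It shows that two disjoint boxes have IoU 0 but a negative GIoU, and that BCE with a negative target can itself go negative, so it has no lower bound for the optimizer to settle at.

```python
import math

def iou_and_giou(a, b):
    """IoU and GIoU for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest box enclosing both a and b.
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (enclose - union) / enclose
    return iou, giou

def bce(p, t):
    """Binary cross-entropy for a prediction p in (0, 1) and a target t."""
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

# Two disjoint unit boxes: IoU is 0, GIoU is negative.
iou, giou = iou_and_giou((0, 0, 1, 1), (3, 0, 4, 1))
print(iou, giou)

print(bce(1e-6, iou))   # target in [0, 1]: loss is never below 0
print(bce(1e-6, giou))  # negative target: loss is negative and keeps falling as p -> 0
```

With a target in [0, 1] the BCE is bounded below by 0. With a negative target it decreases without limit as the predicted probability approaches 0, which pushes the objectness logit toward ever larger negative values.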
@tibistrat latest code has a clamp(0) here to prevent this.
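As a rough illustration of what a lower clamp at 0 does to the targets, here is a scalar sketch. It is not the repository's code: `clamp_min` and `bce` are hypothetical helpers and the GIoU values are made up.

```python
import math

def clamp_min(x, lo=0.0):
    """Scalar equivalent of a lower clamp: values below `lo` become `lo`."""
    return max(x, lo)

def bce(p, t):
    """Binary cross-entropy for a prediction p in (0, 1) and a target t."""
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))

gious = [-0.5, 0.0, 0.37, 0.92]           # made-up GIoU values
targets = [clamp_min(g) for g in gious]   # negative values are raised to 0
print(targets)

# With clamped targets the loss is non-negative even for an extreme prediction.
print([bce(0.001, t) for t in targets])
```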
I see this problem with the latest version.
This issue should be resolved in yolov5:
https://github.com/ultralytics/yolov5