Faster-rcnn.pytorch: Issue ：loss: nan

Created on 6 Jan 2019 · 13 Comments · Source: jwyang/faster-rcnn.pytorch

I used the dataset provided by the author, with the pretrained model vgg16_caffe.pth. When I try to train the model, it shows loss: nan, rcnn_cls: nan, rcnn_box nan. I tried changing the box parsing to:
x1 = float(bbox.find('xmin').text) #- 1
y1 = float(bbox.find('ymin').text) #- 1
x2 = float(bbox.find('xmax').text) #- 1
y2 = float(bbox.find('ymax').text) #- 1
However, it doesn't work.

before filtering, there are 10022 images...
after filtering, there are 10022 images...
10022 roidb entries
Loading pretrained weights from data/pretrained_model/vgg16_caffe.pth
[session 1][epoch 1][iter 0/1670] loss: nan, lr: 1.00e-06
fg/bg=(157/1379), time cost: 1.932488
rpn_cls: 0.7493, rpn_box: 0.2117, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 100/1670] loss: nan, lr: 1.00e-06
fg/bg=(1536/0), time cost: 193.700977
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
^CTraceback (most recent call last):
File "trainval_net.py", line 330, in
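
A log like the one above can be scanned mechanically to see which loss term turns nan first. The following is a minimal sketch that assumes the log format shown in this post (including the missing colon in `rcnn_box nan`). The function name `first_nan_iterations` and the regular expressions are illustrative, not part of the repository.

```python
import math
import re

# Matches "name: value" and also "name value"
# (the log prints "rcnn_box nan" without a colon).
TERM_RE = re.compile(r"(rpn_cls|rpn_box|rcnn_cls|rcnn_box):?\s+(nan|[-+0-9.eE]+)")
ITER_RE = re.compile(r"\[iter\s+(\d+)/\d+\]")


def first_nan_iterations(log_text):
    """Map each loss term to the first logged iteration at which it is nan."""
    first_nan = {}
    current_iter = None
    for line in log_text.splitlines():
        match = ITER_RE.search(line)
        if match:
            current_iter = int(match.group(1))
        for name, value in TERM_RE.findall(line):
            if math.isnan(float(value)) and name not in first_nan:
                first_nan[name] = current_iter
    return first_nan


# Log lines copied from the post above.
LOG = """\
[session 1][epoch 1][iter 0/1670] loss: nan, lr: 1.00e-06
fg/bg=(157/1379), time cost: 1.932488
rpn_cls: 0.7493, rpn_box: 0.2117, rcnn_cls: nan, rcnn_box nan
[session 1][epoch 1][iter 100/1670] loss: nan, lr: 1.00e-06
fg/bg=(1536/0), time cost: 193.700977
rpn_cls: nan, rpn_box: nan, rcnn_cls: nan, rcnn_box nan
"""

print(first_nan_iterations(LOG))
```

On this log the RCNN terms are already nan at iteration 0, while the RPN terms are still finite there and only show nan at iteration 100.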


JanVin

All 13 comments

The VOC dataset stores bounding box coordinates as 1-indexed, so you need to keep the -1 that you have commented out. Removing it might be what makes the loss computation go wrong and produce nan.
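
To make the indexing point concrete, here is a minimal sketch of the conversion, assuming annotations in the Pascal VOC XML layout (an `object` element containing a `bndbox` with `xmin`, `ymin`, `xmax`, `ymax`). The helper name `load_boxes_zero_based` is hypothetical and is not a function from the repository. It also rejects boxes that would go negative after the subtraction, which can happen when a custom dataset is already 0-indexed.

```python
import xml.etree.ElementTree as ET


def load_boxes_zero_based(xml_text):
    """Parse VOC-style XML and return 0-based (x1, y1, x2, y2) tuples.

    Hypothetical helper for illustration; assumes 1-based pixel coordinates.
    """
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        # Subtract 1 to move from 1-based VOC pixels to 0-based pixels.
        x1 = float(bbox.find("xmin").text) - 1
        y1 = float(bbox.find("ymin").text) - 1
        x2 = float(bbox.find("xmax").text) - 1
        y2 = float(bbox.find("ymax").text) - 1
        if min(x1, y1) < 0:
            # A negative value means the source was not 1-based after all.
            raise ValueError(f"negative coordinate after -1: {(x1, y1, x2, y2)}")
        boxes.append((x1, y1, x2, y2))
    return boxes


# Made-up sample annotation, only for demonstration.
SAMPLE = """
<annotation>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>
"""

print(load_boxes_zero_based(SAMPLE))
```

Raising an error on a negative coordinate is a design choice for this sketch: it surfaces a mismatch between the dataset's indexing and the loader before training starts, instead of letting a bad box reach the loss.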

adityaarun1 on 9 Jan 2019

👍1

I ran into the same problem and still don't know how to solve it.

EmmaSRH on 14 May 2019

I used the res101 network instead and it works.


JanVin on 14 May 2019

I just tried res101 and got the same problem again.

EmmaSRH on 14 May 2019

Which branch did you use?


JanVin on 15 May 2019

Try the master branch.


JanVin on 15 May 2019

I used the pytorch1.0 branch. I think the only difference may be the CUDA build rather than the code itself? What I want to know is what caused this 'nan' loss: the dataset or something else?

EmmaSRH on 15 May 2019

Most likely it's the dataset that caused the 'nan' loss (I have run into this kind of problem myself). You have to make sure that no bounding box has its top-left corner at (0,0). That is, in the xml files, no (xmin, ymin) pair should be (0,0). If any (xmin, ymin) pair is (0,0), change it to at least (1,1).
BTW, once you have made sure there is no (xmin, ymin) = (0,0) in the xml files, you must delete the files under the "cache" directory before training again. Hope this helps.
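
The check described above can be automated. Below is a minimal sketch, assuming VOC-style XML annotations with `bndbox`, `xmin` and `ymin` elements. The function name `find_zero_boxes` and the directory path in the usage section are assumptions for illustration, not part of the repository. The sketch flags any box whose xmin or ymin is below 1, which is slightly broader than the exact (0,0) case. One plausible reason these boxes matter (an assumption, not confirmed in this thread) is that subtracting 1 from a 0 coordinate produces a negative value that the loader does not handle.

```python
import glob
import os
import xml.etree.ElementTree as ET


def find_zero_boxes(annotation_dir):
    """Return (file name, xmin, ymin) for every box with xmin < 1 or ymin < 1."""
    offenders = []
    for path in sorted(glob.glob(os.path.join(annotation_dir, "*.xml"))):
        root = ET.parse(path).getroot()
        for bbox in root.iter("bndbox"):
            xmin = float(bbox.find("xmin").text)
            ymin = float(bbox.find("ymin").text)
            if xmin < 1 or ymin < 1:
                offenders.append((os.path.basename(path), xmin, ymin))
    return offenders


if __name__ == "__main__":
    # Assumed VOC-style location; adjust to wherever your XML files live.
    annotation_dir = "data/VOCdevkit2007/VOC2007/Annotations"
    for name, xmin, ymin in find_zero_boxes(annotation_dir):
        print(f"{name}: xmin={xmin}, ymin={ymin}")
```

The script only reports offending files and does not modify them, so you can review each annotation before editing it and clearing the cache.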

ensenginbaieer on 17 Jun 2019

👍1

I just use the res101 can got the same problem again.

I'm hitting the same error too. Have you solved it? Thanks.

zhengxinvip on 26 Jul 2019

As I recall, it was a data problem. Check whether any bounding box in your dataset has a coordinate of 0 at the image border. Different backbones also seem to have some hard-coded values, so check the output of each one.

EmmaSRH on 29 Jul 2019
