Faster-rcnn.pytorch: nan loss in the first epoch

Created on 14 May 2019 · 4 Comments · Source: jwyang/faster-rcnn.pytorch

When I train the model, I get a 'nan' loss in the first epoch. Does anyone know what the problem is? Thanks a lot!

EmmaSRH

Most helpful comment

I solved it by changing the code in pascal_voc.py to:
    x1 = float(bbox.find('xmin').text)
    y1 = float(bbox.find('ymin').text)
    x2 = float(bbox.find('xmax').text)
    y2 = float(bbox.find('ymax').text)
The original code subtracted 1 from each coordinate (e.g. float(bbox.find('xmin').text) - 1). That '-1' operation caused the problem.
Thanks to AlexanderHustinx for the patience!
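
A minimal sketch of why the '-1' can hurt, assuming the annotations are already 0-based and that the loader stores boxes in an unsigned 16-bit array (both are assumptions, not confirmed in this thread). With those assumptions, a box touching the image border gets a coordinate of -1, which wraps around to 65535 and produces a degenerate box. The sample XML and the helper names load_boxes, as_uint16 and find_bad_boxes are hypothetical.

```python
import ctypes
import xml.etree.ElementTree as ET

# Hypothetical annotation: the first box touches the left/top image border
# and its coordinates are already 0-based.
SAMPLE_XML = """
<annotation>
  <object>
    <name>car</name>
    <bndbox><xmin>0</xmin><ymin>0</ymin><xmax>120</xmax><ymax>80</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>30</xmin><ymin>40</ymin><xmax>200</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>
"""


def load_boxes(xml_text, offset):
    """Return [x1, y1, x2, y2] per object, subtracting `offset` from each coordinate."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.findall("object"):
        bbox = obj.find("bndbox")
        boxes.append([float(bbox.find(tag).text) - offset
                      for tag in ("xmin", "ymin", "xmax", "ymax")])
    return boxes


def as_uint16(value):
    """Mimic storing a value in an unsigned 16-bit slot (assumed storage type)."""
    return ctypes.c_uint16(int(value)).value


def find_bad_boxes(boxes):
    """Return indices of boxes that are negative, or degenerate after unsigned storage."""
    bad = []
    for i, box in enumerate(boxes):
        x1, y1, x2, y2 = (as_uint16(v) for v in box)
        if any(v < 0 for v in box) or x2 <= x1 or y2 <= y1:
            bad.append(i)
    return bad


with_offset = load_boxes(SAMPLE_XML, offset=1)     # behaviour with the '-1'
without_offset = load_boxes(SAMPLE_XML, offset=0)  # behaviour after the fix

print("bad boxes with -1:   ", find_bad_boxes(with_offset))
print("bad boxes without -1:", find_bad_boxes(without_offset))
```

Running a check like this over the whole dataset before training is a cheap way to see whether the labels themselves could be the source of the nan.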

EmmaSRH on 15 May 2019

👍4

All 4 comments

[session 1][epoch 1][iter 600/ 967] loss: nan, lr: 1.00e-03
fg/bg=(512/0), time cost: 40.919098
rpn_cls: nan, rpn_box: nan, rcnn_cls: 2.5435, rcnn_box 0.0000

EmmaSRH on 14 May 2019

There are several existing issues that describe ways to address this.
The cause can depend on a few things, e.g. dataset labels, exploding gradients, etc.

What worked for me was to clip the gradients of the model during training:
clip_gradient(fasterRCNN, 10.)

In the standard train_val.py script this is already done when using a VGG16 backend.
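
The body of clip_gradient is not quoted in this thread. As a rough sketch of the idea, assuming it performs global-norm clipping (the same idea as PyTorch's built-in torch.nn.utils.clip_grad_norm_), the arithmetic looks like this in plain Python. The function name clip_by_global_norm and the sample gradient values are hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients by one shared factor so their joint L2 norm is at most max_norm.

    `grads` is a list of per-parameter gradient lists (a stand-in for tensors).
    Returns the scaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # The factor is 1.0 when the norm is already within the limit.
    scale = max_norm / max(total_norm, max_norm)
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# Hypothetical gradients for two parameters; the joint norm is 50.
grads = [[30.0, 40.0], [0.0]]
clipped, norm_before = clip_by_global_norm(grads, 10.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

print("norm before:", norm_before)
print("norm after: ", norm_after)
```

Clipping only limits large but finite gradients. If the loss is already nan because of bad labels, the gradients are nan as well and scaling them does not help, so the dataset check has to come first.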