Py-faster-rcnn: bbox_transform.py:48: RuntimeWarning: overflow encountered in exp ...

Created on 14 Jan 2016 · 35 Comments · Source: rbgirshick/py-faster-rcnn

In proposal_layer.py's forward() function, when I print bbox_deltas.min() and bbox_deltas.max(), at some point the values suddenly become large, causing an overflow and a core dump. Here is the log:

-0.843478 0.695785
-1.53431 1.09048
-2.39332 1.81395
-2.74009 1.98957
-0.368922 0.236118
-0.707322 0.23115
I0115 01:02:23.016412 31390 solver.cpp:242] Iteration 40, loss = 1.8799
I0115 01:02:23.016444 31390 solver.cpp:258]     Train net output #0: loss_bbox = 0.040359 (* 1 = 0.040359 loss)
I0115 01:02:23.016451 31390 solver.cpp:258]     Train net output #1: loss_cls = 0.240918 (* 1 = 0.240918 loss)
I0115 01:02:23.016456 31390 solver.cpp:258]     Train net output #2: rpn_cls_loss = 0.585994 (* 1 = 0.585994 loss)
I0115 01:02:23.016461 31390 solver.cpp:258]     Train net output #3: rpn_loss_bbox = 0.883768 (* 1 = 0.883768 loss)
I0115 01:02:23.016472 31390 solver.cpp:571] Iteration 40, lr = 0.001
-5.10243 4.21369
-5.00325 3.99131
-6.54417 1.98163
-8.08618 2.3706
-8.88115 1.3943
-3.91337 0.64184
-1.92415 2.32623
-1.19933 1.09659
-4.12942 3.17897
-4.96536 3.46139
-2.6374 1.79074
-2.49792 1.71725
-17.6217 14.487
-22.151 18.5744
-21.8844 17.797
-16.2733 13.0881
-1.50187 1.95901
-1.43967 1.50989
-14.1043 28.3903
-9.91849 19.1581
-31.6936 8.46099
-27.262 7.1318
-28.2349 125.76
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
  pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
  pred_w = np.exp(dw) * widths[:, np.newaxis]
-26.7032 118.203
-741.881 505.883
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
  pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
  pred_h = np.exp(dh) * heights[:, np.newaxis]
-692.391 472.156
-9.47346e+25 1.02213e+26
-6.35599e+25 6.85811e+25
nan nan
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
  keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/faster_rcnn_end2end_handdet.sh: line 39: 31390 Floating point exception(core dumped

Can anyone help figure out what the problem could be?

ZhengRui

👍2

All 35 comments

I have the same issue here.
In my case, it didn't happen with CUDA toolkit 6.5, but after switching to 7.5 it happens in every training run. I reinstalled the OS (Ubuntu 14.04) and all other libraries, but the problem didn't go away...

pyoguy on 15 Jan 2016

After many tries and tests, I found it is related to RNG_SEED. For my dataset, using the VGG16 pre-trained model with the default RNG_SEED = 3 always leads to instability of dw and dh (called tw and th in the paper).

Symptom: if you print dw.max() and dh.max(), at some point between iterations 20 and 40 they become larger than 10. In the following iterations they can oscillate between a huge value (which can terminate the program as shown above) and almost 0.

With the default random seed, I still had a small chance of getting through the whole training process if I tried many times. If dw.max() and dh.max() stay below 2 or 3 for the first 200 iterations, the training has very likely passed the danger zone and is on the right track. After changing the random seed to another value such as 17, almost every run goes fine. And the default value 3 works perfectly for the ZF net.

So my conclusion: if you hit the same problem, try changing the RNG_SEED value and watch the maximum values of dw and dh. If training gets past 400 iterations without an overflow, it is going to be fine.
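
The monitoring described above can be sketched as a small helper. This is only an illustration: `check_deltas` and its `warn_at` threshold are hypothetical names (the value 10 is taken from the symptom described above), and the column layout (dx, dy, dw, dh repeated every four columns) is an assumption about how the deltas are stored.

```python
import numpy as np

def check_deltas(bbox_deltas, warn_at=10.0):
    """Return (dw_max, dh_max, suspicious) for an (N, 4*K) array of deltas.

    Assumes columns are laid out as (dx, dy, dw, dh), repeated K times.
    """
    dw = bbox_deltas[:, 2::4]
    dh = bbox_deltas[:, 3::4]
    dw_max = float(dw.max())
    dh_max = float(dh.max())
    # NaN/inf never compare greater than the threshold, so test for them explicitly.
    finite = bool(np.isfinite(bbox_deltas).all())
    suspicious = (not finite) or dw_max > warn_at or dh_max > warn_at
    return dw_max, dh_max, suspicious

# Hypothetical values: the second row has a dw that is already too large.
deltas = np.array([[0.1, 0.2, 0.5, 0.3],
                   [0.0, 0.0, 12.0, 1.0]])
print(check_deltas(deltas))
```

Calling something like this once per iteration makes it easier to see whether a run is drifting toward the overflow before the program actually crashes.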

By the way, I tried changing the batch size and RPN batch size in the yml file. After Caffe loaded, it did show the batch size values I set, but the GPU memory usage seems unchanged. Is that normal?

ZhengRui on 16 Jan 2016

👍7 😕3

@ZhengRui I have run into a similar problem. Following your instructions, I printed dw.max() and dh.max(); they are around 3 and 4 respectively, but then they suddenly become nan without any gradual change. It is confusing!
Also, after changing the random seed to another value such as 17, training ran properly once but failed the next time. Larger random seeds seem to hit the same problem. So I have difficulty understanding what RNG_SEED does. Does it affect the speed of convergence, or something else? Looking forward to your reply. Thank you!

MenglaiWang on 19 Jan 2016

RNG_SEED is just the seed for the random number generator; you can change it to any number as long as it works. So change it to a number that gives you a lower chance of hitting this issue. I don't think it will affect the speed of convergence or anything else; it's just a random seed.
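
To illustrate what "just a seed" means, here is a minimal sketch using NumPy's generator. `shuffled_order` is a hypothetical helper, and the assumption is that the seed ends up driving things like the order in which training images are visited; it is not code from the repository.

```python
import numpy as np

def shuffled_order(seed, num_images=8):
    """Return a shuffled image order that depends only on the seed."""
    rng = np.random.RandomState(seed)
    return rng.permutation(num_images)

same_a = shuffled_order(3)
same_b = shuffled_order(3)    # same seed: identical order on every run
other = shuffled_order(17)    # different seed: generally a different order

print(same_a, same_b, other)
```

The seed does not change the model or the data, only which random draws are made, which is why a different seed can steer early training away from (or into) the unstable region.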

ZhengRui on 19 Jan 2016

I got this issue, too. Since my training images are cropped to the target objects, I suspect that bounding boxes close to the image size cause this problem. I haven't found a solution yet.

LiberiFatali on 18 Feb 2016

You can try padding zeros around your training images before feeding them to the network, so that the target objects occupy a smaller fraction of the image. The network always rescales input images to around 600x1000 or 1000x600 as a first step, so simply downsizing the images won't work; you have to pad.

ZhengRui on 18 Feb 2016

I also encountered this problem recently; here is my suggestion for tracking down where the bug comes from. Most importantly, I found that the bug appears when erroneous RoI bounding box values (e.g. 65535) are loaded into the db, and it is caused by an incorrect implementation of annotation loading.
In my annotation loader, I copied the PASCAL method for loading the XML file. That method subtracts 1 from every bbox value (x1, y1, x2, y2), so any value that is 0 (stored as an unsigned int) wraps around to 65535 when 1 is subtracted.
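
The wrap-around is easy to reproduce in isolation. The snippet below is a minimal sketch with made-up coordinates, assuming the boxes are held in an unsigned 16-bit array as described above.

```python
import numpy as np

# Hypothetical annotation stored as unsigned 16-bit integers; x1 is already 0.
box = np.array([0, 12, 99, 80], dtype=np.uint16)   # x1, y1, x2, y2

# Subtracting 1 in an unsigned type wraps around silently: 0 - 1 -> 65535.
shifted = box - np.uint16(1)
print(shifted)

# Converting to a signed type first gives -1 instead, which a simple
# "coordinates must be >= 0" check can then catch.
safe = box.astype(np.int32) - 1
print(safe)
```

Either skip the subtraction when the annotations are already 0-based, or do the arithmetic in a signed type and validate the result.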

RyanLiuNtust on 12 May 2016

👍10

Has anyone solved this? I tried @RyanLiuNtust's and @ZhengRui's methods, but they don't work for me.
I got this error:

/faster-rcnn-py/tools/../lib/fast_rcnn/bbox_transform.py:50: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
faster-rcnn-py/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]

I printed ws and hs, and they turn out to be NaN.
Could anyone help?

neuleaf on 22 May 2016

👍9

A possible solution is to decrease the base learning rate in solver.prototxt, as recommended here: http://caffe.berkeleyvision.org/tutorial/solver.html
Just change base_lr: 0.001 to base_lr: 0.0001.

Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.

I changed the base_lr value and the NaN values disappeared.

azamattokhtaev on 31 May 2016

👍6 😕1

A question... @ZhengRui what do you mean by "You can try to pad zeros around your training images"? Can you give a quick example? Thanks!

cyberdecker on 20 Jul 2016

*********
***obj***
*********

*********************
*********************
*********obj*********
*********************
*********************

ZhengRui on 20 Jul 2016

Is there a Caffe argument to zero-pad all the images?

vikiboy on 21 Jul 2016

I am not sure whether Caffe has it now, but I did it myself beforehand, as a data augmentation step:

import numpy as np

def paddingzeros(im, desMin, desMax):
    # Zero-pad so the shorter side is at least desMin and the longer side at least desMax.
    # Integer division (//) keeps the pad widths integral on both Python 2 and 3.
    if im.shape[0] <= im.shape[1]:
        if im.shape[0] < desMin:
            im = np.pad(im, (((desMin-im.shape[0])//2, (desMin+1-im.shape[0])//2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMax:
            im = np.pad(im, ((0, 0), ((desMax-im.shape[1])//2, (desMax+1-im.shape[1])//2), (0, 0)), 'constant')

    if im.shape[0] > im.shape[1]:
        if im.shape[0] < desMax:
            im = np.pad(im, (((desMax-im.shape[0])//2, (desMax+1-im.shape[0])//2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMin:
            im = np.pad(im, ((0, 0), ((desMin-im.shape[1])//2, (desMin+1-im.shape[1])//2), (0, 0)), 'constant')

    print('after padding: %s' % (im.shape,))
    return im


im = paddingzeros(im, 600, 1000)

ZhengRui on 22 Jul 2016

Would doing this also require changing the bounding box values in the annotation XML files?

vikiboy on 22 Jul 2016

Yes, you also have to update the bounding box annotations for the padded images by adding the offsets in the width and height directions.
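
A minimal sketch of padding an image and shifting its boxes together. `pad_image_and_boxes` is a hypothetical helper, not part of the repository; it assumes an (H, W, C) image and boxes given as (x1, y1, x2, y2), and it takes the target height and width explicitly instead of picking them from the image orientation.

```python
import numpy as np

def pad_image_and_boxes(im, boxes, des_h, des_w):
    """Zero-pad `im` to at least (des_h, des_w), centred, and shift the boxes."""
    h, w = im.shape[:2]
    pad_h = max(des_h - h, 0)
    pad_w = max(des_w - w, 0)
    top, left = pad_h // 2, pad_w // 2
    padded = np.pad(im, ((top, pad_h - top), (left, pad_w - left), (0, 0)), 'constant')

    # x coordinates move by the left offset, y coordinates by the top offset.
    shifted = np.array(boxes, dtype=np.int64).reshape(-1, 4)
    shifted[:, [0, 2]] += left
    shifted[:, [1, 3]] += top
    return padded, shifted

# Hypothetical 100x200 image with a single box.
im = np.ones((100, 200, 3), dtype=np.uint8)
padded, shifted = pad_image_and_boxes(im, [[10, 20, 50, 60]], 600, 1000)
print(padded.shape, shifted)
```

Returning the shifted boxes alongside the padded image keeps the two in sync, so the annotation files can be rewritten from the same call.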

ZhengRui on 22 Jul 2016

Can someone explain the intuition behind this issue? Is the backpropagation algorithm oscillating with increasingly bigger steps around the optimum, eventually causing overflows in python?

If so, are these understandings then correct:

A different starting point (aka changing RNG_SEED) might make it optimize correctly.
A lower learning rate should prevent it from starting to oscillate out of control, as suggested by @azamattokhtaev

I'm trying to get a better understanding of what is going wrong here :).
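
The "increasingly bigger steps" picture can be reproduced on a toy problem. The sketch below minimises a one-dimensional quadratic, which is only an analogy and not the Faster R-CNN loss; `gradient_descent` and its parameters are hypothetical.

```python
def gradient_descent(lr, steps=20, x0=1.0, curvature=10.0):
    """Minimise f(x) = 0.5 * curvature * x**2; return the |x| trajectory."""
    x = x0
    history = [abs(x)]
    for _ in range(steps):
        grad = curvature * x
        x = x - lr * grad       # each step multiplies x by (1 - lr * curvature)
        history.append(abs(x))
    return history

small = gradient_descent(lr=0.05)   # factor 0.5 per step: shrinks toward 0
large = gradient_descent(lr=0.25)   # factor -1.5 per step: sign flips and grows

print(small[-1], large[-1])
```

With the larger rate every update overshoots the minimum by more than it corrected, so the iterate oscillates with growing amplitude. In that picture both of the points above are plausible: a lower learning rate shrinks the step factor, and a different starting point can avoid the steep region early on.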

hgaiser on 2 Aug 2016

For me it was @RyanLiuNtust's method that worked: I fixed the ground-truth XML files so that the coordinates are 1-based.

assafmus on 30 Aug 2016

@neuleaf Have you solved the problem? I have the same problem as you; my output is also NaN. I checked the outputs of bottom[0], bottom[1], and bottom[2] in proposal_layer.py. Only bottom[2] has values. I also tried @azamattokhtaev's solution, but it doesn't help.

fbi0817 on 17 Oct 2016

Where do I change the RNG_SEED value?
@ZhengRui

zwyzwy on 25 Oct 2016

For me the issue was resolved when lowering the learning rate in solver.prototxt.

The RNG_SEED can be changed in the config file.

hgaiser on 25 Oct 2016

@fbi0817 @ZhengRui Did you solve the problem? I am facing the same thing. I tried to reduce the learning rate but had no progress. Suddenly my loss turns to nan and I receive the warning overflow encountered in exp (around iteration 6000). I am using Pascal Voc2007. Could you help me, please?

fernandorovai on 6 Dec 2016

@ZhengRui @fernandorovai Have you solved the problem? I am also running into the same issue.

DeepDriving on 10 Jun 2017

Hey, I changed the lr to 0.00001, removed the minus 1 when reading the annotations, and changed the RNG seed to 17. I still keep getting the error.

abhiML on 26 Jun 2017

@abhiML Study this: https://github.com/longcw/faster_rcnn_pytorch/blob/master/faster_rcnn/network.py#L109

acgtyrant on 26 Jun 2017

I also encountered this problem; changing RNG_SEED to 4 solved it.

jiangwqcooler on 5 Sep 2017

😕4 ❤1 🎉1 😄1 👍1

I had this problem when running on my own dataset. Things to check:

incorrect bbox => better to add some checks in the load_annotation function
learning rate too large => reduce the learning rate
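
One possible shape for such a check, as a sketch only. `check_boxes` is a hypothetical helper (not the poster's function) that assumes 0-based (x1, y1, x2, y2) boxes and a known image width and height.

```python
import numpy as np

def check_boxes(boxes, width, height):
    """Return the indices of boxes that fail basic sanity checks."""
    boxes = np.asarray(boxes, dtype=np.int64).reshape(-1, 4)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    bad = ((x1 < 0) | (y1 < 0) |            # negative coordinates
           (x2 >= width) | (y2 >= height) | # outside the image
           (x2 <= x1) | (y2 <= y1))         # empty or inverted box
    return np.where(bad)[0]

# Hypothetical annotations for a 640x480 image.
boxes = [[0, 0, 10, 10],        # fine
         [65535, 5, 20, 20],    # wrapped-around x1
         [5, 5, 5, 9],          # zero width
         [3, 3, 700, 20]]       # x2 beyond the image
print(check_boxes(boxes, 640, 480))
```

Running this on every image while building the roidb, and stopping when anything is flagged, surfaces bad annotations before they reach training.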

skyuuka on 24 Nov 2017

👍1

@skyuuka About adding checks to the load_annotation function: how is your function written?

ml930310 on 6 Dec 2017

Since bbox_transform_inv is only used to draw the bbox on the image, it has nothing to do with training. Reducing the learning rate can avoid the nan.

AIML on 22 Jan 2018

The NaN in the loss occurs because dw and dh explode to extremely high values in the initial stages of training when using a large learning rate (lr > 0.01).

The newly released Detectron code has a fix for this.

Just update your config file to have a line:

__C.BBOX_XFORM_CLIP = np.log(1000. / 16.)

and add these lines just before pred_ctr_x is computed in your bbox_transform.py:

    # Prevent sending too large values into np.exp()
    dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
    dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)

It works with any RNG_SEED and very high learning rates (lr = 0.01)
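
To see the effect of the clip in isolation, here is a minimal sketch. `decode_sizes` is a hypothetical stand-in for the width/height part of the transform, and the input values are made up (125.76 is the dw maximum from the log at the top of this issue).

```python
import numpy as np

BBOX_XFORM_CLIP = np.log(1000. / 16.)   # about 4.135, i.e. at most a 62.5x scale-up

def decode_sizes(widths, heights, dw, dh, clip=BBOX_XFORM_CLIP):
    """Turn (dw, dh) deltas into predicted box sizes, clipping before exp()."""
    # Prevent sending too large values into np.exp()
    dw = np.minimum(dw, clip)
    dh = np.minimum(dh, clip)
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]
    return pred_w, pred_h

widths = np.array([16., 16.], dtype=np.float32)
heights = np.array([16., 16.], dtype=np.float32)
dw = np.array([[0.0], [125.76]], dtype=np.float32)   # second value would overflow float32
dh = np.array([[0.0], [125.76]], dtype=np.float32)

pred_w, pred_h = decode_sizes(widths, heights, dw, dh)
print(pred_w.ravel(), pred_h.ravel())
```

In float32, np.exp overflows once its argument exceeds roughly 88.7, so without the clip the second box would come out as inf. With it, a 16-pixel box is capped at about 1000 pixels. Note that the clip only keeps the forward pass finite; it does not by itself stop the deltas from growing.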

meetshah1995 on 19 Feb 2018

👍12 ❤2

@ZhengRui @pyoguy @MenglaiWang @LiberiFatali @RyanLiuNtust
Hi guys, when I train FPN on my own dataset, I hit the same error:

I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
/home/zq/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating point exception (core dumped)

I tried changing lr from 0.001 to 0.0001, but it didn't work. I also changed RNG_SEED, and that didn't work either.
I don't know how to solve it. Please help me, thanks so much!

zqdeepbluesky on 12 Mar 2018

@meetshah1995 I am using CPU-only mode. After applying your solution the problem still exists, though there are no nan values:
ws [1. 1. 1. ... 1. 1. 1.]
hs [1. 1. 1. ... 1. 1. 1.]
min_size 25.600000381469727
keep []
experiments/scripts/faster_rcnn_end2end.sh: line 58: 67407 Floating point exception(core dumped) ./tools/train_net.py --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}

kingchenchina on 23 Mar 2018

@MenglaiWang I use multi-GPU mode with your solution; the problem still exists.

ygren on 24 Mar 2018

@kingchenchina Seeing so many 1. values, the problem is very likely that the generated proposals are all just one pixel in size. proposal_layer.py then calls _filter_boxes(), which leaves no proposals. The empty proposals are then used as the rois blob, and reshaping that blob raises the floating point exception.

You can use the following strategy to avoid empty proposals:

        keep = _filter_boxes(proposals, min_size * im_info[2])
        if len(keep)!=0:
            proposals = proposals[keep, :]
            scores = scores[keep]

But then the loss becomes nan, so turning down the learning rate would be a better approach.

zchrissirhcz on 8 Apr 2018

I have run into a similar problem too. Following your instructions, I printed dw.max() and dh.max(); they are around 7489.9507 and 11519.379 respectively. I can't understand why the numbers are so large. I hope someone can give us some advice.

niuniu111 on 7 May 2018

I solved my 'Floating point exception (core dumped)' by modifying the function 'is_valid' inside the function 'filter_roidb' in the file da-faster-rcnn-master/lib/fast_rcnn/train.py:

def filter_roidb(roidb):
    """Remove roidb entries that have no usable RoIs."""

    def is_valid(entry):
        # Valid images have:
        #   (1) At least one foreground RoI OR
        #   (2) At least one background RoI
        overlaps = entry['max_overlaps']
        # added to handle empty boxes, see https://github.com/rbgirshick/py-faster-rcnn/issues/159
        not_empty = np.zeros(len(entry['max_overlaps']), dtype=bool)
        cur_boxes = entry['boxes']
        for i in range(len(not_empty)):
            if (cur_boxes[i][2] - cur_boxes[i][0] > 1 and cur_boxes[i][3] - cur_boxes[i][1] > 1):
                not_empty[i] = True

        # find boxes with sufficient overlap
        fg_inds = np.where(overlaps >= cfg.TRAIN.FG_THRESH)[0]
        # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
        bg_inds = np.where((overlaps < cfg.TRAIN.BG_THRESH_HI) &
                           (overlaps >= cfg.TRAIN.BG_THRESH_LO) & not_empty)[0]

        # image is only valid if such boxes exist
        valid = len(fg_inds) > 0 or len(bg_inds) > 0

        return valid