In proposal_layer.py's forward() function, I print bbox_deltas.min() and bbox_deltas.max(). At some point the values suddenly become large, overflow, and the process core dumps. Here is the log:
-0.843478 0.695785
-1.53431 1.09048
-2.39332 1.81395
-2.74009 1.98957
-0.368922 0.236118
-0.707322 0.23115
I0115 01:02:23.016412 31390 solver.cpp:242] Iteration 40, loss = 1.8799
I0115 01:02:23.016444 31390 solver.cpp:258] Train net output #0: loss_bbox = 0.040359 (* 1 = 0.040359 loss)
I0115 01:02:23.016451 31390 solver.cpp:258] Train net output #1: loss_cls = 0.240918 (* 1 = 0.240918 loss)
I0115 01:02:23.016456 31390 solver.cpp:258] Train net output #2: rpn_cls_loss = 0.585994 (* 1 = 0.585994 loss)
I0115 01:02:23.016461 31390 solver.cpp:258] Train net output #3: rpn_loss_bbox = 0.883768 (* 1 = 0.883768 loss)
I0115 01:02:23.016472 31390 solver.cpp:571] Iteration 40, lr = 0.001
-5.10243 4.21369
-5.00325 3.99131
-6.54417 1.98163
-8.08618 2.3706
-8.88115 1.3943
-3.91337 0.64184
-1.92415 2.32623
-1.19933 1.09659
-4.12942 3.17897
-4.96536 3.46139
-2.6374 1.79074
-2.49792 1.71725
-17.6217 14.487
-22.151 18.5744
-21.8844 17.797
-16.2733 13.0881
-1.50187 1.95901
-1.43967 1.50989
-14.1043 28.3903
-9.91849 19.1581
-31.6936 8.46099
-27.262 7.1318
-28.2349 125.76
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in exp
pred_w = np.exp(dw) * widths[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:48: RuntimeWarning: overflow encountered in multiply
pred_w = np.exp(dw) * widths[:, np.newaxis]
-26.7032 118.203
-741.881 505.883
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/fast_rcnn/bbox_transform.py:49: RuntimeWarning: overflow encountered in multiply
pred_h = np.exp(dh) * heights[:, np.newaxis]
-692.391 472.156
-9.47346e+25 1.02213e+26
-6.35599e+25 6.85811e+25
nan nan
/home/zerry/Work/Libs/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
./experiments/scripts/faster_rcnn_end2end_handdet.sh: line 39: 31390 Floating point exception(core dumped
Can anyone help figure out what the problem could be?
I have the same issue here.
In my case, it didn't happen with CUDA toolkit 6.5, but after switching to 7.5 it happens on every training run. I reinstalled the OS (Ubuntu 14.04) and all other libraries, but the problem didn't go away.
After many tries and tests, I found it is related to RNG_SEED. For my dataset, using the VGG16 pre-trained model with the default RNG_SEED = 3 always leads to instability of dw and dh (tw and th in the paper).
Symptom: if you print dw.max() and dh.max(), at some point between iterations 20 and 40 they become larger than 10. In the following iterations they can oscillate between a huge value (which can terminate the program as shown above) and almost 0.
With the default random seed, there is still a small chance that the whole training process completes successfully if I try many times. If dw.max() and dh.max() stay below 2 or 3 for the first 200 iterations, training has very likely passed the danger zone and is on the right track. After changing the random seed to another value such as 17, almost every run goes fine. The default value 3 works perfectly for ZF net.
So my conclusion: if you hit the same problem, try changing the RNG_SEED value and watch the maximum values of dw and dh. If training gets past 400 iterations without overflow, it is going to be fine.
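The monitoring described above can be sketched as a small helper. This is an illustration, not code from py-faster-rcnn: the name check_deltas and the default limit of 10 are assumptions (the limit is taken from the symptom described above), and the deltas are assumed to be laid out as (dx, dy, dw, dh) columns. The constant also shows why the float32 log above blows up: np.exp leaves the float32 range once its input passes roughly 88.7.

```python
import numpy as np

def check_deltas(bbox_deltas, limit=10.0):
    """Return (dw_max, dh_max, ok) for an (N, 4*k) array of (dx, dy, dw, dh) deltas.

    Hypothetical helper; `limit` is an assumed threshold, not a library setting.
    """
    dw = bbox_deltas[:, 2::4]
    dh = bbox_deltas[:, 3::4]
    dw_max = float(dw.max())
    dh_max = float(dh.max())
    # NaN or inf fails the isfinite check, large finite values fail the limit.
    ok = bool(np.isfinite(dw_max) and np.isfinite(dh_max)
              and dw_max < limit and dh_max < limit)
    return dw_max, dh_max, ok

# Largest input for which exp() still fits in float32: log(float32 max).
FLOAT32_EXP_LIMIT = float(np.log(np.finfo(np.float32).max))

healthy = np.array([[0.1, -0.2, 0.7, 1.1]], dtype=np.float32)
diverged = np.array([[0.1, -0.2, 125.76, 3.0]], dtype=np.float32)
```

Calling check_deltas on the deltas once per iteration and stopping when ok is False gives an early, readable failure instead of a core dump several iterations later.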
By the way, I tried changing the batch size and RPN batch size in the yml file. After Caffe loaded, it did show the batch size values I set, but GPU memory usage seems the same. Is that normal?
@ZhengRui I have hit a similar problem too. Following your instructions, I printed dw.max() and dh.max(); they are around 3 and 4 respectively. But they suddenly become nan without any gradual change. It is confusing!
Also, after changing the random seed to another value such as 17, it can run properly, but the next time it failed. Larger random seeds seem to run into the same situation. So I have difficulty understanding the function of RNG_SEED. Does it affect the speed of convergence or something else? Looking forward to your reply. Thank you!
RNG_SEED is just the seed for the random number generator; you can change it to any number as long as it works. So change it to a number that gives you a lower chance of hitting this issue. I don't think it affects the speed of convergence or anything else; it's just a random seed.
I got this issue, too. Since my training images are cropped to the target objects, I suspect that bounding boxes close to the image size cause this problem. I haven't found a solution yet.
You can try padding zeros around your training images before feeding them to the network so that the target objects take up a smaller fraction of the image. The network always rescales input images to around 600x1000 or 1000x600 as the first step, so simply downsizing the images won't work; you have to pad.
I also encountered this problem recently; here is my suggestion for checking where the bug comes from. Most importantly, I found the bug shows up when wrong RoI bounding box values (e.g. 65535) are loaded into the db, and it was caused by a faulty implementation of annotation loading.
In my annotation loading method, I copied the PASCAL annotation loader to read the XML file. That method subtracts 1 from every bbox value (x1, y1, x2, y2), so if any value is 0 (unsigned int), subtracting 1 wraps around to 65535.
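The wraparound can be reproduced in a few lines of NumPy. This sketch assumes the boxes are stored as uint16, which matches the 65535 value reported above; the helper names to_zero_based and has_wrapped_coords are hypothetical and not part of py-faster-rcnn.

```python
import numpy as np

# A 0-based annotation: xmin and ymin are already 0.
raw = np.array([[0, 0, 120, 80]], dtype=np.uint16)

# Subtracting 1 in unsigned 16-bit arithmetic wraps 0 around to 65535.
wrapped = raw - np.uint16(1)

def to_zero_based(boxes_1_based):
    """Hypothetical helper: shift 1-based coords to 0-based without wrapping."""
    signed = boxes_1_based.astype(np.int32) - 1
    return np.clip(signed, 0, None).astype(np.uint16)

def has_wrapped_coords(boxes, width, height):
    """True if any coordinate lies outside the image, e.g. a wrapped 65535."""
    boxes = boxes.astype(np.int64)
    return bool((boxes[:, [0, 2]] >= width).any()
                or (boxes[:, [1, 3]] >= height).any())
```

Running a check like has_wrapped_coords over every roidb entry before training is a cheap way to find annotation files that are 0-based when the loader expects 1-based coordinates.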
Has anyone solved this? I tried @RyanLiuNtust @ZhengRui 's method, but it doesn't work for me.
I got this error:
/faster-rcnn-py/tools/../lib/fast_rcnn/bbox_transform.py:50: RuntimeWarning: overflow encountered in exp
pred_h = np.exp(dh) * heights[:, np.newaxis]
faster-rcnn-py/tools/../lib/rpn/proposal_layer.py:176: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
I printed ws and hs, and they turn out to be NaN.
Could anyone help?
A possible solution could be to decrease the base learning rate in the solver.prototxt
As it is recommended here http://caffe.berkeleyvision.org/tutorial/solver.html
Just change the base_lr: 0.001 to 0.0001
Note also that the above settings are merely guidelines, and they're definitely not guaranteed to be optimal (or even work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.
I did try to change the base_lr value and now the NAN value disappeared.
A question... @ZhengRui what do you mean by "You can try to pad zeros around your training images"? Can you give a quick example? Thanks!
*********
***obj***
*********
to
*********************
*********************
*********obj*********
*********************
*********************
Is there a Caffe argument to zero-pad all the images?
I am not sure whether Caffe has one now, but I did it myself as a data augmentation step before anything else:
import numpy as np

def paddingzeros(im, desMin, desMax):
    # Zero-pad so the shorter side is at least desMin and the longer side at least desMax.
    # `//` keeps the pad widths integral on both Python 2 and Python 3.
    if im.shape[0] <= im.shape[1]:
        if im.shape[0] < desMin:
            im = np.pad(im, (((desMin-im.shape[0])//2, (desMin+1-im.shape[0])//2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMax:
            im = np.pad(im, ((0, 0), ((desMax-im.shape[1])//2, (desMax+1-im.shape[1])//2), (0, 0)), 'constant')
    else:
        if im.shape[0] < desMax:
            im = np.pad(im, (((desMax-im.shape[0])//2, (desMax+1-im.shape[0])//2), (0, 0), (0, 0)), 'constant')
        if im.shape[1] < desMin:
            im = np.pad(im, ((0, 0), ((desMin-im.shape[1])//2, (desMin+1-im.shape[1])//2), (0, 0)), 'constant')
    print('after padding: %s' % (im.shape,))
    return im

im = paddingzeros(im, 600, 1000)
Would doing this also require changing the bounding box values in the annotation XML files?
Yes, you also have to update the bounding box annotations for the padded images by adding the offsets in the width and height directions.
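A minimal sketch of that adjustment, assuming boxes are stored as (x1, y1, x2, y2) rows and the image is padded symmetrically, with the leading pad equal to half the size difference rounded down. The function name pad_with_boxes is hypothetical, and for brevity it pads to a fixed target height and width instead of choosing desMin and desMax by orientation.

```python
import numpy as np

def pad_with_boxes(im, boxes, target_h, target_w):
    """Zero-pad an H x W x C image to at least (target_h, target_w) and shift boxes.

    Hypothetical helper; boxes is an (N, 4) array of (x1, y1, x2, y2).
    """
    pad_h = max(target_h - im.shape[0], 0)
    pad_w = max(target_w - im.shape[1], 0)
    top, left = pad_h // 2, pad_w // 2
    padded = np.pad(im, ((top, pad_h - top), (left, pad_w - left), (0, 0)),
                    'constant')
    shifted = boxes.astype(np.int64)  # astype returns a copy
    shifted[:, [0, 2]] += left        # x1, x2 move by the left pad
    shifted[:, [1, 3]] += top         # y1, y2 move by the top pad
    return padded, shifted

im = np.ones((100, 200, 3), dtype=np.uint8)
boxes = np.array([[10, 20, 150, 90]])
padded, shifted = pad_with_boxes(im, boxes, 600, 1000)
```

Only the offsets on the top and left sides matter for the boxes; padding added on the bottom and right does not move any coordinate.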
Can someone explain the intuition behind this issue? Is the backpropagation algorithm oscillating with increasingly bigger steps around the optimum, eventually causing overflows in python?
If so, are these understandings then correct:
I'm trying to get a better understanding of what is going wrong here :).
For me it was @RyanLiuNtust method that worked. Fixed the ground truth xml files so coordinates are 1-based
@neuleaf Have you solved the problem? I have the same problem as you; my output is also NaN. I checked the output of bottom[0], bottom[1], and bottom[2] in proposal_layer.py. Only bottom[2] has values. I also tried @azamattokhtaev's solution, but it doesn't help.
where to change the RNG_SEED value?
@ZhengRui
For me the issue was resolved when lowering the learning rate in solver.prototxt.
The RNG_SEED can be changed in the config file.
@fbi0817 @ZhengRui Did you solve the problem? I am facing the same thing. I tried to reduce the learning rate but had no progress. Suddenly my loss turns to nan and I receive the warning overflow encountered in exp (around iteration 6000). I am using Pascal Voc2007. Could you help me, please?
@ZhengRui @fernandorovai Have you solved the problem? I am also running into the same issue.
Hey I changed the lr to 0.00001, removed the minus 1 while reading the annotations and changed the RNG seed to 17. Still keep getting the error.
I also encountered this problem; changing RNG_SEED to 4 solved it.
I have this problem when running on my own dataset, things to check is
@skyuuka It would be better to check the load_annotation function. How is your function written?
bbox_transform_inv is only used to draw the bbox on the image, so it has nothing to do with training. Reducing the learning rate can avoid the NaN.
The NaN in the loss occurs because dw and dh explode to extremely high values in the initial stages of training when using a large learning rate (lr > 0.01).
The new Detectron code released has a fix for this.
Just update your config file to have a line:
__C.BBOX_XFORM_CLIP = np.log(1000. / 16.)
and add these lines just before pred_ctr_x is computed in your bbox_transform.py:
# Prevent sending too large values into np.exp()
dw = np.minimum(dw, cfg.BBOX_XFORM_CLIP)
dh = np.minimum(dh, cfg.BBOX_XFORM_CLIP)
It works with any RNG_SEED and very high learning rates (lr = 0.01)
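Put together, the clipped transform looks roughly like the sketch below. It is a self-contained approximation of bbox_transform_inv, not a copy of the py-faster-rcnn or Detectron source: the module-level constant BBOX_XFORM_CLIP stands in for cfg.BBOX_XFORM_CLIP, the function name is made up for this example, and the pred_* names follow the lines shown in the warnings above.

```python
import numpy as np

# Stands in for cfg.BBOX_XFORM_CLIP; log(1000 / 16) is about 4.135.
BBOX_XFORM_CLIP = float(np.log(1000. / 16.))

def bbox_transform_inv_clipped(boxes, deltas):
    """Apply (dx, dy, dw, dh) deltas to (x1, y1, x2, y2) boxes, clipping dw/dh."""
    boxes = boxes.astype(deltas.dtype, copy=False)
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy = deltas[:, 0::4], deltas[:, 1::4]
    # Prevent sending too large values into np.exp()
    dw = np.minimum(deltas[:, 2::4], BBOX_XFORM_CLIP)
    dh = np.minimum(deltas[:, 3::4], BBOX_XFORM_CLIP)

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h
    return pred_boxes

# Deltas of the size seen in the log above no longer overflow.
boxes = np.array([[0, 0, 15, 15]], dtype=np.float32)
deltas = np.array([[0.0, 0.0, 125.76, 505.883]], dtype=np.float32)
pred = bbox_transform_inv_clipped(boxes, deltas)
```

With the clip in place, a 16-pixel-wide box can grow to at most about 1000 pixels. Note that the clip only bounds finite values: np.minimum propagates NaN, so deltas that are already NaN still produce NaN boxes.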
@ZhengRui @pyoguy @MenglaiWang @LiberiFatali @RyanLiuNtust
Hi guys, when I train FPN on my own dataset, I get the same error:
I0312 16:25:25.883342 2983 sgd_solver.cpp:106] Iteration 0, lr = 0.0005
/home/zq/py-faster-rcnn/tools/../lib/rpn/proposal_layer.py:175: RuntimeWarning: invalid value encountered in greater_equal
keep = np.where((ws >= min_size) & (hs >= min_size))[0]
Floating point exception (core dumped)
I tried changing lr from 0.001 to 0.0001, but it didn't work. I also changed RNG_SEED, and that didn't work either.
I don't know how to solve it. Please help me, thanks so much!
@meetshah1995 I use CPU-only mode. After applying your solution the problem still exists, though there are no nan values:
ws [1. 1. 1. ... 1. 1. 1.]
hs [1. 1. 1. ... 1. 1. 1.]
min_size 25.600000381469727
keep []
experiments/scripts/faster_rcnn_end2end.sh: line 58: 67407 Floating point exception(core dumped) ./tools/train_net.py --solver models/${PT_DIR}/${NET}/faster_rcnn_end2end/solver.prototxt --weights data/imagenet_models/${NET}.v2.caffemodel --imdb ${TRAIN_IMDB} --iters ${ITERS} --cfg experiments/cfgs/faster_rcnn_end2end.yml ${EXTRA_ARGS}
@MenglaiWang I use multi-GPU mode with your solution, and the problem still exists.
@kingchenchina Seeing so many 1. values, the problem is very likely that the generated proposals are all just one pixel long. Then _filter_boxes(), called in proposal_layer.py, leaves no proposals. The empty proposals are then used as the rois blob, whose reshape raises the floating point exception.
You can use the following strategy to avoid empty proposals:
keep = _filter_boxes(proposals, min_size * im_info[2])
if len(keep) != 0:
    proposals = proposals[keep, :]
    scores = scores[keep]
But then the loss becomes nan, so turning down the learning rate would be a better approach.
I have hit a similar problem too. Following your instructions, I printed dw.max() and dh.max(); they are around 7489.9507 and 11519.379 respectively. I can't understand why the numbers are so large. I hope someone can give us some advice.
I solved my 'Floating point exception (core dumped)' by modifying the function 'is_valid' in function 'filter_roidb' in file da-faster-rcnn-master/lib/fast_rcnn/train.py:
def filter_roidb(roidb):
    """Remove roidb entries that have no usable RoIs."""
    def is_valid(entry):
        # Valid images have:
        # (1) At least one foreground RoI OR
        # (2) At least one background RoI
        overlaps = entry['max_overlaps']
        # added to handle empty boxes, see https://github.com/rbgirshick/py-faster-rcnn/issues/159
        not_empty = np.zeros(len(entry['max_overlaps']), dtype=bool)
        cur_boxes = entry['boxes']
        for i in range(len(not_empty)):
            if (cur_boxes[i][2] - cur_boxes[i][0] > 1 and cur_boxes[i][3] - cur_boxes[i][1] > 1):
                not_empty[i] = True
        # find boxes with sufficient overlap
        fg_inds = np.where(overlaps >= cfg.TRAIN.FG_THRESH)[0]
        # Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI)
        bg_inds = np.where((overlaps < cfg.TRAIN.BG_THRESH_HI) &
                           (overlaps >= cfg.TRAIN.BG_THRESH_LO) & not_empty)[0]
        # image is only valid if such boxes exist
        valid = len(fg_inds) > 0 or len(bg_inds) > 0
        return valid