Detectron: 4GPUS + Base_LR 0.01, training ResNet-18: AssertionError: Negative areas founds

Created on 14 Jun 2018 · 5 Comments · Source: facebookresearch/Detectron

Hi!
I am training Faster R-CNN with a ResNet-18 backbone, starting from a pre-trained model, with NUM_GPUS: 4 and Base_LR: 0.01.
`CUDA_VISIBLE_DEVICES=4,5,6,7 python2 tools/train_net.py --cfg experiment/configs/e2e_faster_rcnn_R-18-FPN_2x.yaml OUTPUT_DIR output/detectron-output/R18`

Then I encounter the problem:
......
json_stats: {"accuracy_cls": 0.990042, "eta": "19:20:07", "iter": 480, "loss": 0.624757, "loss_bbox": 0.001191, "loss_cls": 0.068198, "loss_rpn_bbox_fpn2": 0.198201, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.345642, "loss_rpn_cls_fpn3": 0.000143, "loss_rpn_cls_fpn4": 0.000006, "loss_rpn_cls_fpn5": 0.000007, "loss_rpn_cls_fpn6": 0.000000, "lr": 0.009733, "mb_qsize": 64, "mem": 3889, "time": 0.367281}
json_stats: {"accuracy_cls": 0.990837, "eta": "19:18:54", "iter": 500, "loss": 0.652205, "loss_bbox": 0.000014, "loss_cls": 0.067818, "loss_rpn_bbox_fpn2": 0.237794, "loss_rpn_bbox_fpn3": 0.000000, "loss_rpn_bbox_fpn4": 0.000000, "loss_rpn_bbox_fpn5": 0.000000, "loss_rpn_bbox_fpn6": 0.000000, "loss_rpn_cls_fpn2": 0.341258, "loss_rpn_cls_fpn3": 0.001622, "loss_rpn_cls_fpn4": 0.000011, "loss_rpn_cls_fpn5": 0.000016, "loss_rpn_cls_fpn6": 0.000001, "lr": 0.010000, "mb_qsize": 64, "mem": 3889, "time": 0.366938}
E0613 08:03:55.007463 19757 pybind_state.h:409] Exception encountered running PythonOp function: AssertionError: Negative areas founds
At:
/data/chuli/detectron/detectron/utils/boxes.py(62): boxes_area
/data/chuli/detectron/detectron/modeling/FPN.py(514): map_rois_to_fpn_levels
/data/chuli/detectron/detectron/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
/data/chuli/detectron/detectron/roi_data/fast_rcnn.py(286): _add_multilevel_rois
/data/chuli/detectron/detectron/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
/data/chuli/detectron/detectron/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
E0613 08:03:55.013839 19751 pybind_state.h:409] Exception encountered running PythonOp function: AssertionError: Negative areas founds
.......
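
For readers unfamiliar with the check in the traceback: the assertion in `boxes_area` fires when some box area is not `>= 0`. The sketch below is not the repository's code. It is a minimal NumPy illustration, assuming boxes are rows of `[x1, y1, x2, y2]` with the inclusive `+ 1` pixel convention; `box_areas` and `find_bad_boxes` are hypothetical helper names. It also shows why NaN coordinates (which can appear once training diverges) trip the same check: any comparison with NaN is false.

```python
import numpy as np


def box_areas(boxes):
    """Areas of [x1, y1, x2, y2] boxes (assumed inclusive pixel coordinates)."""
    boxes = np.asarray(boxes, dtype=np.float64)
    w = boxes[:, 2] - boxes[:, 0] + 1
    h = boxes[:, 3] - boxes[:, 1] + 1
    return w * h


def find_bad_boxes(boxes):
    """Indices of rows whose area is negative or NaN (hypothetical helper)."""
    areas = box_areas(boxes)
    # `areas >= 0` is False for negative values and for NaN,
    # so negating it flags both kinds of invalid box.
    return np.flatnonzero(~(areas >= 0))


if __name__ == "__main__":
    rois = [
        [0.0, 0.0, 9.0, 9.0],        # valid box
        [10.0, 10.0, 5.0, 20.0],     # x2 < x1, so the area is negative
        [np.nan, 0.0, 1.0, 1.0],     # NaN coordinate, e.g. after divergence
    ]
    print(find_bad_boxes(rois))
```

Logging the offending rows this way, rather than only asserting, can help tell apart genuinely inverted boxes from NaN proposals.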

I changed Base_LR to 0.005, and the same error occurred after 2000 iterations. Then I changed it to 0.0025, and it has been OK so far.
So is the LR really the cause? Do we just have to try different Base_LR values again and again?
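
One observation from the log above: the two `lr` values (0.009733 at iteration 480, 0.010000 at iteration 500) are consistent with a linear warm-up that ends at iteration 500, which is where the assertion fires. A minimal sketch, assuming a 500-iteration linear warm-up that starts at 1/3 of the base LR. Both numbers are assumptions chosen because they reproduce the logged values, and `warmup_lr` is a hypothetical helper, not a Detectron function:

```python
def warmup_lr(cur_iter, base_lr, warmup_iters=500, warmup_factor=1.0 / 3.0):
    """Linear warm-up: ramp from warmup_factor * base_lr up to base_lr.

    warmup_iters and warmup_factor are assumed values, not read from any config.
    """
    if cur_iter >= warmup_iters:
        return base_lr
    alpha = float(cur_iter) / warmup_iters
    # Interpolate the multiplier between warmup_factor (alpha=0) and 1 (alpha=1).
    return base_lr * (warmup_factor * (1.0 - alpha) + alpha)


if __name__ == "__main__":
    for it in (0, 480, 500):
        print(it, round(warmup_lr(it, base_lr=0.01), 6))
```

If that reading is right, the failure coincides with the LR reaching its full value, which fits the report that a smaller Base_LR delays or avoids the error.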

All 5 comments

Getting a very similar error after 10k iterations when I update the learning rate.

This comment https://github.com/facebookresearch/Detectron/issues/267#issuecomment-377339845 suggests that under certain conditions you can bypass this assert and training still works. Maybe you could give it a shot.

@lilichu, can you tell me how to limit the GPU IDs? #551

I also ran into this problem, and I solved it by reducing the learning rate.
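
When lowering the learning rate, a common heuristic is to scale it linearly with the effective batch size and stretch the schedule by the inverse factor. The sketch below only illustrates that arithmetic. The reference schedule in the example (8 GPUs, LR 0.02, 90000 iterations) is a hypothetical baseline, and `scale_schedule` is a made-up helper name. As this thread shows, a given value can still diverge for a particular backbone, so treat the result as a starting point rather than a guarantee.

```python
def scale_schedule(ref_gpus, ref_lr, ref_max_iter, ref_steps, gpus):
    """Scale LR linearly with GPU count and lengthen the schedule to match.

    All reference values are supplied by the caller; nothing here is read
    from a real config file.
    """
    k = float(gpus) / ref_gpus
    lr = ref_lr * k
    max_iter = int(round(ref_max_iter / k))
    steps = [int(round(s / k)) for s in ref_steps]
    return lr, max_iter, steps


if __name__ == "__main__":
    # Hypothetical 8-GPU reference schedule, rescaled for 4 GPUs and 1 GPU.
    print(scale_schedule(8, 0.02, 90000, [0, 60000, 80000], gpus=4))
    print(scale_schedule(8, 0.02, 90000, [0, 60000, 80000], gpus=1))
```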

This is addressed at a more fundamental level by 47e457a.
