Detectron: Keypoints training using 1GPU error

Created on 10 Mar 2018  ·  8 Comments  ·  Source: facebookresearch/Detectron

I am trying to train the e2e keypoints model on 1 GPU, with the modifications listed below.

  1. Modify train_net.py to use locally downloaded weights:
@@ -107,7 +107,7 @@ def main():
-    assert_and_infer_cfg()
+    assert_and_infer_cfg(cache_urls=False)
  2. Add a new config yaml, similar to e2e_keypoint_rcnn_R-50-FPN_1x.yaml, with these changes:
NUM_GPUS: 8 -> NUM_GPUS: 1
TRAIN:
  WEIGHTS: https://s3-us-west-2.amazonaws.com/detectron/ImageNetPretrained/MSRA/R-50.pkl ->
  WEIGHTS: /home/zoud/Workspace/Github/7oud/Detectron/model/imagenet_pretrained/R-50.pkl

  DATASETS: ('keypoints_coco_2014_train', 'keypoints_coco_2014_valminusminival') ->
  DATASETS: ('keypoints_coco_2014_train',)
  3. Unzip the COCO dataset to lib/datasets/data/coco and run the command:
(caffe2) zoud@i7:~/Workspace/Github/7oud/Detectron$ python2 tools/train_net.py --cfg configs/12_2017_baselines/e2e_keypoint_rcnn_R-50-FPN_1x_1gpu.yaml OUTPUT_DIR tmp/detectron-output/

Actual results

I0310 15:33:29.882928  8712 operator.cc:173] Operator with engine CUDNN is not available for operator SafeEnqueueBlobs.
E0310 15:33:31.579140  8743 pybind_state.h:422] Exception encountered running PythonOp function: AssertionError: Negative areas founds

At:
  /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area
  /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
  /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at pybind_state.h:423] . Exception encountered running PythonOp function: AssertionError: Negative areas founds

At:
  /home/zoud/Workspace/Github/7oud/Detectron/lib/utils/boxes.py(62): boxes_area
  /home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/FPN.py(449): map_rois_to_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(278): _distribute_rois_over_fpn_levels
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(286): _add_multilevel_rois
  /home/zoud/Workspace/Github/7oud/Detectron/lib/roi_data/fast_rcnn.py(121): add_fast_rcnn_blobs
  /home/zoud/Workspace/Github/7oud/Detectron/lib/ops/collect_and_distribute_fpn_rpn_proposals.py(60): forward
 Error from operator: 
input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 0 } debug_info: "  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in <module>\n    main()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n    checkpoints = train_model()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n    model, start_iter, checkpoints, output_dir = create_model()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n    model = 
model_builder.create(cfg.MODEL.TYPE, train=True)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n    return get_func(model_type_func)(model)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n    freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n    optim.build_data_parallel_model(model, _single_gpu_build_func)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n    all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n    all_loss_gradients.update(single_gpu_build_func(model))\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, in _single_gpu_build_func\n    model, blob_conv, dim_conv, spatial_scale_conv\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n    model.CollectAndDistributeFpnRpnProposals()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n    )(blobs_in, blobs_out, name=name)\n  File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in <lambda>\n    **dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n  File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n    op = CreateOperator(op_type, inputs, outputs, **kwargs)\n"Error from operator: 
input: "gpu_0/rpn_rois_fpn2" input: "gpu_0/rpn_rois_fpn3" input: "gpu_0/rpn_rois_fpn4" input: "gpu_0/rpn_rois_fpn5" input: "gpu_0/rpn_rois_fpn6" input: "gpu_0/rpn_roi_probs_fpn2" input: "gpu_0/rpn_roi_probs_fpn3" input: "gpu_0/rpn_roi_probs_fpn4" input: "gpu_0/rpn_roi_probs_fpn5" input: "gpu_0/rpn_roi_probs_fpn6" input: "gpu_0/roidb" input: "gpu_0/im_info" output: "gpu_0/rois" output: "gpu_0/labels_int32" output: "gpu_0/bbox_targets" output: "gpu_0/bbox_inside_weights" output: "gpu_0/bbox_outside_weights" output: "gpu_0/keypoint_rois" output: "gpu_0/keypoint_locations_int32" output: "gpu_0/keypoint_weights" output: "gpu_0/keypoint_loss_normalizer" output: "gpu_0/rois_fpn2" output: "gpu_0/rois_fpn3" output: "gpu_0/rois_fpn4" output: "gpu_0/rois_fpn5" output: "gpu_0/rois_idx_restore_int32" output: "gpu_0/keypoint_rois_fpn2" output: "gpu_0/keypoint_rois_fpn3" output: "gpu_0/keypoint_rois_fpn4" output: "gpu_0/keypoint_rois_fpn5" output: "gpu_0/keypoint_rois_idx_restore_int32" name: "CollectAndDistributeFpnRpnProposalsOp:gpu_0/rpn_rois_fpn2,gpu_0/rpn_rois_fpn3,gpu_0/rpn_rois_fpn4,gpu_0/rpn_rois_fpn5,gpu_0/rpn_rois_fpn6,gpu_0/rpn_roi_probs_fpn2,gpu_0/rpn_roi_probs_fpn3,gpu_0/rpn_roi_probs_fpn4,gpu_0/rpn_roi_probs_fpn5,gpu_0/rpn_roi_probs_fpn6,gpu_0/roidb,gpu_0/im_info" type: "Python" arg { name: "grad_input_indices" } arg { name: "token" s: "forward:5" } arg { name: "grad_output_indices" } device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: "  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 281, in <module>\n    main()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 119, in main\n    checkpoints = train_model()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 128, in train_model\n    model, start_iter, checkpoints, output_dir = create_model()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/tools/train_net.py\", line 206, in create_model\n    model = 
model_builder.create(cfg.MODEL.TYPE, train=True)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 124, in create\n    return get_func(model_type_func)(model)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n    freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n    optim.build_data_parallel_model(model, _single_gpu_build_func)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 40, in build_data_parallel_model\n    all_loss_gradients = _build_forward_graph(model, single_gpu_build_func)\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/optimizer.py\", line 63, in _build_forward_graph\n    all_loss_gradients.update(single_gpu_build_func(model))\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/model_builder.py\", line 189, in _single_gpu_build_func\n    model, blob_conv, dim_conv, spatial_scale_conv\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/rpn_heads.py\", line 44, in add_generic_rpn_outputs\n    model.CollectAndDistributeFpnRpnProposals()\n  File \"/home/zoud/Workspace/Github/7oud/Detectron/lib/modeling/detector.py\", line 223, in CollectAndDistributeFpnRpnProposals\n    )(blobs_in, blobs_out, name=name)\n  File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2137, in <lambda>\n    **dict(chain(viewitems(kwargs), viewitems(core_kwargs)))\n  File \"/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n    op = CreateOperator(op_type, inputs, outputs, **kwargs)\n"
*** Aborted at 1520667211 (unix time) try "date -d @1520667211" if you are using GNU date ***
PC: @     0x7fb544471428 gsignal
*** SIGABRT (@0x3e8000021f7) received by PID 8695 (TID 0x7fb3af7fe700) from PID 8695; stack trace: ***
    @     0x7fb544f27390 (unknown)
    @     0x7fb544471428 gsignal
    @     0x7fb54447302a abort
    @     0x7fb531518b39 __gnu_cxx::__verbose_terminate_handler()
    @     0x7fb5315171fb __cxxabiv1::__terminate()
    @     0x7fb531517234 std::terminate()
    @     0x7fb531532c8a execute_native_thread_routine_compat
    @     0x7fb544f1d6ba start_thread
    @     0x7fb54454341d clone

System information

  • Operating system: Ubuntu 16.04
  • Compiler version: gcc 5.4.0
  • CUDA version: 8.0
  • cuDNN version: 7.0.5
  • NVIDIA driver version: 384.111
  • GPU models (for all devices if they are not all the same): 1 GTX1080Ti
  • python --version output: 2.7 anaconda

All 8 comments

I changed the datasets back to the original setting:

  DATASETS: ('keypoints_coco_2014_train', 'keypoints_coco_2014_valminusminival')

and the error changed to:

I0310 17:43:43.737247 12519 operator.cc:173] Operator with engine CUDNN is not available for operator GetGPUMemoryUsage.
json_stats: {"accuracy_cls": 0.985315, "eta": "10:34:22", "iter": 240, "loss": NaN, "loss_bbox": NaN, "loss_cls": NaN, "loss_kps": NaN, "loss_rpn_bbox_fpn2": NaN, "loss_rpn_bbox_fpn3": NaN, "loss_rpn_bbox_fpn4": NaN, "loss_rpn_bbox_fpn5": 0.008082, "loss_rpn_bbox_fpn6": 0.002842, "loss_rpn_cls_fpn2": NaN, "loss_rpn_cls_fpn3": NaN, "loss_rpn_cls_fpn4": NaN, "loss_rpn_cls_fpn5": 0.010573, "loss_rpn_cls_fpn6": 0.003227, "lr": 0.013067, "mb_qsize": 64, "mem": 8230, "time": 0.424049}
CRITICAL train_net.py: 159: Loss is NaN, exiting...
INFO loader.py: 126: Stopping enqueue thread
/home/zoud/Prog/anaconda2/envs/caffe2/lib/python2.7/site-packages/numpy/lib/function_base.py:4033: RuntimeWarning: Invalid value encountered in median
  r = func(a, **kwargs)
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
INFO loader.py: 113: Stopping mini-batch loading thread
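
The `json_stats` line in the log above can be parsed with Python's `json` module, which accepts the non-standard `NaN` token by default, so it is easy to list which loss terms have diverged. A minimal sketch; the helper name `nan_fields` is hypothetical and not part of Detectron, and the sample line is a shortened copy of the one above:

```python
import json
import math

PREFIX = "json_stats: "


def nan_fields(log_line):
    """Return the sorted names of all NaN-valued fields in a json_stats line."""
    if not log_line.startswith(PREFIX):
        return []  # not a stats line (e.g. an INFO or CRITICAL message)
    stats = json.loads(log_line[len(PREFIX):])
    return sorted(
        key for key, value in stats.items()
        if isinstance(value, float) and math.isnan(value)
    )


# Shortened copy of the json_stats line from the log above.
line = ('json_stats: {"iter": 240, "loss": NaN, "loss_kps": NaN, '
        '"loss_rpn_cls_fpn5": 0.010573, "lr": 0.013067}')
print(nan_fields(line))
```

As a side observation, the logged `lr` of 0.013067 at iteration 240 is consistent with a BASE_LR of 0.02 (the 8-GPU value) if Detectron's default linear warm-up is assumed to be 500 iterations starting from a factor of 1/3; these defaults are an assumption here, not something stated in the thread.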

I encountered the same problem. So have you solved it?

Not yet. Actually I have no idea at all

Your problem is similar to #16.
It seems something is wrong in your yaml file.
The original yaml is written for 8 GPUs, so you need to edit the SOLVER section using the "linear scaling rule" mentioned in "Getting Started".
For Faster R-CNN, it is:
# Equivalent schedules with...
# 1 GPU:
BASE_LR: 0.0025
MAX_ITER: 60000
STEPS: [0, 30000, 40000]
# 2 GPUs:
BASE_LR: 0.005
MAX_ITER: 30000
STEPS: [0, 15000, 20000]
# 4 GPUs:
BASE_LR: 0.01
MAX_ITER: 15000
STEPS: [0, 7500, 10000]
# 8 GPUs:
BASE_LR: 0.02
MAX_ITER: 7500
STEPS: [0, 3750, 5000]
You need to adjust the solver settings in your yaml file accordingly.
I hope this is useful to you.
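
The schedules quoted in this comment follow one pattern: the learning rate scales with the number of GPUs, and the iteration counts scale inversely. A minimal sketch of that arithmetic, assuming the number of images per GPU stays the same; the function `scale_schedule` is hypothetical and not part of Detectron:

```python
def scale_schedule(base_lr, max_iter, steps, ref_gpus, num_gpus):
    """Rescale a solver schedule written for ref_gpus GPUs to num_gpus GPUs.

    The learning rate grows with the GPU count (and so with the total
    minibatch size); MAX_ITER and STEPS shrink by the same ratio.
    """
    ratio = float(num_gpus) / ref_gpus
    return {
        "BASE_LR": base_lr * ratio,
        "MAX_ITER": int(round(max_iter / ratio)),
        "STEPS": [int(round(step / ratio)) for step in steps],
    }


# 8-GPU reference values quoted in the comment above.
reference = dict(base_lr=0.02, max_iter=7500, steps=[0, 3750, 5000], ref_gpus=8)
for gpus in (1, 2, 4, 8):
    print(gpus, scale_schedule(num_gpus=gpus, **reference))
```

The same function can be applied to the SOLVER values of the 8-GPU keypoint config to get a 1-GPU starting point, although later comments in this thread report that an even lower learning rate was needed.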

I am training on my own dataset. Simply following the linear scaling rule will not be sufficient. You may need to disable the negative-areas assertion (I am not sure about the consequences of this, but the model can be trained, and the output also visualizes well). If you then get NaN loss errors, further reducing the learning rate or increasing the number of GPUs will solve the problem.

import numpy as np

def boxes_area(boxes):
    """Compute the area of an array of boxes."""
    w = (boxes[:, 2] - boxes[:, 0] + 1)
    h = (boxes[:, 3] - boxes[:, 1] + 1)
    areas = w * h
    # Workaround: report boxes with negative area and clamp them to zero
    # instead of failing on the original assertion.
    if not np.all(areas >= 0):
        print("Negative areas found", boxes[areas < 0])
        areas[areas < 0] = 0.0
    # assert np.all(areas >= 0), 'Negative areas founds'
    return areas

Reducing BASE_LR from 0.002 to 0.000125 works for me.
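
Before bypassing the assertion, it may help to check why the `areas >= 0` test fails. With the `+ 1` convention used in `boxes_area`, a box has a negative area when exactly one of its width or height is negative; NaN coordinates also fail the check, and the clamping above would leave them as NaN. A minimal sketch for telling the two cases apart, using plain Python so it accepts lists as well as NumPy rows; the helper `classify_boxes` is hypothetical and not part of Detectron:

```python
import math


def classify_boxes(boxes):
    """Return (non_finite, inverted) index lists for (x1, y1, x2, y2) rows.

    non_finite: boxes containing NaN or inf coordinates.
    inverted:   finite boxes whose width or height (with the +1
                convention) is negative.
    """
    non_finite, inverted = [], []
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        coords = [float(v) for v in (x1, y1, x2, y2)]
        if any(math.isnan(c) or math.isinf(c) for c in coords):
            non_finite.append(i)
        elif x2 - x1 + 1 < 0 or y2 - y1 + 1 < 0:
            inverted.append(i)
    return non_finite, inverted


# Made-up example boxes, for illustration only.
boxes = [
    [10.0, 10.0, 50.0, 40.0],        # valid box
    [60.0, 10.0, 20.0, 40.0],        # x2 < x1: negative width
    [float("nan"), 0.0, 5.0, 5.0],   # NaN coordinate
]
print(classify_boxes(boxes))
```

Non-finite entries would suggest that training has already diverged, which would match the NaN losses reported earlier in the thread; in that case lowering the learning rate addresses the cause, while clamping the areas only hides the symptom.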

@taoari
Hi!
Can you explain what the assertion means, please? Will it cause any problems if I just bypass the assertion?
By the way, I found that it works if I reduce the LR.

This should be addressed by 47e457a.
