Detectron: [enforce fail at conv_pool_op_base.h:253] training my own keypoint dataset

Created on 6 Apr 2018 · 8 Comments · Source: facebookresearch/Detectron

When I train on my own keypoint dataset, the following error appears after 39280 iterations:
RuntimeError: [enforce fail at conv_pool_op_base.h:253] input.size() > 0. Error from operator:
input: "gpu_1/_[pose]_roi_feat" input: "gpu_1/conv_fcn1_w" input: "gpu_1/conv_fcn1_b" output: "gpu_1/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 1 } engine: "CUDNN"

I would really appreciate it if someone could help me solve this problem.

geyuying

All 8 comments

@geyuying Did you solve this problem? I hit the same error after 3100 iterations when training the keypoint model. In addition, resuming from a snapshot runs into the same error after another iteration. I guess the error comes from the data, but my dataset is COCO2014. Did you try to train on the COCO dataset?

Thank you!

ChongZhangZC on 10 Apr 2018

same error here

hxli-aibee on 12 Apr 2018

When I train on my own dataset with Mask R-CNN (with keypoints), I get a similar problem, but it seems no one has solved it yet?
E0605 17:11:07.014443 4777 net_dag.cc:188] Exception from operator chain starting at '' (type 'Concat'): caffe2::EnforceNotMet: [enforce fail at conv_pool_op_base.h:237] input.size() > 0. Error from operator:
input: "gpu_0/_[pose]_roi_feat" input: "gpu_0/conv_fcn1_w" input: "gpu_0/conv_fcn1_b" output: "gpu_0/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"
I0605 17:11:07.014550 4776 context_gpu.cu:305] GPU 0: 5657 MB
I0605 17:11:07.014562 4776 context_gpu.cu:309] Total: 5657 MB
WARNING workspace.py: 185: Original python traceback for operator 283 in network generalized_rcnn in exception above (most recent call last):
WARNING workspace.py: 190: File "tools/train_net.py", line 128, in
WARNING workspace.py: 190: File "tools/train_net.py", line 110, in main
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 53, in train_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 132, in create_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 124, in create
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 89, in generalized_rcnn
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 229, in build_generic_detection_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/optimizer.py", line 40, in build_data_parallel_model
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/optimizer.py", line 63, in _build_forward_graph
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 217, in _single_gpu_build_func
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/model_builder.py", line 302, in _add_roi_keypoint_head
WARNING workspace.py: 190: File "/home/scau2/Downloads/Detectron-master/detectron/modeling/keypoint_rcnn_heads.py", line 214, in add_roi_pose_head_v1convX
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/cnn.py", line 169, in Relu
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/brew.py", line 106, in scope_wrapper
WARNING workspace.py: 190: File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/helpers/nonlinearity.py", line 36, in relu
Traceback (most recent call last):
File "tools/train_net.py", line 128, in
main()
File "tools/train_net.py", line 110, in main
checkpoints = detectron.utils.train.train_model()
File "/home/scau2/Downloads/Detectron-master/detectron/utils/train.py", line 65, in train_model
workspace.RunNet(model.net.Proto().name)
File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 217, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/scau2/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 178, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at conv_pool_op_base.h:237] input.size() > 0. Error from operator:
input: "gpu_0/_[pose]_roi_feat" input: "gpu_0/conv_fcn1_w" input: "gpu_0/conv_fcn1_b" output: "gpu_0/conv_fcn1" name: "" type: "Conv" arg { name: "kernel" i: 3 } arg { name: "exhaustive_search" i: 0 } arg { name: "pad" i: 1 } arg { name: "order" s: "NCHW" } arg { name: "stride" i: 1 } device_option { device_type: 1 cuda_gpu_id: 0 } engine: "CUDNN"

CPFelix on 5 Jun 2018

@rbgirshick can you give us some tips about this problem? Is it caused by a bug in Caffe2, or does the original Mask R-CNN not support other keypoint datasets? I am stuck here.

CPFelix on 7 Jun 2018

I ran into this same issue and found that the error comes from training images with 0 valid objects. You may want to check your annotated images to see whether any of them has zero ground-truth boxes or keypoints.
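
A check like the one described above can be scripted. The following is a minimal sketch, assuming the annotations are in COCO keypoint format (a flat `keypoints` list of `[x, y, v]` triplets, where `v > 0` marks a labeled keypoint). The function name `images_without_visible_keypoints` and the sample data are hypothetical and not part of Detectron.

```python
from collections import defaultdict


def images_without_visible_keypoints(coco):
    """Return ids of images that have no annotation with a labeled keypoint.

    Assumption: `coco` is a dict in COCO keypoint format, i.e. "images"
    entries carry "id", and "annotations" entries carry "image_id" and a
    flat "keypoints" list [x1, y1, v1, x2, y2, v2, ...].
    """
    labeled_per_image = defaultdict(int)
    for ann in coco.get("annotations", []):
        # Every third value, starting at index 2, is a visibility flag.
        visibility = ann.get("keypoints", [])[2::3]
        labeled_per_image[ann["image_id"]] += sum(1 for v in visibility if v > 0)
    return [
        img["id"]
        for img in coco.get("images", [])
        if labeled_per_image[img["id"]] == 0
    ]


# Hypothetical sample: image 1 has a labeled keypoint, image 2 has only
# unlabeled keypoints, image 3 has no annotations at all.
sample = {
    "images": [{"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"image_id": 1, "keypoints": [10, 10, 2, 0, 0, 0]},
        {"image_id": 2, "keypoints": [0, 0, 0, 0, 0, 0]},
    ],
}
suspect_ids = images_without_visible_keypoints(sample)
print(suspect_ids)
```

For a real dataset, the dict would come from `json.load` on the annotation file, and the returned image ids are candidates for inspection or removal.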

pppoe on 16 Jun 2018

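I am facing the same problem when training an R-CNN for masks and keypoints.

As a temporary workaround, wrapping the call at line 66 of train.py works:

try:
    workspace.RunNet(model.net.Proto().name)
except Exception:
    # Skip the failing iteration instead of aborting the whole run.
    # Catching Exception (not a bare except) keeps Ctrl-C working.
    logger.warning("Error in iter {}".format(cur_iter))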

@pppoe if you managed to solve the problem by checking your annotated images, did you automate it with a script? If yes, could you please post it here?

Thanks in advance

orestis-z on 26 Jul 2018

@pppoe, thanks, I had the same issue. I also found that the bounding box coordinates must be confined to the image resolution (i.e. no negative coordinate values), or an exception is thrown.
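
The bounding box constraint can be checked in a similar way. This is a hedged sketch, assuming COCO-style annotations where `bbox` is `[x, y, width, height]` and each image entry carries `width` and `height`. The comment above only mentions negative coordinates explicitly; the upper-bound and non-positive size checks are a stricter reading added here, and `annotations_with_out_of_bounds_boxes` is a hypothetical helper name.

```python
def annotations_with_out_of_bounds_boxes(coco):
    """Return ids of annotations whose box does not fit inside its image.

    Assumption: COCO-style input where "bbox" is [x, y, width, height]
    and every image entry has "id", "width" and "height".
    """
    sizes = {img["id"]: (img["width"], img["height"]) for img in coco["images"]}
    bad_ids = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        img_w, img_h = sizes[ann["image_id"]]
        negative = x < 0 or y < 0
        degenerate = w <= 0 or h <= 0
        overflow = x + w > img_w or y + h > img_h
        if negative or degenerate or overflow:
            bad_ids.append(ann["id"])
    return bad_ids


# Hypothetical sample: one valid box, one with a negative x, one that
# extends past the right edge of a 640x480 image.
sample = {
    "images": [{"id": 1, "width": 640, "height": 480}],
    "annotations": [
        {"id": 10, "image_id": 1, "bbox": [10, 20, 100, 50]},
        {"id": 11, "image_id": 1, "bbox": [-5, 20, 100, 50]},
        {"id": 12, "image_id": 1, "bbox": [600, 400, 100, 50]},
    ],
}
bad_annotation_ids = annotations_with_out_of_bounds_boxes(sample)
print(bad_annotation_ids)
```

Whether to drop such annotations or clip them to the image bounds depends on the dataset; the sketch only reports them.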

artificialbrains on 31 Aug 2018

I ran into the same issue too when I was training e2e_keypoint_rcnn with IMS_PER_BATCH = 1.
Thanks @zamponotiropita for posting the temporary solution, but it may skip some training data during multi-GPU training.
A better way to handle this is to replace lines 61-81 of keypoint_rcnn.py with the following code.

if kp_fg_inds.shape[0] > 0:
    sampled_fg_rois = roidb['boxes'][kp_fg_inds]
    box_to_gt_ind_map = roidb['box_to_gt_ind_map'][kp_fg_inds]

    num_keypoints = gt_keypoints.shape[2]
    sampled_keypoints = -np.ones(
        (len(sampled_fg_rois), gt_keypoints.shape[1], num_keypoints),
        dtype=gt_keypoints.dtype
    )
    for ii in range(len(sampled_fg_rois)):
        ind = box_to_gt_ind_map[ii]
        if ind >= 0:
            sampled_keypoints[ii, :, :] = gt_keypoints[gt_inds[ind], :, :]
            assert np.sum(sampled_keypoints[ii, 2, :]) > 0

    heats, weights = keypoint_utils.keypoints_to_heatmap_labels(
        sampled_keypoints, sampled_fg_rois, M=cfg.KRCNN.HEATMAP_SIZE
    )

    shape = (sampled_fg_rois.shape[0] * cfg.KRCNN.NUM_KEYPOINTS, 1)
    heats = heats.reshape(shape)
    weights = weights.reshape(shape)

else:
    # If there are no fg keypoint rois (it does happen), the network
    # cannot handle empty blobs, so we must provide a heatmap.
    # We simply take the first bg roi, give it an all-zero heatmap, and
    # set its weights to zero (ignore label).
    roi_inds = np.where(roidb['gt_classes'] == 0)[0]
    # sampled_fg_rois is actually just the first bg roi here, which is fine.
    sampled_fg_rois = roidb['boxes'][roi_inds[0]].reshape((1, -1))
    # We give it an all-zero blob.
    heats = blob_utils.zeros((1 * cfg.KRCNN.NUM_KEYPOINTS, 1))
    # We set weights to 0, so the loss won't consider this roi.
    weights = blob_utils.zeros((1 * cfg.KRCNN.NUM_KEYPOINTS, 1))
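
The reasoning behind the fallback branch, that a zero weight removes the dummy roi from the loss, can be illustrated with plain NumPy. This is only a sketch under assumptions: `weighted_softmax_loss` is a hypothetical stand-in for a weighted softmax cross-entropy over heatmap locations, not Detectron's actual loss operator, and the sizes (17 keypoints, a 56x56 heatmap) are illustrative values.

```python
import numpy as np

NUM_KEYPOINTS = 17  # assumption: illustrative value
HEATMAP_SIZE = 56   # assumption: illustrative value


def weighted_softmax_loss(logits, labels, weights):
    """Sum of per-row softmax cross-entropy terms, scaled by `weights`.

    logits: (N, HEATMAP_SIZE * HEATMAP_SIZE), labels: (N,), weights: (N,)
    """
    # Subtract the row maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_row = -log_probs[np.arange(len(labels)), labels]
    return float((weights * per_row).sum())


rng = np.random.RandomState(0)
logits = rng.randn(NUM_KEYPOINTS, HEATMAP_SIZE * HEATMAP_SIZE)

# Dummy roi as in the fallback branch: all-zero labels and all-zero weights.
labels = np.zeros(NUM_KEYPOINTS, dtype=np.int32)
zero_weights = np.zeros(NUM_KEYPOINTS, dtype=np.float32)
dummy_loss = weighted_softmax_loss(logits, labels, zero_weights)

# For contrast, the same labels with unit weights do contribute.
unit_weights = np.ones(NUM_KEYPOINTS, dtype=np.float32)
counted_loss = weighted_softmax_loss(logits, labels, unit_weights)

print(dummy_loss, counted_loss > 0)
```

Because every term is multiplied by its weight, the dummy roi keeps the blobs non-empty without contributing to the weighted loss.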