Detectron: Can not run test case and inference

Created on 9 Mar 2018 · 28 Comments · Source: facebookresearch/Detectron

I installed Detectron, and everything seemed fine until I ran the test case:

=============================

python test_spatial_narrow_as_op.py

It failed with the following message:

Found Detectron ops lib: /home/xxx/anaconda2/lib/libcaffe2_detectron_ops_gpu.so
I0308 22:07:29.731431 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
FI0308 22:07:30.126979 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.382014 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
.I0308 22:07:30.383860 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.384814 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAs.
I0308 22:07:30.385095 29562 operator.cc:173] Operator with engine CUDNN is not available for operator SpatialNarrowAsGradient.

E

ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])

success = RunOperatorOnce(op)

File "/home/xxxx/anaconda2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

======================================================================

FAIL: test_large_forward (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/xxx/anaconda2/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)

raise AssertionError(msg)

AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08

(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[ 3.099715e-01, -1.291913e+00, -2.825952e-01, ...,
-2.258663e-01, -8.814982e-01, 4.408140e-01],
[ 1.377446e+00, 1.170039e+00, 1.164714e-01, ...,...

Ran 3 tests in 1.078s

FAILED (failures=1, errors=1)

=======================================================

If I run inference:

python2 tools/infer_simple.py \
--cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
--output-dir /tmp/detectron-visualizations \
--image-ext jpg \
--wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
demo

I got the following error:

I0308 22:03:05.297256 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.297796 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298099 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298406 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298660 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.298704 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299007 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299317 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299623 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.299666 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.299965 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300297 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300607 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.300649 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.300714 31934 operator.cc:173] Operator with engine CUDNN is not available for operator StopGradient.
I0308 22:03:05.300990 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301300 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301609 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301867 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.301910 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.302211 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302521 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302832 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.302876 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.303180 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303493 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303802 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.303844 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.304164 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304476 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304787 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.304831 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sum.
I0308 22:03:05.305145 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.
I0308 22:03:05.305461 31934 operator.cc:173] Operator with engine CUDNN is not available for operator AffineChannel.

I0308 22:03:05.460626 31934 operator.cc:173] Operator with engine CUDNN is not available for operator Sigmoid.
I0308 22:03:05.460695 31934 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 1.714e-05 secs
I0308 22:03:05.460738 31934 net_dag.cc:61] Number of parallel execution chains 5 Number of operators = 18
INFO infer_simple.py: 111: Processing demo/16004479832_a748d55f21_k.jpg -> /tmp/detectron-visualizations/16004479832_a748d55f21_k.jpg.pdf
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File \"tools/infer_simple.py\", line 147, in \n main(args)\n File
* Aborted at 1520575410 (unix time) try "date -d @1520575410" if you are using GNU date
PC: @ 0x7f1d09bad428 gsignal
SIGABRT (@0x3e800007cbe) received by PID 31934 (TID 0x7f1cba292700) from PID 31934; stack trace: *
@ 0x7f1d0a663390 (unknown)
@ 0x7f1d09bad428 gsignal
@ 0x7f1d09baf02a abort
@ 0x7f1d031bdb39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7f1d031bc1fb __cxxabiv1::__terminate()
@ 0x7f1d031bc234 std::terminate()
@ 0x7f1d031d7c8a execute_native_thread_routine_compat
@ 0x7f1d0a6596ba start_thread
@ 0x7f1d09c7f41d clone
Aborted

Operating system: Ubuntu
Compiler version: gcc
CUDA version: 9.0
cuDNN version: 7.0
NVIDIA driver version: ?
GPU models (for all devices if they are not all the same): TITAN
PYTHONPATH environment variable: ?
python --version output: ?
Anything else that seems relevant: ?
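
Several fields above are left as "?". A minimal sketch of how they could be collected, assuming a Python 3 interpreter (the thread itself uses Python 2.7) and that `nvidia-smi` and `nvcc` are on the PATH when the driver and toolkit are installed. The helper names `run_if_available` and `collect_environment` are made up for this illustration:

```python
import os
import platform
import shutil
import subprocess


def run_if_available(command):
    """Return the stripped stdout of `command`, or None if the tool is missing or fails."""
    if shutil.which(command[0]) is None:
        return None
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=30, check=False
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip()


def collect_environment():
    """Gather some of the fields the issue template asks for."""
    return {
        "python": platform.python_version(),
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
        # nvidia-smi ships with the NVIDIA driver; this prints one "name, driver" line per GPU.
        "gpu_and_driver": run_if_available(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
        ),
        # nvcc ships with the CUDA toolkit and reports the toolkit release.
        "nvcc": run_if_available(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print("{}: {}".format(key, value if value is not None else "not found"))
```

Knowing the exact GPU model matters here, because "no kernel image is available" depends on which GPU architectures the binaries were compiled for.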

deeprun

All 28 comments

Similar errors:

On python tests/test_spatial_narrow_as_op.py:

Found Detectron ops lib: /home/xxxx/anaconda3/envs/detectron/lib/libcaffe2_detectron_ops_gpu.so

F.E

ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "tests/test_spatial_narrow_as_op.py", line 49, in _run_test
res, grad, grad_estimated = gc.CheckSimple(op, [A, B], 0, [0])
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 284, in CheckSimple
outputs_with_grads
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/gradient_checker.py", line 201, in GetLossAndGrad
workspace.RunOperatorsOnce(grad_ops)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 184, in RunOperatorsOnce
success = RunOperatorOnce(op)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 179, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

======================================================================

FAIL: test_large_forward (__main__.SpatialNarrowAsOpTest)

Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "tests/test_spatial_narrow_as_op.py", line 54, in _run_test
np.testing.assert_allclose(C, C_ref, rtol=1e-5, atol=1e-08)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 1396, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/numpy/testing/nose_tools/utils.py", line 779, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-05, atol=1e-08

(mismatch 100.0%)
x: array([[[[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],...
y: array([[[[-1.243985, -2.407127, 1.165339, ..., -0.023202, -0.096644,
-0.096511],
[-0.640857, -0.977031, 0.745425, ..., -0.049333, -1.520961,...

Ran 3 tests in 0.519s

FAILED (failures=1, errors=1)

On python2 tools/infer_simple.py --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml --output-dir /tmp/detectron-visualizations --image-ext jpg --wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl demo

WARNING cnn.py: 40: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
INFO net.py: 57: Loading weights from: /tmp/detectron-download-cache/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl
I0312 13:09:24.344396 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000140145 secs
I0312 13:09:24.344605 378 net_dag.cc:61] Number of parallel execution chains 63 Number of operators = 402
I0312 13:09:24.362937 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000125812 secs
I0312 13:09:24.363134 378 net_dag.cc:61] Number of parallel execution chains 30 Number of operators = 358
I0312 13:09:24.364900 378 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 8.807e-06 secs
I0312 13:09:24.364929 378 net_dag.cc:61] Number of parallel execution chains 5 Number of operators = 18
INFO infer_simple.py: 111: Processing demo/24274813513_0cfd2ce6d0_k.jpg -> /tmp/detectron-visualizations/24274813513_0cfd2ce6d0_k.jpg.pdf
E0312 13:09:24.806742 393 net_dag.cc:203] Exception from operator '' (type 'Sum'): caffe2::EnforceNotMet: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File \"tools/infer_simple.py\", line 147, in \n main(args)\n File \"tools/infer_simple.py\", line 99, in main\n model = infer_engine.initialize_model_from_cfg()\n File \"/home/xxxx/opt/detectron/lib/core/test_engine.py\", line 266, in initialize_model_from_cfg\n model = model_builder.create(cfg.MODEL.TYPE, train=False, gpu_id=gpu_id)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/xxxx/opt/detectron/lib/modeling/optimizer.py\", line 54, in build_data_parallel_model\n single_gpu_build_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 169, in _single_gpu_build_func\n blob_conv, dim_conv, spatial_scale_conv = add_conv_body_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/FPN.py\", line 62, in add_fpn_ResNet101_conv5_body\n model, ResNet.add_ResNet101_conv5_body, fpn_level_info_ResNet101_conv5\n File \"/home/xxxx/opt/detectron/lib/modeling/FPN.py\", line 103, in add_fpn_onto_conv_body\n conv_body_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 46, in add_ResNet101_conv5_body\n return add_ResNet_convX_body(model, (3, 4, 23, 3))\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 101, in add_ResNet_convX_body\n s, dim_in = add_stage(model, \'res2\', p, n1, dim_in, 256, dim_bottleneck, 1)\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 83, in add_stage\n inplace_sum=i < n - 1\n File 
\"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 187, in add_residual_block\n s = model.net.Sum([tr, sc], tr)\n File \"/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py\", line 2047, in \n op_type, args, *kwargs)\n File \"/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, *kwargs)\n"
Original python traceback for operator 14 in network generalized_rcnn in exception above (most recent call last):
File "tools/infer_simple.py", line 147, in <module>
File "tools/infer_simple.py", line 99, in main
File "/home/xxxx/opt/detectron/lib/core/test_engine.py", line 266, in initialize_model_from_cfg
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 124, in create
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 89, in generalized_rcnn
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 229, in build_generic_detection_model
File "/home/xxxx/opt/detectron/lib/modeling/optimizer.py", line 54, in build_data_parallel_model
File "/home/xxxx/opt/detectron/lib/modeling/model_builder.py", line 169, in _single_gpu_build_func
File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 62, in add_fpn_ResNet101_conv5_body
File "/home/xxxx/opt/detectron/lib/modeling/FPN.py", line 103, in add_fpn_onto_conv_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 46, in add_ResNet101_conv5_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 101, in add_ResNet_convX_body
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 83, in add_stage
File "/home/xxxx/opt/detectron/lib/modeling/ResNet.py", line 187, in add_residual_block
Traceback (most recent call last):
File "tools/infer_simple.py", line 147, in <module>
main(args)
File "tools/infer_simple.py", line 117, in main
model, im, None, timers=timers
File "/home/xxxx/opt/detectron/lib/core/test.py", line 65, in im_detect_all
scores, boxes, im_scales = im_detect_bbox(model, im, box_proposals)
File "/home/xxxx/opt/detectron/lib/core/test.py", line 154, in im_detect_bbox
workspace.RunNet(model.net.Proto().name)
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 230, in RunNet
StringifyNetName(name), num_iter, allow_fail,
File "/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/workspace.py", line 192, in CallWithExceptionIntercept
return func(*args, **kwargs)
RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "gpu_0/res2_0_branch2c_bn" input: "gpu_0/res2_0_branch1_bn" output: "gpu_0/res2_0_branch2c_bn" name: "" type: "Sum" device_option { device_type: 1 cuda_gpu_id: 0 } debug_info: " File \"tools/infer_simple.py\", line 147, in \n main(args)\n File \"tools/infer_simple.py\", line 99, in main\n model = infer_engine.initialize_model_from_cfg()\n File \"/home/xxxx/opt/detectron/lib/core/test_engine.py\", line 266, in initialize_model_from_cfg\n model = model_builder.create(cfg.MODEL.TYPE, train=False, gpu_id=gpu_id)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 124, in create\n return get_func(model_type_func)(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 89, in generalized_rcnn\n freeze_conv_body=cfg.TRAIN.FREEZE_CONV_BODY\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 229, in build_generic_detection_model\n optim.build_data_parallel_model(model, _single_gpu_build_func)\n File \"/home/xxxx/opt/detectron/lib/modeling/optimizer.py\", line 54, in build_data_parallel_model\n single_gpu_build_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/model_builder.py\", line 169, in _single_gpu_build_func\n blob_conv, dim_conv, spatial_scale_conv = add_conv_body_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/FPN.py\", line 62, in add_fpn_ResNet101_conv5_body\n model, ResNet.add_ResNet101_conv5_body, fpn_level_info_ResNet101_conv5\n File \"/home/xxxx/opt/detectron/lib/modeling/FPN.py\", line 103, in add_fpn_onto_conv_body\n conv_body_func(model)\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 46, in add_ResNet101_conv5_body\n return add_ResNet_convX_body(model, (3, 4, 23, 3))\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 101, in add_ResNet_convX_body\n s, dim_in = add_stage(model, \'res2\', p, n1, dim_in, 256, dim_bottleneck, 1)\n File \"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 83, in add_stage\n inplace_sum=i < n - 1\n File 
\"/home/xxxx/opt/detectron/lib/modeling/ResNet.py\", line 187, in add_residual_block\n s = model.net.Sum([tr, sc], tr)\n File \"/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py\", line 2047, in \n op_type, *args, *kwargs)\n File \"/home/xxxx/anaconda3/envs/detectron/lib/python2.7/site-packages/caffe2/python/core.py\", line 2024, in _CreateAndAddToSelf\n op = CreateOperator(op_type, inputs, outputs, **kwargs)\n"

OS: Ubuntu 17.04
CUDA 9.0
cuDNN 7
NVIDIA Driver 390.12
GPU: TITAN Xp
$PYTHONPATH: empty
python --version: Python 2.7.14 :: Anaconda, Inc.

mlprt on 12 Mar 2018

same here

gecong on 14 Mar 2018

any update on this?

anatlin on 18 Mar 2018

@gecong @anatlin

RuntimeError: [enforce fail at context_gpu.h:171] . Encountered CUDA error: no kernel image is available for execution on the device Error from operator:
input: "A" input: "B" input: "C_grad" output: "A_grad" name: "" type: "SpatialNarrowAsGradient" device_option { device_type: 1 cuda_gpu_id: 0 } is_gradient_op: true

I solved this by adding export PYTHONPATH=$PYTHONPATH:/home/user/caffe2/build to my ~/.bashrc file.
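
Whether the PYTHONPATH change takes effect can be checked from Python. A minimal sketch, assuming Python 3 and using two helper names (`locate_module`, `is_on_sys_path`) invented for this illustration. It only reports where a top-level module such as `caffe2` would be imported from; it does not load it:

```python
import importlib.util
import os
import sys


def locate_module(name):
    """Return where a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    if spec.origin:
        return spec.origin
    # Namespace packages have no single origin file; report their search locations instead.
    return os.pathsep.join(spec.submodule_search_locations or [])


def is_on_sys_path(directory):
    """True if `directory` is one of the entries Python searches for imports."""
    target = os.path.realpath(directory)
    return any(os.path.realpath(entry or os.getcwd()) == target for entry in sys.path)


if __name__ == "__main__":
    # Shows whether the conda package or a source build would be picked up.
    print("caffe2 resolves to:", locate_module("caffe2"))
    # The build directory below is the one from the comment above; adjust it to your setup.
    print("build dir on sys.path:", is_on_sys_path("/home/user/caffe2/build"))
```

If the first line prints a path inside an Anaconda `site-packages` directory, the conda binary is still the one being imported, regardless of what was built from source.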

xmengli999 on 19 Mar 2018

@xmengli999

I do not seem to have a caffe2/build directory on my machine.

I installed Caffe2 with Anaconda, and I have the following directories under

~/anaconda2/pkgs/caffe2-cuda9.0-cudnn7-0.8.dev-py27h4e2c0f2_0$ ls -ltr
total 24
drwxrwxr-x 3 gcong gcong 4096 Mar 7 22:04 share
drwxrwxr-x 6 gcong gcong 4096 Mar 7 22:04 include
drwxrwxr-x 2 gcong gcong 4096 Mar 7 22:04 bin
drwxrwxr-x 2 gcong gcong 4096 Mar 7 22:04 test
drwxrwxr-x 4 gcong gcong 4096 Mar 7 22:04 lib
drwxrwxr-x 4 gcong gcong 4096 Mar 7 22:04 info

Could you let me know how I can change the PYTHONPATH?

Thanks a lot

deeprun on 20 Mar 2018

👍1

@rbgirshick is this a caffe2 issue?

anatlin on 21 Mar 2018

I experience the same issue, any update?

mihaifieraru on 25 Mar 2018

same issue here!

ghost on 3 Apr 2018

@deeprun I built from source. You could give that a try.

xmengli999 on 4 Apr 2018

Same problem. I haven't tried building Caffe2 from source.
openSUSE Tumbleweed, CUDA 9.0, cuDNN 7.1

olegantonyan on 6 Apr 2018

The same. cuda 9.0 cudnn 7.1

gzaripov on 10 Apr 2018

Same here, cuda 9.0 and cudnn 7.1.2

rafagjordana on 18 Apr 2018

Facing the same issue on an Azure Data Science VM running Ubuntu 16.04, with Anaconda 2 and a Tesla P40. The build directory is included in PYTHONPATH.

AgrawalAmey on 19 Apr 2018

👍1

Same problem

apli on 20 Apr 2018

Bump

wuharvey on 2 May 2018

Same problem, any update?

nangongtianyi on 3 May 2018

Me too
EDIT: I have since fixed this as follows.

I uninstalled CUDA 9.1, because my GPU is a Quadro M4000, which supports only CUDA 8.0:

sudo apt-get remove cuda-9.1
sudo apt-get install cuda-8.0

Make sure to install the matching cuDNN version (7.1.2) and NCCL for CUDA 8.0.
Uninstall Caffe2 and install it again using conda install -c caffe2 caffe2_cuda8.0_cudnn7
Fix .bashrc so that it points to the right directories:

export PATH="$HOME/anaconda3/bin:$PATH:/usr/local/cuda-8.0/bin"
export LD_LIBRARY_PATH="/usr/local/cuda-8.0/lib64:$LD_LIBRARY_PATH"

So the point is that you have to make sure your GPU supports the CUDA version you are using.
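
After switching toolkits like this, stale entries for the old CUDA version can remain in the environment. A minimal sketch, assuming Python 3 and versioned install directories of the form `/usr/local/cuda-8.0` as in the exports above; it does not recognize an unversioned `/usr/local/cuda` symlink. The function names are invented for this illustration:

```python
import os
import re

# Matches versioned toolkit directories such as /usr/local/cuda-8.0.
CUDA_DIR = re.compile(r"cuda-(\d+\.\d+)")


def cuda_versions_in(path_value):
    """Return the set of CUDA versions named in a PATH-style string."""
    versions = set()
    for entry in path_value.split(os.pathsep):
        match = CUDA_DIR.search(entry)
        if match:
            versions.add(match.group(1))
    return versions


def report(environ=os.environ):
    """Map PATH and LD_LIBRARY_PATH to the sorted CUDA versions each one mentions."""
    return {
        variable: sorted(cuda_versions_in(environ.get(variable, "")))
        for variable in ("PATH", "LD_LIBRARY_PATH")
    }


if __name__ == "__main__":
    for variable, versions in report().items():
        note = " (mixed versions)" if len(versions) > 1 else ""
        print("{}: {}{}".format(variable, versions or "no versioned CUDA dir", note))
```

More than one version in the same variable, or different versions in PATH and LD_LIBRARY_PATH, suggests that the compiler and the runtime libraries may come from different toolkits.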

macsermkiat on 3 May 2018

@AgrawalAmey
I have been circling around the same issue for almost 1-1.5 weeks now on Azure Linux Ubuntu 16.04 DSVM. Did you come up with a resolution?

BanuSelinTosun on 11 Jul 2018

It looks like this PR is trying to solve the problem (or at least part of it): https://github.com/pytorch/pytorch/pull/7062. Can you still reproduce the problem with a version of Caffe2/PyTorch that includes this commit?

Another track to follow could be https://github.com/fireice-uk/xmr-stak-nvidia/issues/159#issuecomment-337263030 (see http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/ to get the correct CUDA_ARCH number)
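
The rule behind "no kernel image is available" can be sketched in a few lines. The assumptions are general CUDA compatibility rules, not anything confirmed about the Caffe2 binaries in this thread: machine code (SASS) built for `sm_XY` runs only on devices with the same major version and an equal or higher minor version, while embedded PTX for `compute_XY` can be JIT-compiled by the driver for any device that is at least as new. The architecture lists in the usage lines are hypothetical:

```python
def parse_cc(text):
    """Turn a compute capability string such as "6.1" into the tuple (6, 1)."""
    major, minor = text.split(".")
    return int(major), int(minor)


def can_run(device_cc, sass_archs, ptx_archs):
    """Return True if a binary built for the given architectures has a usable kernel image."""
    device = parse_cc(device_cc)
    for arch in sass_archs:
        built = parse_cc(arch)
        # SASS is only compatible within one major version, on an equal or newer minor.
        if built[0] == device[0] and built[1] <= device[1]:
            return True
    for arch in ptx_archs:
        # PTX is forward compatible: the driver can compile it for any newer device.
        if parse_cc(arch) <= device:
            return True
    return False


if __name__ == "__main__":
    # A TITAN Xp or Tesla P40 has compute capability 6.1.
    print(can_run("6.1", sass_archs=["3.5", "5.2"], ptx_archs=[]))       # no image for 6.x
    print(can_run("6.1", sass_archs=["3.5", "5.2"], ptx_archs=["5.2"]))  # PTX fallback
```

Under these rules, a prebuilt package that lacks both machine code for the device's major version and any older PTX would fail exactly the way the logs above do.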

gadcam on 12 Jul 2018

@BanuSelinTosun Just as a side question : did you try this ? https://docs.microsoft.com/en-us/azure/machine-learning/data-science-virtual-machine/dsvm-ubuntu-intro#caffe2

gadcam on 12 Jul 2018

@BanuSelinTosun Sorry, I couldn't find any solution for the issue.

AgrawalAmey on 12 Jul 2018

@gadcam
Yes, that was the very first thing I tried. The problem with Azure DSVMs is that they already come with CUDA 9 and cuDNN 7, whereas Detectron wants CUDA 8 and cuDNN 6. The preinstalled Caffe2 (built for CUDA 9 and cuDNN 7) a) does not work with Detectron because of the version mismatch and is installed for Python 3 rather than Python 2, and b) conflicts with new Caffe2 installations once CUDA 8 and cuDNN 6 are installed, even if I set everything up in a new environment.

@AgrawalAmey
I tried to file an issue about this with Azure before July 4th, but they are not taking it very seriously. I even talked to one of the Principal Managers at Azure, who only suggested new approaches that did not work.

BanuSelinTosun on 12 Jul 2018

@BanuSelinTosun I had to install Detectron in a very similar environment.

What I would do (I do not know if it is possible in your environment):

Uninstall Caffe2
Uninstall CUDA & cuDNN
Reinstall the correct versions of CUDA & cuDNN (you may also have to switch the graphics card driver version in some setups, if I recall correctly)
Build Caffe2 again, specifying a CUDA_ARCH (you may also have to check that it links against the correct Python via PYTHON_EXECUTABLE / PYTHON_INCLUDE_DIR / PYTHON_LIBRARY)
Run the tests
If everything goes well up to this point, then you should be able to install Detectron

I think it will not be that easy but maybe it will raise some new errors which will give us some new hints to go further.

BTW I do not think Detectron needs CUDA 8: my install is with CUDA Version 9.0.176 & cuDNN 7.0.5.

EDIT : maybe this can be of some help https://docs.microsoft.com/en-US/azure/virtual-machines/linux/n-series-driver-setup#ubuntu-1604-lts

gadcam on 13 Jul 2018

@gadcam, thank you for helping with this issue.
I had working caffe2 in the Azure DSVMs with Cuda 9 and cudnn 7
It should be working with those if it worked for you.
I can install Detectron (run the Makefile) without a problem. But when I run the Detectron test, I still get the FAILED (failures=1, errors=1) result.

It always circles back to the same problem as it seems. :-(

I did not try running with inference, should I try that first?

BanuSelinTosun on 13 Jul 2018

👍3

Ok. I have a working detectron now.
My solution was to go the Docker image route. It works: it does not matter whether you have CUDA 9 or 8, cuDNN 7 or 6, or whatever Caffe2 version... it works!
And @AgrawalAmey it works in Azure Linux DSVM. :D

BanuSelinTosun on 13 Jul 2018

Could you explain your solution in a bit more detail? I have this problem and I am having a hard time solving it.

remcova on 10 Oct 2018

@BanuSelinTosun : Could you please explain your solution using docker in more detail?

paritoshgote on 31 Oct 2018

I hit a similar error when using a TensorFlow op compiled with nvcc: Could not launch cub::DeviceSegmentedRadixSort::SortPairsDescending to sort input, temp_storage_bytes: 599295, status: no kernel image is available for execution on the device. I found this issue and learned that the error is caused by the GPU compute capability. I use a Tesla P40 and added -gencode arch=compute_61,code=compute_61 to my build file, which finally solved it. Hope this helps.
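
The flag in the comment above can be derived from the GPU's compute capability. A minimal sketch, assuming Python 3; the function name `gencode_flags` is invented for this illustration, and the output only mirrors the `-gencode arch=...,code=...` syntax that nvcc accepts. `code=sm_XY` requests machine code for that architecture, while `code=compute_XY` (the form used in the comment) embeds PTX that the driver compiles at load time:

```python
def gencode_flags(compute_capability, embed_ptx=True):
    """Build nvcc -gencode arguments for one compute capability such as "6.1"."""
    digits = compute_capability.replace(".", "")
    if not digits.isdigit():
        raise ValueError("expected something like '6.1', got %r" % compute_capability)
    # Machine code (SASS) for exactly this architecture.
    flags = ["-gencode", "arch=compute_{0},code=sm_{0}".format(digits)]
    if embed_ptx:
        # PTX for the same architecture, so newer GPUs can still JIT-compile the kernels.
        flags += ["-gencode", "arch=compute_{0},code=compute_{0}".format(digits)]
    return flags


if __name__ == "__main__":
    # Compute capability 6.1 covers the TITAN Xp and Tesla P40 mentioned in this thread.
    print(" ".join(gencode_flags("6.1")))
```

How these arguments are passed on depends on the build system, so the list form is left for the caller to join or append as needed.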