Detectron: Not able to run GPU for Caffe2/Detectron

Created on 16 Apr 2018 · 44 Comments · Source: facebookresearch/Detectron

Operating system: Ubuntu 16.04
GPU model: GTX 1080 8GB
Python version: 2.7
gcc version: 5.4.0
Framework: Caffe2/Detectron

I have installed CUDA, cuDNN, and NCCL in a conda environment and followed the steps in the installation file. As mentioned there, I used conda to install Caffe2 and the other libraries.

conda install -c caffe2 caffe2-cuda9.0-cudnn7

Then, when I check whether the GPU is working, I get the following:
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libnccl.so.2: cannot open shared object file: No such file or directory
Segmentation fault (core dumped)

I don't know what I am doing wrong or what I am missing. Please let me know.

Flock1
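
The GPU check described in the post above can be sketched roughly as follows. This is a minimal sketch, assuming the installed Caffe2 Python package exposes `workspace.NumCudaDevices()`; the helper name `count_cuda_devices` is hypothetical and not part of Caffe2 or Detectron.

```python
def count_cuda_devices():
    """Return how many CUDA devices Caffe2 reports, or None if Caffe2 cannot be imported."""
    try:
        # Assumption: this is the import path of the Caffe2 Python package.
        from caffe2.python import workspace
    except ImportError:
        return None
    # A build that falls back to CPU-only mode is expected to report 0 here.
    return int(workspace.NumCudaDevices())


if __name__ == "__main__":
    count = count_cuda_devices()
    if count is None:
        print("Caffe2 is not importable in this environment")
    elif count == 0:
        print("Caffe2 imported, but no CUDA devices are visible (CPU-only mode)")
    else:
        print("Caffe2 sees %d CUDA device(s)" % count)
```

A result of 0 together with the `libnccl.so.2` warning would suggest that the GPU build failed to load rather than that the GPU itself is missing.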

All 44 comments

Please try the Docker installation, and then install NVIDIA Docker. After that, try running inside the Docker container.

ambigus9 on 17 Apr 2018

@Flock1 You should install NCCL 2 and make sure that the directory containing libnccl.so.2 is added to your $LD_LIBRARY_PATH. That should solve it.

YefeiGao on 17 Apr 2018

👍2 🎉1
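
The suggestion above (find the directory that holds `libnccl.so.2` and make sure it is on `LD_LIBRARY_PATH`) could be checked with something like the following. It is only a sketch: the function names are hypothetical, and the directories to search are passed on the command line because the thread does not say where NCCL was installed.

```python
import os
import sys


def find_library_dirs(lib_name, search_roots):
    """Walk each root and collect the directories that contain lib_name."""
    found = []
    for root in search_roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if lib_name in filenames and dirpath not in found:
                found.append(dirpath)
    return found


def on_ld_library_path(directory, env=None):
    """Return True if directory is one of the LD_LIBRARY_PATH entries."""
    env = os.environ if env is None else env
    entries = [p for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p]
    target = os.path.realpath(directory)
    return any(os.path.realpath(p) == target for p in entries)


if __name__ == "__main__":
    # Search roots come from the command line; with no arguments nothing is searched.
    for lib_dir in find_library_dirs("libnccl.so.2", sys.argv[1:]):
        if on_ld_library_path(lib_dir):
            print("%s is already on LD_LIBRARY_PATH" % lib_dir)
        else:
            # Line that could be appended to ~/.bashrc.
            print('export LD_LIBRARY_PATH="%s:$LD_LIBRARY_PATH"' % lib_dir)
```

Saved under a hypothetical name such as `find_nccl.py`, it would be run with the conda environment folder as the argument. If it prints nothing, the library was not found under the given roots, which matches the situation reported later in this thread.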

Worked!!!

Flock1 on 17 Apr 2018

@YefeiGao, @ambigus9, I was able to get Caffe2 working. Now, when I test Detectron with the spatial narrow test script, I get the following:

No handlers could be found for logger "caffe2.python.net_drawer"
net_drawer will not run correctly. Please install the correct dependencies.
E0417 14:47:50.412849 22020 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0417 14:47:50.412865 22020 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0417 14:47:50.412868 22020 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Found Detectron ops lib: /home/sarvagya/anaconda3/envs/caffe2/lib/libcaffe2_detectron_ops_gpu.so
E.E
======================================================================
ERROR: test_large_forward (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_spatial_narrow_as_op.py", line 68, in test_large_forward
    self._run_test(A, B)
  File "tests/test_spatial_narrow_as_op.py", line 39, in _run_test
    workspace.RunOperatorOnce(op)
  File "/home/sarvagya/anaconda3/envs/caffe2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 165, in RunOperatorOnce
    return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device function Error from operator: 
input: "A" input: "B" output: "C" name: "" type: "SpatialNarrowAs" device_option { device_type: 1 cuda_gpu_id: 0 }

======================================================================
ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
    self._run_test(A, B, check_grad=True)
  File "tests/test_spatial_narrow_as_op.py", line 39, in _run_test
    workspace.RunOperatorOnce(op)
  File "/home/sarvagya/anaconda3/envs/caffe2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 165, in RunOperatorOnce
    return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device function Error from operator: 
input: "A" input: "B" output: "C" name: "" type: "SpatialNarrowAs" device_option { device_type: 1 cuda_gpu_id: 0 }

----------------------------------------------------------------------
Ran 3 tests in 0.455s

FAILED (errors=2)

What should I do for this?

Flock1 on 17 Apr 2018

@ambigus9, I tried Docker. From my conda caffe2 env, I ran
nvidia-docker run -it caffe2ai/caffe2:latest python -m caffe2.python.operator_test.relu_op_test
and got the following output:

Ran 1 test in 4.695s

OK

So I guess it worked. But I don't know how to use Docker for further applications. Any suggestions?

Flock1 on 17 Apr 2018

@Flock1 What do you want to do? Maybe you can start by running the examples:

python tools/infer_simple.py \
    --cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
    --output-dir /tmp/detectron-visualizations \
    --image-ext jpg \
    --wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
    demo

If your GPU has little VRAM (around 4 GB), please try a lighter model instead:

python2 tools/infer_simple.py \
  --cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml \
  --output-dir /tmp/detectron-visualizations \
  --image-ext jpg \
  --wts https://s3-us-west-2.amazonaws.com/detectron/35857389/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml.01_37_22.KSeq0b5q/output/train/coco_2014_train%3Acoco_2014_valminusminival/generalized_rcnn/model_final.pkl \
  demo

ambigus9 on 17 Apr 2018

@ambigus9, I want to do semantic segmentation on an image, and possibly on videos, and I want to use my GPU for it. As shown in the test above, it failed, so I am guessing that the GPU is still not working. I want to fix that first.

Flock1 on 17 Apr 2018

@Flock1 Did you try the examples above?

ambigus9 on 17 Apr 2018

@ambigus9, no. I'll try it in a couple of hours. But does it use the GPU for segmentation? Also, why do you think the test failed?

Flock1 on 17 Apr 2018

@Flock1 Could you please try the test inside Docker?

ambigus9 on 17 Apr 2018

@ambigus9, Sure. I will try that.

Flock1 on 17 Apr 2018

Hi,
I am getting a new error now when I run python2 $DETECTRON/tests/test_spatial_narrow_as_op.py:

E0421 16:01:19.569461 10741 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0421 16:01:19.569483 10741 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0421 16:01:19.569500 10741 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Found Detectron ops lib: /usr/local/lib/libcaffe2_detectron_ops_gpu.so
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
  what():  [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device ordinal 
*** Aborted at 1524306679 (unix time) try "date -d @1524306679" if you are using GNU date ***
PC: @     0x7fe69f7ea428 gsignal
*** SIGABRT (@0x3e8000029f5) received by PID 10741 (TID 0x7fe6a0a65700) from PID 10741; stack trace: ***
    @     0x7fe6a02a0390 (unknown)
    @     0x7fe69f7ea428 gsignal
    @     0x7fe69f7ec02a abort
    @     0x7fe69151cb39 __gnu_cxx::__verbose_terminate_handler()
    @     0x7fe69151b1fb __cxxabiv1::__terminate()
    @     0x7fe69151a640 __cxa_call_terminate
    @     0x7fe69151ae6f __gxx_personality_v0
    @     0x7fe69a4b8564 _Unwind_RaiseException_Phase2
    @     0x7fe69a4b881d _Unwind_RaiseException
    @     0x7fe69151b409 __cxa_throw
    @     0x7fe68f6c2b89 caffe2::CUDAContext::~CUDAContext()
    @     0x7fe68f84e52e caffe2::Operator<>::~Operator()
    @     0x7fe6564abb0a caffe2::SpatialNarrowAsOp<>::~SpatialNarrowAsOp()
    @     0x7fe6564abb3a caffe2::SpatialNarrowAsOp<>::~SpatialNarrowAsOp()
    @     0x7fe68e694acf caffe2::Workspace::RunOperatorOnce()
    @     0x7fe69094feea _ZZN6caffe26python16addGlobalMethodsERN8pybind116moduleEENKUlRKNS1_5bytesEE26_clES6_.isra.2766.constprop.2816
    @     0x7fe690950185 _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKNS_5bytesEE26_bJS8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESQ_
    @     0x7fe69097a4bd pybind11::cpp_function::dispatcher()
    @     0x7fe6a058e645 PyEval_EvalFrameEx
    @     0x7fe6a0590519 PyEval_EvalCodeEx
    @     0x7fe6a058d4b2 PyEval_EvalFrameEx
    @     0x7fe6a0590519 PyEval_EvalCodeEx
    @     0x7fe6a058d4b2 PyEval_EvalFrameEx
    @     0x7fe6a0590519 PyEval_EvalCodeEx
    @     0x7fe6a058d4b2 PyEval_EvalFrameEx
    @     0x7fe6a0590519 PyEval_EvalCodeEx
    @     0x7fe6a05190f7 function_call
    @     0x7fe6a04f47a3 PyObject_Call
    @     0x7fe6a0589500 PyEval_EvalFrameEx
    @     0x7fe6a0590519 PyEval_EvalCodeEx
    @     0x7fe6a051900a function_call
    @     0x7fe6a04f47a3 PyObject_Call
Aborted (core dumped)

Can you help me with this?

Flock1 on 21 Apr 2018

@ambigus9, I tried the example you sent me. I got the following error:
ImportError: No module named pycocotools.mask

Flock1 on 21 Apr 2018
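
An `ImportError` like the one above means a Python dependency is missing from the active environment (`pycocotools` is normally provided by the COCO API Python package). A quick way to list missing modules before running the demo might look like this; the helper name and the module list are assumptions for illustration, not an official list of Detectron requirements.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list: the two modules that appear in the errors in this thread.
    required = ["caffe2.python", "pycocotools.mask"]
    for name in missing_modules(required):
        print("Missing module: %s" % name)
```

Running this inside the same conda environment or Docker container as `tools/infer_simple.py` matters, because each environment has its own set of installed packages.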

@Flock1 Could you please tell me whether you are running inside Docker? Be careful: you must start the container with nvidia-docker, as follows: nvidia-docker run -it <IMAGE ID> /bin/bash

Once you are inside the container, change to the detectron folder and then run:

python tests/test_spatial_narrow_as_op.py

ambigus9 on 21 Apr 2018

@ambigus9, I'm running it in a conda environment. I'm not very experienced with Docker, so I'm still learning it.

Flock1 on 21 Apr 2018

@Flock1 Sure. If you need help, just ask.

ambigus9 on 21 Apr 2018

@ambigus9, thanks for that. But do let me know why the conda environment is failing. I also tried to install Caffe2 from source earlier. Could that be the issue?

Flock1 on 21 Apr 2018

@Flock1 In my experience, I think this is currently an issue with the binary provided by Caffe2. Like you, I tried building from source and setting the environment variables, and it didn't work. You can check these steps.

ambigus9 on 21 Apr 2018

@ambigus9, I know. I regret installing it from source. I want to know how to uninstall Caffe2 if I installed it that way. A conda environment seems like the best option for me right now.

Flock1 on 21 Apr 2018

@YefeiGao, can you help with the issue?

Flock1 on 21 Apr 2018

@ambigus9, I ran it in Docker. For python2 test_spatial_narrow_as_op.py, I am getting the following:

Traceback (most recent call last):
  File "test_spatial_narrow_as_op.py", line 88, in <module>
    utils.c2.import_detectron_ops()
  File "/detectron/lib/utils/c2.py", line 41, in import_detectron_ops
    detectron_ops_lib = envu.get_detectron_ops_lib()
  File "/detectron/lib/utils/env.py", line 71, in get_detectron_ops_lib
    ('Detectron ops lib not found; make sure that your Caffe2 '
AssertionError: Detectron ops lib not found; make sure that your Caffe2 version includes Detectron module

Flock1 on 23 Apr 2018

@Flock1 Are you inside the Docker container and also inside the /detectron folder?

ambigus9 on 23 Apr 2018

@ambigus9, yes. I'm in Docker now. Also, can you help me with the errors from the conda environment?

Flock1 on 23 Apr 2018

@Flock1 Are you running Docker from within a conda environment?

ambigus9 on 23 Apr 2018

@ambigus9, no. I'm using a conda environment.

Flock1 on 24 Apr 2018

@Flock1 Could you please try without the conda environment?

ambigus9 on 24 Apr 2018

@ambigus9, I did. I got the AssertionError that I posted above.

Flock1 on 24 Apr 2018

@Flock1 How did you build the Docker image?

Following the original steps:

cd $DETECTRON/docker
docker build -t detectron:c2-cuda9-cudnn7 .

Running:

nvidia-docker run --rm -it detectron:c2-cuda9-cudnn7 python2 tests/test_batch_permutation_op.py

ambigus9 on 24 Apr 2018

🎉1

@ambigus9, I didn't follow these steps. Wow, that's my bad. I'll try that today. Thanks

Flock1 on 24 Apr 2018

@ambigus9, can you also tell me how to delete caffe2 if I installed it from source?

Flock1 on 24 Apr 2018

@Flock1 You're welcome! Anything you need just let me know.

ambigus9 on 24 Apr 2018

@Flock1 At the moment I don't know how to do that. Maybe you can find some inspiration in the original PyTorch source.

ambigus9 on 24 Apr 2018

@ambigus9, the Docker installation worked!! However, when I ran the Detectron demo code, I got the following:

Found Detectron ops lib: /usr/local/lib/libcaffe2_detectron_ops_gpu.so
E0425 12:02:28.544355   272 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0425 12:02:28.544375   272 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0425 12:02:28.544394   272 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO io.py:  67: Downloading remote file https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl to /tmp/detectron-download-cache/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl

And it's working very slowly. So is my GPU being used?

Flock1 on 25 Apr 2018
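
One way to answer the "is my GPU being used?" question is to poll `nvidia-smi` while the inference script is running. The following is a sketch written for Python 3 (it relies on `shutil.which`), unlike the Python 2.7 setup used elsewhere in this thread. It assumes `nvidia-smi` is available in the environment where it runs; the function names and the dictionary keys are hypothetical.

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, utilization %, memory used (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(value):
    """Convert a CSV field to int, or None for values such as '[Not Supported]'."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_usage(text):
    """Parse the CSV rows printed by the query above into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a three-column row
        rows.append({
            "index": _to_int(parts[0]),
            "util_percent": _to_int(parts[1]),
            "memory_used_mib": _to_int(parts[2]),
        })
    return rows


def query_gpu_usage():
    """Run nvidia-smi and return parsed rows, or None if it is unavailable or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        output = subprocess.check_output(QUERY, universal_newlines=True)
    except (subprocess.CalledProcessError, OSError):
        return None
    return parse_gpu_usage(output)


if __name__ == "__main__":
    usage = query_gpu_usage()
    if usage is None:
        print("nvidia-smi is not available here")
    else:
        for row in usage:
            print(row)
```

If memory use and utilization stay near zero while `tools/infer_simple.py` is past the download step, the work is probably running on the CPU. During the model download itself, low GPU activity is expected.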

@Flock1 I get the same warning. Were you able to complete the infer_simple run?

ambigus9 on 25 Apr 2018

@ambigus9, yeah. I realized that my internet connection was very slow, so downloading the model was taking time. I think it finally works through Docker.

Flock1 on 26 Apr 2018

@Flock1 Perfect! If you need anything, I will help you.

ambigus9 on 26 Apr 2018

@ambigus9, let me know where I can contact you.

Flock1 on 26 Apr 2018

I think this issue can be closed as the problem seems to be solved.

gadcam on 19 May 2018

I don't have administrative rights on my system, so I am installing with conda:
conda install -c anaconda cudnn
conda install -c anaconda nccl (or conda install -c pytorch nccl2)
conda install -c caffe2 caffe2-cuda9.0-cudnn7
But I get the same error:
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libnccl.so.2: cannot open shared object file: No such file or directory
Segmentation fault (core dumped)

Is it because I am installing using conda?

@Flock1 Were you able to solve this problem with @YefeiGao's suggestion?
I cannot find any libnccl.so in the environment folder.

garvita-tiwari on 10 Jul 2018

@garvita-tiwari Did you try following the Dockerfile instructions?

ambigus9 on 10 Jul 2018

@ambigus9 I do not have root privileges on my system, and I do not have Docker either. Is there any way to do this with conda, or is Docker the only option?

garvita-tiwari on 16 Jul 2018

@garvita-tiwari Currently I recommend Docker. Maybe you can try to install Docker on your system.

ambigus9 on 23 Jul 2018

@garvita-tiwari, the WARNING that you're getting appears because Caffe2 is not able to locate libnccl.so.2. You'll need to find that file on your system and add its directory to LD_LIBRARY_PATH in your .bashrc file.
You can also try to use conda environments as mentioned on the website.
Docker will also work.

Flock1 on 24 Jul 2018

blah blah

biswajitcsecu on 14 Sep 2018
