I have installed CUDA, cuDNN, and NCCL in a conda environment and followed the steps in the installation file. I used conda (as mentioned there) to install Caffe2 and the other libraries:
conda install -c caffe2 caffe2-cuda9.0-cudnn7
Then, when I check whether the GPU is working, I get the following:
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libnccl.so.2: cannot open shared object file: No such file or directory
Segmentation fault (core dumped)
I don't know what I am doing wrong or what I am missing. Please let me know.
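For reference, a GPU check of this kind can be sketched in Python as follows. This is a minimal sketch and not necessarily the exact command used in this post: it assumes the `caffe2` package is importable and relies on `workspace.NumCudaDevices()`; the helper name `count_cuda_devices` is hypothetical. It cannot catch a hard crash such as the segmentation fault above, which kills the interpreter before any Python exception is raised.

```python
def count_cuda_devices():
    """Return the number of CUDA devices Caffe2 reports, or None if Caffe2 cannot be imported."""
    try:
        # Imported lazily so that a missing Caffe2 install is reported instead of raising.
        from caffe2.python import workspace
    except ImportError:
        return None
    return workspace.NumCudaDevices()


devices = count_cuda_devices()
if devices is None:
    print("Caffe2 is not importable in this environment")
elif devices == 0:
    print("Caffe2 imported, but it reports no usable GPU")
else:
    print("Caffe2 reports {} CUDA device(s)".format(devices))
```

A result of 0 matches the "CPU only mode" warning: Caffe2 loaded, but without GPU support.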
Please try the Docker installation, and then install NVIDIA Docker. After that, try running inside the container.
@Flock1 You should install NCCL 2 and make sure the directory that contains libnccl.so.2 is added to your $LD_LIBRARY_PATH. That should solve it.
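One way to check whether that advice has taken effect is to look for libnccl.so.2 in each directory listed in LD_LIBRARY_PATH. The sketch below is an illustration only: the helper name `find_library_dirs` is hypothetical, and the dynamic loader also searches other locations (for example the ldconfig cache), so an empty result here does not prove the library is unreachable.

```python
import os


def find_library_dirs(lib_name, search_path):
    """Return the directories in a path-separator-delimited string that contain lib_name."""
    hits = []
    for directory in search_path.split(os.pathsep):
        # Empty entries appear when the variable is unset or has stray separators.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


dirs = find_library_dirs("libnccl.so.2", os.environ.get("LD_LIBRARY_PATH", ""))
if dirs:
    print("libnccl.so.2 found in: " + ", ".join(dirs))
else:
    print("libnccl.so.2 is not in any LD_LIBRARY_PATH directory")
```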
Worked!!!
@YefeiGao, @ambigus9, I was able to get Caffe2 working. Now, when I run Detectron's spatial narrow test (tests/test_spatial_narrow_as_op.py), I get the following:
No handlers could be found for logger "caffe2.python.net_drawer"
net_drawer will not run correctly. Please install the correct dependencies.
E0417 14:47:50.412849 22020 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0417 14:47:50.412865 22020 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0417 14:47:50.412868 22020 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Found Detectron ops lib: /home/sarvagya/anaconda3/envs/caffe2/lib/libcaffe2_detectron_ops_gpu.so
E.E
======================================================================
ERROR: test_large_forward (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 68, in test_large_forward
self._run_test(A, B)
File "tests/test_spatial_narrow_as_op.py", line 39, in _run_test
workspace.RunOperatorOnce(op)
File "/home/sarvagya/anaconda3/envs/caffe2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 165, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device function Error from operator:
input: "A" input: "B" output: "C" name: "" type: "SpatialNarrowAs" device_option { device_type: 1 cuda_gpu_id: 0 }
======================================================================
ERROR: test_small_forward_and_gradient (__main__.SpatialNarrowAsOpTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/test_spatial_narrow_as_op.py", line 59, in test_small_forward_and_gradient
self._run_test(A, B, check_grad=True)
File "tests/test_spatial_narrow_as_op.py", line 39, in _run_test
workspace.RunOperatorOnce(op)
File "/home/sarvagya/anaconda3/envs/caffe2/lib/python2.7/site-packages/caffe2/python/workspace.py", line 165, in RunOperatorOnce
return C.run_operator_once(StringifyProto(operator))
RuntimeError: [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device function Error from operator:
input: "A" input: "B" output: "C" name: "" type: "SpatialNarrowAs" device_option { device_type: 1 cuda_gpu_id: 0 }
----------------------------------------------------------------------
Ran 3 tests in 0.455s
FAILED (errors=2)
What should I do for this?
@ambigus9, I tried Docker and ran the following from my conda caffe2 env:
nvidia-docker run -it caffe2ai/caffe2:latest python -m caffe2.python.operator_test.relu_op_test
I got the following output:
Ran 1 test in 4.695s
OK
So I guess it worked. But I don't know how to use Docker for further applications. Any suggestions?
@Flock1 What do you want to do? Maybe you can start by running the examples:
python tools/infer_simple.py \
--cfg configs/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml \
--output-dir /tmp/detectron-visualizations \
--image-ext jpg \
--wts https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl \
demo
If your GPU has little VRAM (e.g. 4 GB), please try something lighter, as follows:
python2 tools/infer_simple.py \
--cfg configs/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml \
--output-dir /tmp/detectron-visualizations \
--image-ext jpg \
--wts https://s3-us-west-2.amazonaws.com/detectron/35857389/12_2017_baselines/e2e_faster_rcnn_R-50-FPN_2x.yaml.01_37_22.KSeq0b5q/output/train/coco_2014_train%3Acoco_2014_valminusminival/generalized_rcnn/model_final.pkl \
demo
@ambigus9, I want to do semantic segmentation on an image, and possibly on videos, and I want to use my GPU for it. As shown in the test above, it failed, so I am guessing that the GPU is still not working. I want to avoid that.
@Flock1 Did you try the examples above?
@ambigus9, no. I'll try it in a couple of hours. But does it use the GPU for segmentation? Also, why do you think the test failed?
@Flock1 Could you please try the test inside the Docker?
@ambigus9, Sure. I will try that.
Hi,
I am getting a new error now when I run python2 $DETECTRON/tests/test_spatial_narrow_as_op.py:
E0421 16:01:19.569461 10741 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0421 16:01:19.569483 10741 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0421 16:01:19.569500 10741 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
Found Detectron ops lib: /usr/local/lib/libcaffe2_detectron_ops_gpu.so
terminate called after throwing an instance of 'caffe2::EnforceNotMet'
what(): [enforce fail at context_gpu.h:155] . Encountered CUDA error: invalid device ordinal
*** Aborted at 1524306679 (unix time) try "date -d @1524306679" if you are using GNU date ***
PC: @ 0x7fe69f7ea428 gsignal
*** SIGABRT (@0x3e8000029f5) received by PID 10741 (TID 0x7fe6a0a65700) from PID 10741; stack trace: ***
@ 0x7fe6a02a0390 (unknown)
@ 0x7fe69f7ea428 gsignal
@ 0x7fe69f7ec02a abort
@ 0x7fe69151cb39 __gnu_cxx::__verbose_terminate_handler()
@ 0x7fe69151b1fb __cxxabiv1::__terminate()
@ 0x7fe69151a640 __cxa_call_terminate
@ 0x7fe69151ae6f __gxx_personality_v0
@ 0x7fe69a4b8564 _Unwind_RaiseException_Phase2
@ 0x7fe69a4b881d _Unwind_RaiseException
@ 0x7fe69151b409 __cxa_throw
@ 0x7fe68f6c2b89 caffe2::CUDAContext::~CUDAContext()
@ 0x7fe68f84e52e caffe2::Operator<>::~Operator()
@ 0x7fe6564abb0a caffe2::SpatialNarrowAsOp<>::~SpatialNarrowAsOp()
@ 0x7fe6564abb3a caffe2::SpatialNarrowAsOp<>::~SpatialNarrowAsOp()
@ 0x7fe68e694acf caffe2::Workspace::RunOperatorOnce()
@ 0x7fe69094feea _ZZN6caffe26python16addGlobalMethodsERN8pybind116moduleEENKUlRKNS1_5bytesEE26_clES6_.isra.2766.constprop.2816
@ 0x7fe690950185 _ZZN8pybind1112cpp_function10initializeIZN6caffe26python16addGlobalMethodsERNS_6moduleEEUlRKNS_5bytesEE26_bJS8_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESQ_
@ 0x7fe69097a4bd pybind11::cpp_function::dispatcher()
@ 0x7fe6a058e645 PyEval_EvalFrameEx
@ 0x7fe6a0590519 PyEval_EvalCodeEx
@ 0x7fe6a058d4b2 PyEval_EvalFrameEx
@ 0x7fe6a0590519 PyEval_EvalCodeEx
@ 0x7fe6a058d4b2 PyEval_EvalFrameEx
@ 0x7fe6a0590519 PyEval_EvalCodeEx
@ 0x7fe6a058d4b2 PyEval_EvalFrameEx
@ 0x7fe6a0590519 PyEval_EvalCodeEx
@ 0x7fe6a05190f7 function_call
@ 0x7fe6a04f47a3 PyObject_Call
@ 0x7fe6a0589500 PyEval_EvalFrameEx
@ 0x7fe6a0590519 PyEval_EvalCodeEx
@ 0x7fe6a051900a function_call
@ 0x7fe6a04f47a3 PyObject_Call
Aborted (core dumped)
Can you help me with this?
@ambigus9, I tried the example you sent me. I got the following error:
ImportError: No module named pycocotools.mask
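This ImportError means the interpreter running infer_simple.py cannot see `pycocotools`, the Python package of the COCO API. A quick way to confirm which modules are missing from the active environment is sketched below; the helper name `missing_modules` is hypothetical, and the only module name taken from this thread is `pycocotools.mask`.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported in this interpreter."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Run this with the same interpreter (e.g. python2 inside the conda env) that runs Detectron.
print(missing_modules(["pycocotools.mask"]))
```

An empty list means the module is importable; otherwise the COCO API still has to be installed into that environment.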
@Flock1 Could you please tell me whether you are running inside Docker? Please be careful: you must start the container with nvidia-docker, as follows: nvidia-docker run -it <IMAGE ID> /bin/bash
Once you are inside the container, you must be in the detectron folder, and then run:
python tests/test_spatial_narrow_as_op.py
@ambigus9, I'm running it in a conda environment. I'm not very experienced with Docker, so I'm still learning it.
@Flock1 Sure. If you need help, just ask.
@ambigus9, thanks for that. But do let me know why the conda environment is misbehaving. I also tried to install Caffe2 from source. Could that be the issue?
@Flock1 In my experience, I think this is currently an issue with the binary provided by Caffe2. Like you, I tried building from source and setting the environment variables, and it didn't work. You can check these steps
@ambigus9, I know. I regret installing it from source. I want to know how to uninstall Caffe2 if I installed it that way. The conda environment seems the best option for me right now.
@YefeiGao, can you help with the issue?
@ambigus9, I ran it in docker. For python2 test_spatial_narrow_as_op.py, I am getting the following:
Traceback (most recent call last):
File "test_spatial_narrow_as_op.py", line 88, in <module>
utils.c2.import_detectron_ops()
File "/detectron/lib/utils/c2.py", line 41, in import_detectron_ops
detectron_ops_lib = envu.get_detectron_ops_lib()
File "/detectron/lib/utils/env.py", line 71, in get_detectron_ops_lib
('Detectron ops lib not found; make sure that your Caffe2 '
AssertionError: Detectron ops lib not found; make sure that your Caffe2 version includes Detectron module
@Flock1 Are you inside the Docker container and also inside the /detectron folder?
@ambigus9, yes, I'm in Docker now. Also, can you help me with the errors from the conda environment?
@Flock1 Are you running Docker from a conda environment?
@ambigus9, no. I'm using a conda environment.
@Flock1 Could you please try without the conda environment?
@ambigus9, I did. I got the AssertionError that I posted above.
@Flock1 How did you build the Docker image?
Following the original steps:
cd $DETECTRON/docker
docker build -t detectron:c2-cuda9-cudnn7 .
Running:
nvidia-docker run --rm -it detectron:c2-cuda9-cudnn7 python2 tests/test_batch_permutation_op.py
@ambigus9, I didn't follow these steps. Wow, that's my bad. I'll try that today. Thanks
@ambigus9, can you also tell me how to delete caffe2 if I installed it from source?
@Flock1 You're welcome! Anything you need just let me know.
@Flock1 At the moment I don't know how to do that. Maybe you can find some inspiration in the original PyTorch source.
@ambigus9, the Docker installation worked!! However, when I ran the Detectron demo, I got the following:
Found Detectron ops lib: /usr/local/lib/libcaffe2_detectron_ops_gpu.so
E0425 12:02:28.544355 272 init_intrinsics_check.cc:43] CPU feature avx is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0425 12:02:28.544375 272 init_intrinsics_check.cc:43] CPU feature avx2 is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
E0425 12:02:28.544394 272 init_intrinsics_check.cc:43] CPU feature fma is present on your machine, but the Caffe2 binary is not compiled with it. It means you may not get the full speed of your CPU.
INFO io.py: 67: Downloading remote file https://s3-us-west-2.amazonaws.com/detectron/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl to /tmp/detectron-download-cache/35861858/12_2017_baselines/e2e_mask_rcnn_R-101-FPN_2x.yaml.02_32_51.SgT4y1cO/output/train/coco_2014_train:coco_2014_valminusminival/generalized_rcnn/model_final.pkl
And it's working very slowly. So is my GPU being used?
@Flock1 I have the same warning. Were you able to complete the infer_simple run?
@ambigus9, yeah. I realized that the internet was very slow, so downloading the model was taking time. I think it's finally done through docker.
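To answer the "is my GPU being used?" question directly rather than inferring it from speed, GPU utilization can be polled while inference runs. The sketch below shells out to `nvidia-smi` with its CSV query options; the function names are hypothetical, and it assumes `nvidia-smi` is on the PATH of the machine or container where it runs.

```python
import subprocess


def _to_int(field):
    """Convert a CSV field to int, or None for values such as '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_usage(csv_text):
    """Parse 'utilization, memory used' rows into (percent, MiB) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, mem = [field.strip() for field in line.split(",")]
        rows.append((_to_int(util), _to_int(mem)))
    return rows


def query_gpu_usage():
    """Ask nvidia-smi for per-GPU usage; return None if the tool is missing or fails."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu,memory.used",
        "--format=csv,noheader,nounits",
    ]
    try:
        out = subprocess.check_output(cmd)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_gpu_usage(out.decode("utf-8"))


print(query_gpu_usage())
```

If memory use and utilization stay near zero while the demo is running, the work is probably happening on the CPU, or, as in this case, the process is still waiting on a download.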
@Flock1 Perfect! If you need anything, I will help you.
@ambigus9, let me know where I can contact you.
I think this issue can be closed as the problem seems to be solved.
I don't have administrative rights on my system, so I am installing with conda:
conda install -c anaconda cudnn
conda install -c anaconda nccl
(or: conda install -c pytorch nccl2)
conda install -c caffe2 caffe2-cuda9.0-cudnn7
but I get the same error:
WARNING:root:This caffe2 python run does not have GPU support. Will run in CPU only mode.
WARNING:root:Debug message: libnccl.so.2: cannot open shared object file: No such file or directory
Segmentation fault (core dumped)
Is it because I am installing using conda?
@Flock1 Were you able to solve this problem with @YefeiGao's suggestion?
I cannot find any libnccl.so in the environment folder.
@garvita-tiwari Did you try using the Dockerfile instructions?
@ambigus9 I do not have root privileges on my system and do not have Docker either. Is there any way to do this with conda, or should I use Docker only?
@garvita-tiwari Currently I recommend Docker. Maybe you can try to install Docker on your system.
@garvita-tiwari, the WARNING you're getting appears because the loader cannot locate libnccl.so.2. You'll need to locate that file on your system and add its directory to $LD_LIBRARY_PATH in your .bashrc file.
You can also try to use conda environments as mentioned on the website.
Docker will also work.
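The "locate the file and add it to .bashrc" step can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, and the search root defaults to `CONDA_PREFIX` (set by conda when an environment is activated), which may not be where NCCL was installed on a given system.

```python
import os


def find_files(root, filename):
    """Walk root and return every directory that directly contains filename."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            matches.append(dirpath)
    return matches


def bashrc_export_line(directory):
    """Build a bash line that prepends directory to LD_LIBRARY_PATH without a stray colon."""
    return 'export LD_LIBRARY_PATH="' + directory + '${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"'


root = os.environ.get("CONDA_PREFIX")
if root:
    for directory in find_files(root, "libnccl.so.2"):
        # Print the line instead of editing ~/.bashrc, so the change can be reviewed first.
        print(bashrc_export_line(directory))
else:
    print("CONDA_PREFIX is not set; activate the environment or pass another search root")
```

If nothing is printed for the active environment, the library is not under that prefix, and the NCCL package either was not installed there or ships the file under a different name.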