xgboost (0.80) multi gpu build from source (Could NOT find Nccl)

Created on 14 Aug 2018 · 24 comments · Source: dmlc/xgboost

Hello,

Can anyone help me with the following issue?
I just followed the instructions to build xgboost with multi-GPU support:
https://xgboost.readthedocs.io/en/latest/build.html

I have also set the environment variables with

LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/tmp/nccl_2.2.13/lib
export LD_LIBRARY_PATH
NCCL_INCLUDE_DIR=/tmp/nccl_2.2.13
export NCCL_INCLUDE_DIR
NCCL_LIBRARY=/tmp/nccl_2.2.13/lib
export NCCL_LIBRARY

The log is:
$ cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
-- The C compiler identification is GNU 4.8.5
-- The CXX compiler identification is GNU 4.8.5
-- Check for working C compiler: /usr/lib64/ccache/cc
-- Check for working C compiler: /usr/lib64/ccache/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/lib64/ccache/c++
-- Check for working CXX compiler: /usr/lib64/ccache/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")
-- Found OpenMP: TRUE (found version "3.1")
-- Setting build type to 'Release' as none was specified.
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Performing Test SUPPORT_CXX0X
-- Performing Test SUPPORT_CXX0X - Success
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Success
-- Found OpenMP_C: -fopenmp (found version "3.1")
-- Found OpenMP_CXX: -fopenmp (found version "3.1")
-- Could NOT find GTest (missing: GTEST_LIBRARY GTEST_INCLUDE_DIR GTEST_MAIN_LIBRARY)
CMake Warning at dmlc-core/test/unittest/CMakeLists.txt:37 (message):
Google Test not found

-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "9.2", minimum required is "8.0")
CMake Error at /tmp/cmake-3.12.1/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find Nccl (missing: NCCL_INCLUDE_DIR NCCL_LIBRARY)
Call Stack (most recent call first):
/tmp/cmake-3.12.1/Modules/FindPackageHandleStandardArgs.cmake:378 (_FPHSA_FAILURE_MESSAGE)
cmake/modules/FindNccl.cmake:51 (find_package_handle_standard_args)
CMakeLists.txt:134 (find_package)

-- Configuring incomplete, errors occurred!
See also "/tmp/xgboost_mgpu/build/CMakeFiles/CMakeOutput.log".
See also "/tmp/xgboost_mgpu/build/CMakeFiles/CMakeError.log".

Most helpful comment

You should set the NCCL_ROOT variable as an argument:

cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/tmp/nccl_2.2.13

All 24 comments

You should set the NCCL_ROOT variable as an argument:

cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON -DNCCL_ROOT=/tmp/nccl_2.2.13

Thanks a lot @hcho3
That solved my problem. I will close this issue.

@hcho3
Is there any reason why this binary wheel installs successfully on my machine (after CUDA 9.1 and all other required installations), yet only one GPU is used during training? (I can see it in nvidia-smi.)

BTW, if I installed NCCL2 with sudo apt-get install libnccl2=2.0.0-1+cuda8.0 libnccl-dev=2.0.0-1+cuda8.0, where would my NCCL folder be? I can't find it.

@hlbkin What is the file name of the binary wheel? If it was xgboost-0.80-py2.py3-none-manylinux1_x86_64.whl, it does not support multi-GPU training. Make sure to install the wheel named xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl. This file is not available on PyPI due to its size, so download it from https://s3-us-west-2.amazonaws.com/xgboost-wheels/xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl and run

pip3 install xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl

Also, make sure to set n_gpus=-1 to use all available GPUs.
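For reference, here is a minimal, hypothetical sketch of what those settings look like in the 0.80 Python API; the synthetic data, the number of boosting rounds, and everything other than tree_method and n_gpus are purely illustrative:

# Minimal multi-GPU training sketch for xgboost 0.80 (synthetic data for illustration).
import numpy as np
import xgboost as xgb

X = np.random.rand(100000, 20)   # illustrative feature matrix
y = np.random.rand(100000)       # illustrative labels
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'reg:linear',   # regression objective name used in 0.80
    'tree_method': 'gpu_hist',   # GPU histogram tree construction
    'n_gpus': -1,                # -1 = use all available GPUs
}
bst = xgb.train(params, dtrain, num_boost_round=100)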

@hcho3 Everything works fine! I had just forgotten n_gpus=-1. The xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl wheel correctly uses multiple GPUs (after properly installing matching versions of CUDA 9.1 and NCCL2).

Last question: why does using n_gpus=-1 still fully load all 10 of my CPU cores (100% load on every core)?

I used tree_method='gpu_hist' and predictor='gpu:reg:linear'.

Should I create separate issue for that question?

Is CPU usage high when you set n_gpus=1?

@hcho3
No, with n_gpus=1 only one process is fully loaded. However, I have now run into another problem, which is already mentioned in a separate issue (https://github.com/dmlc/xgboost/issues/3605).

BTW, why don't you include the wheel package in the official GPU build instructions, so people don't have to search through issues to set up the GPU build?

I also wonder whether building CUDA and NCCL on Ubuntu 18.04 (instead of 16.04) might have broken something or had any other effect. Some clear information on that, and on which versions of NCCL and CUDA are known to be stable, would be great.

@hlbkin Multi-GPU training needs to shuffle data through main memory, so the CPU has to be involved.

BTW, why don't you include the wheel package in the official GPU build instructions, so people don't have to search through issues to set up the GPU build?

We recently added instructions for the binary wheels at the top of the Installation Guide: https://xgboost.readthedocs.io/en/latest/build.html. Feedback is welcome.

I also wonder whether building CUDA and NCCL on Ubuntu 18.04 might have broken something or had any other effect...

Are you referring to #3605?

@hcho3
Regarding the multi-GPU shuffling, that is clear, thanks a lot. However, is there a way to limit CPU usage to, say, some percentage of the cores? Also, is there any guidance on when n_gpus>=2 would train a single dataset faster than a single GPU, or is it a purely empirical question that everyone should answer on their own datasets?

I was just referring to the general fact that one slight mistake in the setup (OS/CUDA/other low-level libraries), or a mismatch with what the authors built against, can leave the library failing to compile or run. So it might also be useful to add some guidance to the official instructions on which GPU/installation combinations have been successfully tested.
Your binary wheels may now solve a lot of these problems, so whether such a guide is needed is a question for other users. (People spend a lot of time searching through issues just to set up the GPU version.)

@hlbkin We started providing binary wheels only recently, so hopefully this will solve a lot of installation problems. The wheels are built in an "old" environment (CentOS 7 for Linux, Windows Server 2008 for Windows, both with CUDA 8.0) so as to maximize compatibility. I'd wait and see if people are still having issues.

Also, is there any guidance on when n_gpus>=2 would train a single dataset faster than a single GPU, or is it a purely empirical question that everyone should answer on their own datasets?

Yes, multi-GPU performance is dataset-dependent. In general, the use cases for multi-GPU training are: 1) your data doesn't fit inside a single GPU, and/or 2) your data is a "thin, long table" (many data instances, comparatively few features), since the amount of inter-GPU communication is proportional to the number of features.

is there a way to limit CPU usage to, say, some percentage of the cores?

You can tell the OS to throttle CPU usage: http://blog.scoutapp.com/articles/2014/11/04/restricting-process-cpu-usage-using-nice-cpulimit-and-cgroups. Note, though, that throttling CPU usage may adversely affect GPU performance as well, since the GPUs may end up waiting for data from main memory.
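As an alternative sketch (this is not from the linked article), on Linux you can also restrict the training process's CPU affinity so it can occupy at most a fixed number of cores; the core IDs below are illustrative:

# Sketch: confine the current Python process (and the OpenMP threads it spawns)
# to four cores, capping total CPU usage at roughly 4 cores' worth.
# os.sched_setaffinity is Linux-only; pick core IDs that exist on your machine.
import os

os.sched_setaffinity(0, {0, 1, 2, 3})   # 0 = the current process

# ...then build the DMatrix and call xgb.train(...) afterwards as usual.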

@hcho3
Thanks a lot, your answers are very helpful, and so are the binary wheels. BTW, might the fact that the wheel was built with CUDA 8.0 technically lower performance compared to building with the latest 9.2 and the latest NCCL?
Another question is more about the architecture of production code. What is a reasonably good practice for handling the situation where you launch a training run and it should automatically use one of the free GPUs? One way I see is to pass gpu_id as a parameter of my script, but that is quasi-manual work. Is there a way to detect which GPU is free and assign gpu_id accordingly?

might the fact that the wheel was built with CUDA 8.0 technically lower performance compared to building with the latest 9.2 and the latest NCCL?

This could be the case, but I haven't verified it. The binary wheels aim for maximum compatibility, so I had to make some tradeoffs here.

Is there a way to detect which GPU is free and assign gpu_id accordingly?

You can put the GPUs in exclusive mode so that only one process can use each GPU at a time. Then you can write a short CUDA program to query which GPU is free. (The program would simply attempt to establish a GPU context.)
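As a rough alternative to the exclusive-mode approach described above, here is a heuristic sketch that picks the GPU with the least memory currently in use by parsing nvidia-smi output (it assumes nvidia-smi is on the PATH; the parsing is best-effort):

# Heuristic sketch: choose the GPU with the least memory in use and pass its
# index as gpu_id. This is not the exclusive-mode + CUDA-context approach
# described above; it only inspects nvidia-smi output.
import subprocess

def least_busy_gpu():
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=index,memory.used',
         '--format=csv,noheader,nounits'],
        universal_newlines=True)
    # each line looks like "0, 345" (index, MiB of memory in use)
    gpus = [tuple(int(v) for v in line.split(','))
            for line in out.strip().splitlines()]
    return min(gpus, key=lambda g: g[1])[0]

params = {'tree_method': 'gpu_hist', 'gpu_id': least_busy_gpu()}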


Hi - The docs for compiling GPU support are great. Two suggestions:

  1. Add a description of the -DNCCL_ROOT= switch; I struggled with it until I found this page.
  2. After compiling, add clear instructions for the Python and R installation. For R it's pretty clear, since the build creates a package. For Python it wasn't very clear: do I install from ~/git/xgboost/build, or do I go to ~/git/xgboost/python-package/ and run python setup.py install? Would it pick up the binaries built in the ../build folder?

Let me know if this isn't clear. Thanks.

@esvhd I updated the doc in #3767.

@hcho3

this file is not available on PyPI due to its size, so download it from https://s3-us-west-2.amazonaws.com/xgboost-wheels/xgboost-multigpu-0.80-py2.py3-none-manylinux1_x86_64.whl and run

Now I'm curious how tensorflow-gpu can be downloaded via pip.

https://files.pythonhosted.org/packages/55/7e/bec4d62e9dc95e828922c6cec38acd9461af8abe749f7c9def25ec4b2fdb/tensorflow_gpu-1.12.0-cp36-cp36m-manylinux1_x86_64.whl (281.7MB)

Notice the size of it.

@trivialfis They must have gotten a special exception. See https://pypi.org/help/#file-size-limit.

@hcho3 I see. Thanks!

@trivialfis Deep learning frameworks such as TensorFlow and MXNet are large beasts containing many operators, so they can justify an exception for large upload sizes. I feel that XGBoost is not quite large enough to justify seeking an exception.

@hcho3 No problem. I think the current approach is sufficient; I just came across TensorFlow and noticed its size.

We could actually compile NCCL ourselves, as it is now open source. Not sure whether this gives any size advantage.

@RAMitchell This is great news! I hope there will be Windows support too. As for the size advantage, we'll have to try it out.

@hcho3 I had some trouble building for multi-GPU with NCCL for a day or two, but this thread was very helpful, so thank you. I realize now that I just needed to read the instructions on the installation page a bit more closely, but I ended up here because there was some confusion about the appropriate flags and dynamic linking paths for the NCCL install. I realize this repeats information found elsewhere, but adding it might save others a few hours in the future. I ultimately ended up using a different flag than the one specified in the instructions for NCCL (NCCL_INCLUDE_DIR pointing to the header file).
