Hello,
I tried to install alpaka (I cloned from GitHub and did not check out any specific tag) inside a Docker container based on a centos7 image. I am able to compile the vectorAdd example, but it fails to run. If I understand correctly, the GPU cannot be found at runtime:
./example/vectorAdd/vectorAdd
Using alpaka accelerator: AccGpuCudaRt<1,m>
terminate called after throwing an instance of 'std::runtime_error'
what(): Unable to return device handle for device 0. There are only 0 devices!
Aborted (core dumped)
The machine I run on does have a GPU:
00:05.0 VGA compatible controller: NVIDIA Corporation GP104GL [Quadro P4000] (rev a1)
I have already installed gcc 9, CUDA 11 and cmake3 inside the image. Below I post the output of the steps to set up alpaka, in the hope that you can spot what I am missing. I do notice two errors in it, namely:
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed (I am not sure how to fix this, if it needs fixing?)
-- Optional alpaka dependency Boost.Fiber could not be found! Fiber back-end disabled! (I don't think this one is needed for the examples?)
Full output is:
[root@fc09a8eda099 build]# cmake -DCMAKE_INSTALL_PREFIX=/install/ ..
-- The CXX compiler identification is GNU 9.3.1
-- Check for working CXX compiler: /opt/rh/devtoolset-9/root/usr/bin/c++
-- Check for working CXX compiler: /opt/rh/devtoolset-9/root/usr/bin/c++ - works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found Boost: /usr/local/boost_1_73_0 (found suitable version "1.73.0", minimum required is "1.65.1") missing components: fiber context
-- Optional alpaka dependency Boost.Fiber could not be found! Fiber back-end disabled!
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Found CUDA: /usr/local/cuda (found suitable version "11.0", minimum required is "9.0")
-- ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED
-- ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED
-- ALPAKA_ACC_CPU_BT_OMP4_ENABLED
-- ALPAKA_ACC_GPU_CUDA_ENABLED
-- Configuring done
-- Generating done
-- Build files have been written to: /alpaka/build
//install
//then
cmake -DALPAKA_ACC_GPU_CUDA_ENABLE=ON ..
-- Optional alpaka dependency Boost.Fiber could not be found! Fiber back-end disabled!
-- Found OpenMP_CXX: -fopenmp
-- Found OpenMP: TRUE
-- ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED
-- ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED
-- ALPAKA_ACC_GPU_CUDA_ENABLED
-- Configuring done
-- Generating done
-- Build files have been written to: /alpaka/build
[root@fc09a8eda099 build]# cmake -Dalpaka_BUILD_EXAMPLES=ON ..
-- Optional alpaka dependency Boost.Fiber could not be found! Fiber back-end disabled!
-- ALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_THREADS_ENABLED
-- ALPAKA_ACC_CPU_B_TBB_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_OMP2_T_SEQ_ENABLED
-- ALPAKA_ACC_CPU_B_SEQ_T_OMP2_ENABLED
-- ALPAKA_ACC_GPU_CUDA_ENABLED
-- The C compiler identification is GNU 9.3.1
-- Check for working C compiler: /opt/rh/devtoolset-9/root/usr/bin/cc
-- Check for working C compiler: /opt/rh/devtoolset-9/root/usr/bin/cc - works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /alpaka/build
make vectorAdd
[ 50%] Building NVCC (Device) object example/vectorAdd/CMakeFiles/vectorAdd.dir/src/vectorAdd_generated_vectorAdd.cpp.o
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
Scanning dependencies of target vectorAdd
[100%] Linking CXX executable vectorAdd
[100%] Built target vectorAdd
[root@fc09a8eda099 build]# ./example/vectorAdd/vectorAdd
Using alpaka accelerator: AccGpuCudaRt<1,m>
terminate called after throwing an instance of 'std::runtime_error'
what(): Unable to return device handle for device 0. There are only 0 devices!
Aborted (core dumped)
Thanks,
Mark
Hello @shefmarkh ,
It does indeed look like alpaka does not see your CUDA GPU. Since internally we just call the corresponding CUDA API function, I assume the CUDA runtime itself does not see it either. In the attached log everything seems fine (see my explanation of the other two messages below; they are unrelated). Could you check whether the GPU is visible to, e.g., some standard CUDA examples?
Regarding the two other messages:
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed (I am not sure how to fix this, if it needs fixing?)
-- Optional alpaka dependency Boost.Fiber could not be found! Fiber back-end disabled! (I don't think this one is needed for the examples?)
You are right that those are not errors. They just mean that some optional alpaka dependencies were not found during configuration, so the corresponding parts, such as the Boost.Fiber back-end, will not be built. By default alpaka builds everything that is found; however, one can also enable a back-end explicitly, in which case a missing dependency results in a hard error.
Perhaps @SimeonEhrig or @psychocoderHPC have an idea, since unlike me they have experience with containers?
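Before building any CUDA example, a quick first check of GPU visibility is whether any GPU device nodes were passed into the container at all. The sketch below is only an illustration: the helper name count_nvidia_nodes is hypothetical, and it assumes the usual Linux naming of NVIDIA device nodes (/dev/nvidia0, /dev/nvidia1, ...). A non-zero count does not prove that the CUDA runtime can use the GPU; it only rules out the case where nothing was passed through.

```shell
#!/bin/sh
# Hypothetical helper: count NVIDIA GPU device nodes (nvidia0, nvidia1, ...)
# in a directory. The directory defaults to /dev; passing another path is
# only useful for trying the function out.
count_nvidia_nodes() {
    dir="${1:-/dev}"
    count=0
    for node in "$dir"/nvidia[0-9]*; do
        # An unmatched glob stays literal, so confirm the path really exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage inside the container:
#   count_nvidia_nodes    # 0 suggests that no GPU was passed into the container
```

Control nodes such as nvidiactl are deliberately not counted, since the pattern only matches names that continue with a digit.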
Hello,
Yes, I checked, and CUDA code I have written does not run in this container (I verified that the same code runs fine on another machine I use with CUDA). So it may be an issue with the GPU interacting with the container. A quick Google search on the CUDA error I see suggests it is a driver issue for the NVIDIA card, so I may need to figure out how to install the correct drivers inside the container. Thanks for the hint.
Cheers,
Mark
Hi Mark,
Did you run the Docker container with the flag --runtime=nvidia? Without the flag, there is no CUDA support inside the container. CUDA support also requires that the nvidia-docker extension is installed: https://github.com/NVIDIA/nvidia-docker#quick-start
Hello,
I have installed the nvidia plugin with:
yum install -y nvidia-container-toolkit
and restarted docker. The documentation seems to indicate I should use the new --gpus all option. Still the problem persists when I try to run the vectorAdd example inside the container.
I also tried the --runtime=nvidia option, which does not work with docker and this plugin out of the box:
docker: Error response from daemon: Unknown runtime specified nvidia.
Cheers,
Mark
Hi,
I can reproduce the problem on our dev system. This happens when the CUDA mode is not activated. I have tested both Docker arguments, --gpus all and --runtime=nvidia. You can easily test whether the extension works: if nvidia-smi is not available inside the container, the extension does not work.
As far as I can see, NVIDIA GPUs have been natively supported since Docker version 19.03. Which version do you have?
Cheers,
Simeon
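The nvidia-smi test described above could be scripted roughly as follows. This is a minimal sketch, not an official check: the function name check_nvidia_smi and its messages are made up for illustration. It only reports whether nvidia-smi is on the PATH and exits successfully inside the container.

```shell
#!/bin/sh
# Hypothetical helper: report whether nvidia-smi is available and runs
# successfully in the current environment (intended for use inside the container).
check_nvidia_smi() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi: ok"
    else
        echo "nvidia-smi: missing or failing"
    fi
}

# Usage, after starting the container with GPU support, e.g.
#   docker run --gpus all -it <image>
# run inside the container:
#   check_nvidia_smi
```

A passing check is necessary but may not be sufficient: it says the extension exposed the driver tools, not that every CUDA application in the image can reach the device.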
Hello,
I have:
docker --version
Docker version 19.03.12, build 48a66213fe
docker run --gpus all --runtime=nvidia -i -t -v /cvmfs:/cvmfs -v $HOME:$HOME shefmarkh/excalibur:excalibur_cc7_1_5
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
but I can run:
docker run --gpus all -i -t -v /cvmfs:/cvmfs -v $HOME:$HOME shefmarkh/excalibur:excalibur_cc7_1_5
[root@623615e0efa1 /]# nvidia-smi
Tue Jul 28 11:52:43 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:00:05.0 Off | N/A |
| 35% 29C P0 26W / 105W | 0MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@623615e0efa1 build]# ./example/vectorAdd/vectorAdd
Using alpaka accelerator: AccGpuCudaRt<1,m>
terminate called after throwing an instance of 'std::runtime_error'
what(): Unable to return device handle for device 0. There are only 0 devices!
Aborted (core dumped)
so the nvidia-smi command works, but I cannot run alpaka (nor other cuda code I have). So there is something still missing for me.
Cheers,
Mark
I tried your docker image and got a strange error when running nvidia-smi: Failed to initialize NVML: Driver/library version mismatch. I also tried the container nvidia/cuda:latest and everything is fine. Maybe you should also try this container to check your docker setup. Is it possible that I can get the dockerfile from your container? Maybe I can find the problem.
Did you have to install things in the nvidia/cuda image? I tried running:
docker run --gpus all -i -t -v /cvmfs:/cvmfs -v $HOME:$HOME nvidia/cuda:latest
and it does not have cmake.
cmake -DCMAKE_INSTALL_PREFIX=/install/ ..
bash: cmake: command not found
So rather than building the example, I tried running the binary I had previously built in my own container, and in the nvidia container it does indeed run:
./example/vectorAdd/vectorAdd
Using alpaka accelerator: AccGpuCudaRt<1,m>
Execution results correct!
So the issue is related to my own image (based on centos7).
Cheers,
Mark
btw my image is here:
https://hub.docker.com/repository/docker/shefmarkh/excalibur
(It's missing Boost, which alpaka needs. I just installed that in the container based on that image to try alpaka.)
Cheers,
Mark
I tested a small CUDA example instead of Alpaka to see if CUDA applications in general work. You can use this sehrig/alpaka:dev-cuda image from Dockerhub (docker run --gpus all -it sehrig/alpaka:dev-cuda) to test Alpaka. It contains everything you need to build Alpaka with the CUDA backend.
I have already found your image on DockerHub and downloaded and run it. This is how I got the nvidia-smi error. I just ask for the dockerfile (recipe) because otherwise it's very hard to find out what's different between your image and my images. For example, in a simple ubuntu:bionic image nvidia-smi works on my system too.
Simeon
Ah I see what you mean. I used the Dockerfile from:
https://github.com/stfc/grid-workernode.git
in the docker-c7 folder to create my image, which I then added things to such as gcc 9, cmake3 etc.
Cheers,
Mark
I think I found the problem. Did you install the CUDA SDK as described here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=CentOS&target_version=7&target_type=rpmlocal ?
I just installed the CUDA SDK via yum install cuda, without any driver setup, and afterwards nvidia-smi no longer works. Before, it was fine. I also got this error message during the installation:
Failed:
nvidia-driver-latest-cuda.x86_64 3:450.51.05-1.el7 nvidia-persistenced-latest.x86_64 3:450.51.05-1.el7
Both are packages that should not be built into the container. I would suggest that you try an official CUDA image (https://hub.docker.com/r/nvidia/cuda/) as a base image. Otherwise, you should try to install CUDA the way Nvidia does in its containers: https://gitlab.com/nvidia/container-images/cuda/-/tree/master/dist/11.0/centos7-x86_64
These are the official recipes for the CUDA Docker containers.
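To see whether an existing image already contains such driver packages, one could filter the installed package list for them. The sketch below is an assumption-laden illustration: the function name find_driver_packages is hypothetical, and the two name prefixes are taken only from the failure message above, so other driver-related packages may go undetected.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin and print those that look
# like NVIDIA driver packages. The prefixes are assumed from the yum failure
# message quoted above.
find_driver_packages() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E '^nvidia-(driver|persistenced)' || true
}

# Usage inside a CentOS-based container:
#   rpm -qa | find_driver_packages
# Empty output suggests that no such driver packages are installed in the image.
```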
Thanks a lot for the help! I made a new image in which I added the lines from the nvidia Dockerfile to the one I was using, and on loading the image in Docker I find I can run the vectorAdd example :)
Cheers,
Mark
Nice! I will close this issue. Please feel free to re-open or create new ones if there are more issues or questions.