Singularity: How do the nv option and nvidia/cuda* containers interact?

Created on 3 Dec 2018  ·  5 Comments  ·  Source: hpcng/singularity

I am trying to containerize software that requires different versions of CUDA and cuDNN. I have CUDA 9.1 on the host. I am not exactly sure why, but if I use the Docker image nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04 and build without the --nv option, all software picks up the correct library versions (CUDA 7.5 and cuDNN 5), yet if I try to run things without this option, they do not quite work. As far as I understand, the --nv option just "propagates" CUDA libraries into the container, correct? So containers that are supposed to have them already (nvidia/cuda*) should work without it, correct?

Most helpful comment

"CUDA" is composed of two parts. The first piece is the GPU driver, which consists of kernel and user-space components, e.g. libcuda.so. The second part is the toolkit and associated libraries, e.g. nvcc and cuDNN.

When building CUDA applications, including in a container, only the toolkit is used, and it doesn't require any of the driver components. For instance, I can build CUDA applications on a machine with no GPU hardware and no driver installed. This means that if you use the nvidia/cuda:7.5-cudnn5-devel-ubuntu14.04 image you should be able to build CUDA applications just fine without anything additional being passed into the container from the host.

When running CUDA applications, the driver components are required. When we're dealing with containers, that means the user-space driver libraries must be made available to the container, which is what --nv should do. The reason for not including these libraries statically in the container is that they are effectively version matched to the kernel components, which we can't control from the container.
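The build-time versus run-time split described above can be probed from inside a container. The following is a minimal sketch, not part of the original answer: the helper name `can_load` is hypothetical, and it only asks the dynamic loader whether a library can be opened, which is a rough proxy for "the driver libraries are visible here".

```python
import ctypes


def can_load(library_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library is missing or cannot be loaded.
        return False


if __name__ == "__main__":
    # libcuda.so.1 belongs to the host driver; libcudart.so belongs to the toolkit.
    for name in ("libcuda.so.1", "libcudart.so"):
        status = "found" if can_load(name) else "not found"
        print(f"{name}: {status}")
```

Run inside the image without --nv, the driver library would be expected to be reported as not found, while with --nv it should become loadable if the host driver is installed.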

All 5 comments

Run singularity exec --nv image.sif ls /.singularity.d/libs to see what is being passed through. This path usually gets appended to the end of LD_LIBRARY_PATH at the container's runtime (regardless of passthrough), so check whether there are any conflicts there.
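One way to look for such conflicts is to scan every directory on LD_LIBRARY_PATH for library file names that appear more than once. This is a rough sketch under stated assumptions: the function name `find_shadowed_libraries` is made up for illustration, and it only considers LD_LIBRARY_PATH ordering, ignoring rpath, the ld.so cache, and default system directories.

```python
import os
from collections import defaultdict


def find_shadowed_libraries(search_path: str) -> dict:
    """Map each shared-library file name that appears in more than one
    directory of a colon-separated search path to the directories holding
    it, in search order (the first one is found first on this path)."""
    seen = defaultdict(list)
    visited = set()
    for directory in search_path.split(":"):
        # Skip empty entries, repeats, and paths that do not exist.
        if not directory or directory in visited or not os.path.isdir(directory):
            continue
        visited.add(directory)
        for entry in sorted(os.listdir(directory)):
            if ".so" in entry:
                seen[entry].append(directory)
    return {name: dirs for name, dirs in seen.items() if len(dirs) > 1}


if __name__ == "__main__":
    conflicts = find_shadowed_libraries(os.environ.get("LD_LIBRARY_PATH", ""))
    for name, dirs in sorted(conflicts.items()):
        print(f"{name}: first in {dirs[0]}, also in {', '.join(dirs[1:])}")
```

Running it inside the container with and without --nv and comparing the output should show whether a host library is shadowing, or being shadowed by, one shipped in the image.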

Going to need a bit more info (and maybe typo fixes) to fully understand what's going on.

Dockerfile in question:

ARG repository
FROM ${repository}:7.5-devel-ubuntu14.04
LABEL maintainer "NVIDIA CORPORATION <[email protected]>"
RUN echo "deb https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64 /" > /etc/apt/sources.list.d/nvidia-ml.list
ENV CUDNN_VERSION 5.1.10
LABEL com.nvidia.cudnn.version="${CUDNN_VERSION}"
RUN apt-get update && apt-get install -y --no-install-recommends \
            libcudnn5=$CUDNN_VERSION-1+cuda7.5 \
            libcudnn5-dev=$CUDNN_VERSION-1+cuda7.5 && \
    rm -rf /var/lib/apt/lists/*

@AdamSimpson's answer is fantastic. The only thing I can add is that his answer also explains why the CUDA toolkit is _not_ bind mounted into the container when the --nv option is passed. The CUDA version is not tied to the driver version (though there may be minimum version requirements on the driver). But the CUDA version may be closely tied to the software running in the container, so it needs to be provided within the container by the author.

Thank you for the amazing answers, @AdamSimpson and @GodloveD! What confused me in the first place is that etc/nvliblist.conf mentions libcuda.so, which probably implies that the --nv option forwards it too. Or does it just define driver 'endpoints' for CUDA without a specific CUDA version in mind?

libcuda.so is one of the libraries installed by the CUDA driver, along with the rest of the libraries listed in /etc/nvliblist.conf, and as such it is passed in from the host to the container at runtime. There should only be a single CUDA driver installed on the host at any given time, so libcuda.so is sufficient to identify the correct library (it's a symlink to libcuda.so.<driver-version>).
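Because libcuda.so resolves to libcuda.so.<driver-version>, the driver version can be read off the symlink target. The sketch below assumes that naming scheme holds on the host in question; the helper name `driver_version_from_symlink` and the example path in the usage note are illustrative only.

```python
import os
import re
from typing import Optional


def driver_version_from_symlink(path: str) -> Optional[str]:
    """Follow symlinks from `path` and return the version suffix of the
    resolved file name, or None if the name does not look like
    libcuda.so.<major>.<minor>[...]."""
    resolved = os.path.realpath(path)
    match = re.fullmatch(
        r"libcuda\.so\.(\d+(?:\.\d+)+)", os.path.basename(resolved)
    )
    return match.group(1) if match else None
```

For example, calling it on a path such as /usr/lib/x86_64-linux-gnu/libcuda.so (an assumed location that varies by distribution) would return the driver version string if the symlink chain ends in a versioned file.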

The CUDA toolkit is by default installed under /usr/local/cuda-<version>, and multiple versions can be installed concurrently. Typically /usr/local/cuda is a symlink pointing to the latest version. This toolkit is what's installed in the nvidia/cuda* containers, and it is sufficient to build CUDA applications. The toolkit is necessary but not sufficient to run CUDA applications in the container; you also need the driver libraries passed in through --nv. It might also be worth noting that the toolkit includes the libcudart.so library, which is easy to confuse with the driver library libcuda.so.
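To see which toolkit versions an image or host carries, one can list the cuda-<version> directories. This is a small sketch assuming the default /usr/local/cuda-<version> layout described above; `installed_toolkits` is a hypothetical helper name.

```python
import os
import re


def installed_toolkits(root: str = "/usr/local") -> list:
    """Return the version strings of cuda-<version> directories under
    `root`, sorted numerically (so 10.0 sorts after 9.1)."""
    if not os.path.isdir(root):
        return []
    versions = []
    for entry in os.listdir(root):
        match = re.fullmatch(r"cuda-(\d+(?:\.\d+)*)", entry)
        if match and os.path.isdir(os.path.join(root, entry)):
            versions.append(match.group(1))
    return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


if __name__ == "__main__":
    print("toolkits:", installed_toolkits())
    # The unversioned path is usually a symlink to one of the versions above.
    print("/usr/local/cuda ->", os.path.realpath("/usr/local/cuda"))
```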

From a user perspective, hopefully things are now easier to make sense of. You want to use the --nv flag when running the container and ensure that the host driver meets the minimum requirement for the CUDA toolkit version your container is based on. For CUDA toolkit 7.5 this means verifying that the host CUDA driver is >= v352.32.
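The version check itself is a numeric comparison of dotted version strings. Here is a minimal sketch; the function names are made up for illustration, and the 352.32 threshold is taken from the comment above rather than verified independently. The host driver version can be obtained with, for example, nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
def parse_version(version: str) -> tuple:
    """Turn '352.32' or 'v352.32' into a tuple of ints, e.g. (352, 32)."""
    return tuple(int(part) for part in version.strip().lstrip("v").split("."))


def driver_meets_minimum(driver_version: str, minimum: str) -> bool:
    """Compare component by component as numbers, not as strings."""
    return parse_version(driver_version) >= parse_version(minimum)


if __name__ == "__main__":
    # Minimum for CUDA toolkit 7.5 as quoted in this thread.
    minimum_for_cuda_7_5 = "352.32"
    for host_driver in ("390.46", "346.46"):  # example values only
        ok = driver_meets_minimum(host_driver, minimum_for_cuda_7_5)
        print(f"driver {host_driver}: {'ok' if ok else 'too old'}")
```

Comparing integer tuples avoids the pitfall of string comparison, where "352.9" would incorrectly sort above "352.32".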
