nvidia-docker on opensuse

Created on 8 Mar 2018 · 51 comments · Source: NVIDIA/nvidia-docker

hello,

I have tried to install nvidia-docker on openSUSE Leap 42.3 using the CentOS packages:

  • nvidia-docker-2.0.2
  • nvidia-container-runtime-1.2.1-1
  • libnvidia-container_1.0.0
  • nvidia-container-runtime-hook-1.2.1-1

After solving many problems during the installation I finally managed to set up all the packages, but when I test the configuration with

nvidia-docker run --rm nvidia/cuda nvidia-smi

I get

container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/local/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=6568 /var/lib/docker/btrfs/subvolumes/26850be101a4f175ec4d476f1a056526cd9202da1c4c7dfaaef32b7135ce82fb]\\\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\""
docker: Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/local/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=6568 /var/lib/docker/btrfs/subvolumes/26850be101a4f175ec4d476f1a056526cd9202da1c4c7dfaaef32b7135ce82fb]\\\\nnvidia-container-cli: initialization error: cuda error: no cuda-capable device is detected\\\\n\\\"\"".

I think the container cannot recognize my GPU, and I don't know why.
When I run

nvidia-smi

I get this output on my host

| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 760     On   | 00000000:01:00.0 N/A |                  N/A |
| 34%   27C    P8    N/A /  N/A |    283MiB /  4030MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
+-----------------------------------------------------------------------------+

Can you please help me?

thank you.

Labels: enhancement, packaging

Most helpful comment

Gave it a first try at creating the openSUSE RPMs; see the referenced PR. If you want to try it yourself, here is how to do it:

git clone https://github.com/dev-zero/nvidia-container-runtime.git -b opensuse-support nvidia-container-runtime-opensuse
git clone https://github.com/dev-zero/nvidia-docker.git -b opensuse-support nvidia-docker-opensuse

leap_version="15.0"

make -C nvidia-container-runtime-opensuse opensuse_leap${leap_version}
make -C nvidia-docker-opensuse opensuse_leap${leap_version}

# on openSUSE Leap 15.0 add the libnvidia-repo from centos7 for now since the CUDA Toolkit repo does not yet contain packages for it:
sudo zypper ar -c 'https://nvidia.github.io/libnvidia-container/centos7/$basearch' nvidia-container-runtime

# on openSUSE Leap 42.3 you can use my fork of the libnvidia-container repo to also build that component natively:
git clone https://github.com/dev-zero/libnvidia-container.git -b opensuse-support libnvidia-container-opensuse
make -C libnvidia-container-opensuse docker-opensuse_leap:42.3
sudo zypper install libnvidia-container-opensuse/*.rpm

sudo zypper install nvidia-{container-runtime,docker}-opensuse/dist/opensuse_leap${leap_version}/*.rpm
# ignore warnings about unsigned packages

# fully reload the systemd configuration and restart the docker daemon
# this is required since we have to change the flags passed to the docker daemon
sudo systemctl daemon-reload
sudo systemctl restart docker

All 51 comments

Can you check whether the CUDA samples work for you? Having nvidia-smi working on the host doesn't mean CUDA will work on the host.

Install the CUDA toolkit, then go to /usr/local/cuda/samples/1_Utilities/deviceQuery, run make, then ./deviceQuery.

Yes, it works fine; I am already using TensorFlow with GPU support on the host as well.

this is what I did

cuda-install-samples-9.1.sh .
make 
./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 760"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4031 MBytes (4226744320 bytes)
  ( 6) Multiprocessors, (192) CUDA Cores/MP:     1152 CUDA Cores
  GPU Max Clock rate:                            1084 MHz (1.08 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1
Result = PASS

I don't know if it will help, but this is the debugging output:
cat /var/log/nvidia-container-runtime-hook.log

-- WARNING, the following logs are for debugging purposes only --
I0309 20:21:35.121731 20558 nvc.c:274] initializing library context (version=1.0.0, build=be797da00b156493e80f1ae6f38d69f23c932554)
I0309 20:21:35.121856 20558 nvc.c:248] using root /
I0309 20:21:35.121878 20558 nvc.c:249] using ldcache /etc/ld.so.cache
I0309 20:21:35.121895 20558 nvc.c:250] using unprivileged user 65534:65534
I0309 20:21:35.122758 20564 nvc.c:184] loading kernel module nvidia
I0309 20:21:35.123254 20564 nvc.c:196] loading kernel module nvidia_uvm
I0309 20:21:35.123620 20564 nvc.c:204] loading kernel module nvidia_modeset
I0309 20:21:35.124227 20565 driver.c:136] starting driver service
I0309 20:21:35.127627 20558 driver.c:224] driver service terminated with signal 15

Should I provide more information?

I see the issue. On openSUSE you have to be part of the video group to interact with the NVIDIA device files. They belong to root:video with permissions 0660.
That's different from other distros. I think we can fix that in the future, either in libnvidia-container or in nvidia-container-runtime-hook (with --user=0:33).
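As a quick sanity check on such a host, a small sketch like the following can tell whether a user's group list contains video (the `has_group` helper is ours, not part of nvidia-docker):

```shell
# Sketch: check whether a space-separated group list contains a given group.
# On the affected host you would call it as: has_group "$(id -nG)" video
has_group() {
    printf '%s\n' $1 | grep -qx -- "$2"
}

has_group "users video wheel" video && echo "in video group"
```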

In the meantime, it's going to be a bit complicated for you, since we don't officially support openSUSE. The CentOS package won't work since you will also need AppArmor support in runc. You should use the ubuntu16.04 binary of nvidia-container-runtime instead...
For the permission issue, there are two options:
1) Patch the hook:

diff --git a/hook/nvidia-container-runtime-hook/main.go b/hook/nvidia-container-runtime-hook/main.go
index 15a2274..cd182ee 100644
--- a/hook/nvidia-container-runtime-hook/main.go
+++ b/hook/nvidia-container-runtime-hook/main.go
@@ -108,6 +108,7 @@ func doPrestart() {
        if cli.Ldcache != nil {
                args = append(args, fmt.Sprintf("--ldcache=%s", *cli.Ldcache))
        }
+       args = append(args, "--user=0:33")
        args = append(args, "configure")

        if cli.Ldconfig != nil {

2) Set wider permissions on the device files; note that this is less secure!

sudo sed -i 's/660/666/g' /etc/modprobe.d/50-nvidia-default.conf
sudo /sbin/mkinitrd
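For context, the 0660 mode comes from a modprobe option in that file; the line being targeted looks roughly like the following (an assumption — the exact option names and values in 50-nvidia-default.conf may differ per driver version, with 33 being the video GID on openSUSE):

```
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=33 NVreg_DeviceFileMode=0660
```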

Thank you very much for your answer.
So if I understand correctly, I need to:

  • replace the file of nvidia-container-runtime-1.2.1-1 centos package with the one from ubuntu
  • add args = append(args, "--user=0:33") to the main.go file
  • build the package 'nvidia-container-runtime-hook'
  • use either the CentOS- or the Ubuntu-generated package and replace my current nvidia-container-runtime-hook files.

Is this correct?

It's simpler if you don't recompile the hook, at least until I add the option. Here are the steps I used; use at your own risk :)

curl -s -L https://nvidia.github.io/nvidia-container-runtime/centos7/nvidia-container-runtime.repo | sudo tee /etc/zypp/repos.d/nvidia-container-runtime.repo
sudo zypper install nvidia-container-runtime-hook

wget 'https://nvidia.github.io/nvidia-container-runtime/ubuntu16.04/amd64/nvidia-container-runtime_2.0.0+docker17.09.1-1_amd64.deb'
sudo dpkg -X nvidia-container-runtime_2.0.0+docker17.09.1-1_amd64.deb /

echo 'DOCKER_OPTS="--add-runtime nvidia=/usr/bin/nvidia-container-runtime"' | sudo tee /etc/sysconfig/docker

sudo sed -i 's/660/666/g' /etc/modprobe.d/50-nvidia-default.conf
sudo /sbin/mkinitrd

sudo reboot
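The sed step only widens the device-file mode in the modprobe config; to illustrate the rewrite on a sample line (the sample is an assumption, not the file's exact contents):

```shell
# Illustrate the 660 -> 666 rewrite performed on 50-nvidia-default.conf
echo 'options nvidia NVreg_DeviceFileMode=0660' | sed 's/660/666/g'
# -> options nvidia NVreg_DeviceFileMode=0666
```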

Thank you very much. I will try it in the weekend.

Did you try it? Did it work?

Or is it better to wait for official NVIDIA support of the SUSE distros?

Yes, I did, and it works as expected.
I added this line to the hook file:

  • args = append(args, "--user=0:33")

Gave it a first try at creating the openSUSE RPMs; see the referenced PR. If you want to try it yourself, here is how to do it:

git clone https://github.com/dev-zero/nvidia-container-runtime.git -b opensuse-support nvidia-container-runtime-opensuse
git clone https://github.com/dev-zero/nvidia-docker.git -b opensuse-support nvidia-docker-opensuse

leap_version="15.0"

make -C nvidia-container-runtime-opensuse opensuse_leap${leap_version}
make -C nvidia-docker-opensuse opensuse_leap${leap_version}

# on openSUSE Leap 15.0 add the libnvidia-repo from centos7 for now since the CUDA Toolkit repo does not yet contain packages for it:
sudo zypper ar -c 'https://nvidia.github.io/libnvidia-container/centos7/$basearch' nvidia-container-runtime

# on openSUSE Leap 42.3 you can use my fork of the libnvidia-container repo to also build that component natively:
git clone https://github.com/dev-zero/libnvidia-container.git -b opensuse-support libnvidia-container-opensuse
make -C libnvidia-container-opensuse docker-opensuse_leap:42.3
sudo zypper install libnvidia-container-opensuse/*.rpm

sudo zypper install nvidia-{container-runtime,docker}-opensuse/dist/opensuse_leap${leap_version}/*.rpm
# ignore warnings about unsigned packages

# fully reload the systemd configuration and restart the docker daemon
# this is required since we have to change the flags passed to the docker daemon
sudo systemctl daemon-reload
sudo systemctl restart docker
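Note that the last zypper install line relies on shell brace expansion to cover both build trees; expanded (with leap_version=15.0) it reads:

```shell
# Show what the brace expansion in the install step expands to (bash assumed)
echo nvidia-{container-runtime,docker}-opensuse/dist/opensuse_leap15.0/
# -> nvidia-container-runtime-opensuse/dist/opensuse_leap15.0/ nvidia-docker-opensuse/dist/opensuse_leap15.0/
```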

Many thanks @dev-zero for supporting us with a fork for openSUSE. I would love to try it immediately, but on the tryout/lab system I am running openSUSE Tumbleweed, and as far as I understand, your instructions are Leap 42.3 / 15.0 specific and not simply 1:1 transferable to Tumbleweed. Right?

A few days ago I had downloaded the CentOS 7 rpm tarballs, but I was not sure which of the many RPMs within to install, and how to do it.

I would love to support the community by trying it and reporting success or failure, but I am too anxious about doing major damage to my system in case of failure. What do you think?

@dev-zero With your PR the built .rpms install & work on SLES12SP3. Thanks!!

@gsgxnet I guess the RPMs generated for Leap 15.0 should work on Tumbleweed as well, since the distros are not that far apart at the moment and packaging-wise they are not very complex. Since Tumbleweed is a rolling distro, I doubt that NVIDIA will ever support it, since they would in principle have to do regular rebuilds.

@smarguet can you provide some directions for 12SP3? I'm trying to build this for 12SP2

@dev-zero After installing the .rpm packages from your fork on SLES12SP2, I can't start the docker service.
journalctl -u docker.service says:

warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0
Failed to connect to containerd: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
docker.service: Main process exited, code=exited, status=1/FAILURE

Any clue?

@amilamanoj did you follow the instructions above about reloading systemd itself by running systemctl daemon-reload (as root) prior to restarting docker? Is that all the log output (starting from Starting Docker Application Container Engine...)?

@dev-zero Yes I followed the instructions and reloaded systemd itself.
This is the full journalctl -u docker.service log output since starting docker:

systemd[1]: Starting Docker Application Container Engine...
dockerd[18349]: time="2018-08-16T23:20:21+02:00" level=info msg="SUSE:secrets :: enabled"
dockerd[18349]: time="2018-08-16T23:20:21.277635169+02:00" level=info msg="parsed scheme: \"unix\"" module=grpc
dockerd[18349]: time="2018-08-16T23:20:21.277654546+02:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
dockerd[18349]: time="2018-08-16T23:20:21.277679178+02:00" level=info msg="ccResolverWrapper: sending new addresses to cc: [{unix:///run/containerd/containerd.sock 0  <
dockerd[18349]: time="2018-08-16T23:20:21.277695317+02:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
dockerd[18349]: time="2018-08-16T23:20:21.277724862+02:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420221ac0, CONNECTING" module=grpc
dockerd[18349]: time="2018-08-16T23:20:41.278018918+02:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.soc
dockerd[18349]: time="2018-08-16T23:20:41.278058523+02:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420221ac0, TRANSIENT_FAILURE" module=grpc
dockerd[18349]: time="2018-08-16T23:20:41.278181764+02:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420221ac0, CONNECTING" module=grpc
dockerd[18349]: time="2018-08-16T23:21:01.278379739+02:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.soc
dockerd[18349]: time="2018-08-16T23:21:01.278428133+02:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420221ac0, TRANSIENT_FAILURE" module=grpc
dockerd[18349]: time="2018-08-16T23:21:01.278539839+02:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420221ac0, CONNECTING" module=grpc
dockerd[18349]: Failed to connect to containerd: failed to dial "/run/containerd/containerd.sock": context deadline exceeded
systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
systemd[1]: docker.service: Unit entered failed state.
systemd[1]: docker.service: Failed with result 'exit-code'.

When I uninstall nvidia-docker2 package, docker can be started again.

@amilamanoj can you check whether containerd.service is running, or whether there are any logs from containerd.service that might indicate why the service is not running or responding?

@dev-zero is it supposed to be a service just like docker? When I do systemctl status containerd.service, it says not-found (no such file or directory). But the funny thing is: when I uninstall nvidia-docker2 there still seems to be no containerd service, yet docker works fine (I can even run docker images).

Maybe this is due to a problem with my docker installation. I'm using SLES12SP2, but I don't have an internet connection from the server, so I downloaded the docker rpm and a couple of its dependency rpms (including containerd) from here and installed them manually.

I'm using docker 18.06.0-ce, btw. I cloned the original nvidia-docker repos and applied your 3 PRs as patches on top of them before building.

@dev-zero I manually added the nvidia runtime entry to the docker.service file after uninstalling the nvidia-docker2 package and restarted the daemon:
ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime

Now when I use docker run --runtime nvidia, I can use nvidia drivers within containers.
So the problem is only with nvidia-docker2 package.

This is good enough for me to start working for now. Thanks for your help.

@amilamanoj it seems that SLES12SP2 has some important differences compared to SLES12SP3: on SP3 the docker.service depends on containerd.service:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target containerd.socket containerd.service lvm2-monitor.service SuSEfirewall2.service
Requires=containerd.socket containerd.service
[...]
ExecStart=/usr/bin/dockerd --containerd /run/containerd/containerd.sock --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
[...]

To be able to configure the runtimes via /etc/docker/daemon.json, one has to remove the --add-runtime oci=/usr/sbin/docker-runc part, which is why I install the config snippet /usr/lib/systemd/system/docker.service.d/nvidia-docker.conf, which overrides the SUSE/docker default command line with the following:

/usr/bin/dockerd --containerd /run/containerd/containerd.sock $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS

which is likely what is causing the issue for you since you don't have/need containerd.
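For comparison, the runtime registration that nvidia-docker2 normally places in /etc/docker/daemon.json looks like this:

```json
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```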

Can you post your (original) /usr/lib/systemd/system/docker.service?

@dev-zero Yes you're right, there's no --containerd argument in my original docker.service file:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target lvm2-monitor.service SuSEfirewall2.service

[Service]
EnvironmentFile=/etc/sysconfig/docker

# While Docker has support for socket activation (-H fd://), this is not
# enabled by default because enabling socket activation means that on boot your
# containers won't start until someone tries to administer the Docker daemon.
Type=notify
ExecStart=/usr/bin/dockerd --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
ExecReload=/bin/kill -s HUP $MAINPID

...

Still a bit odd since I downloaded and installed docker rpms for SLES12SP3, because there were no docker builds available for SLES12SP2 here.

@dev-zero
I tried to build the rpm on Leap 15 using the procedure you gave. Unfortunately when running

make -C nvidia-container-runtime-opensuse opensuse_leap${leap_version}

I get an error

Step 14/18 : COPY runc/$RUNC_COMMIT/ /tmp/patches/runc
COPY failed: no source files were specified
make[1]: *** [Makefile:135: 17.09.1-opensuse_leap15.0] Error 1
make[1]: Leaving directory '/home/me/src/nv-docker/nvidia-container-runtime-opensuse/runtime'
make: *** [Makefile:35: runtime-opensuse_leap15.0] Error 2
make: Leaving directory '/home/me/src/nv-docker/nvidia-container-runtime-opensuse'

Any ideas on how to solve this? Thanks

Update:

Fixed by changing this line in the Makefile:

runc="$(shell $(MAKE) -s $@-runc)" &&

to:

runc="$(shell $(MAKE) --no-print-directory -s $@-runc)" &&

(without --no-print-directory, the sub-make's "Entering directory" messages end up in the value captured by $(shell))

@torhans I ran into the same problem today on Leap42.3. Thanks for the fix - I can confirm it also works on 42.3.

@torhans @danielorf thanks for the fix and commit. Merged the PR and rebased the opensuse-support branch of my fork of nvidia-container-runtime on latest master.

@torhans since you installed nvidia-docker on openSUSE Leap 15, I guess you managed to install the CUDA drivers first? Did you meet any difficulties? CUDA is not yet available for Leap 15 and I've met a few problems using the version provided for Leap 42.3...
I have a post here; at this point, any help would be much appreciated!

@qmeeus What method did you use to install cuda on Leap 15? Local RPM, runfile, NVIDIA repo in YAST?

@qmeeus
I have installed:
the kernel and X11 drivers, version 390.77, from http://http.download.nvidia.com/opensuse/leap/15.0/
and
the cuda-drivers package, driver 390.30, from http://developer.download.nvidia.com/compute/cuda/repos/opensuse423/

It is important to make sure the versions are kept in sync when updating. I had problems keeping libcuda consistent, which is actually the main reason for using docker.

@danielorf I have followed the method in the nvidia docs, namely using the rpm provided by nvidia for 42.3 (rpm network), but I used rpm and zypper instead of YaST, although I don't expect differences in that regard. I will now try the versions indicated by @torhans! Thanks a lot for sharing.

@torhans thanks, it worked like a charm. For those who happen to go through the same steps as me, one piece of advice: don't bother compiling the samples. After installing cuda, go straight to the nvidia-docker step; it will save you a lot of headaches...

@dev-zero
make: Entering directory '/usr/g/ctuser/nvidia/new/nvidia-docker-opensuse'
docker build --build-arg VERSION_ID="15.0" \
    --build-arg RUNTIME_VERSION="2.0.0-1.docker17.09.1" \
    --build-arg DOCKER_VERSION="docker = 17.09.1_ce" \
    --build-arg PKG_VERS="2.0.3" \
    --build-arg PKG_REV="1.docker17.09.1_ce" \
    -t "nvidia/nvidia-docker2/opensuse/leap:15.0-docker17.09.1.ce" -f Dockerfile.opensuse_leap .
Sending build context to Docker daemon 70.14 kB
Step 1 : ARG VERSION_ID
Please provide a source image with from prior to commit
Makefile:251: recipe for target '17.09.1_ce-opensuse_leap15.0' failed
make: *** [17.09.1_ce-opensuse_leap15.0] Error 1
make: Leaving directory '/usr/g/ctuser/nvidia/new/nvidia-docker-opensuse'
I am installing it on SUSE12SP3; the output is shown above.

So how could I get the base image opensuse_leap15.0?


@Gitwangshuo, the script provided by @dev-zero worked for me. Make sure to remove/comment the lines relating to Leap 42.3. Also, I had to uninstall nvidia-docker (or remove it from the script before running it) because it raised conflicts between docker's daemon.json and the flags of the daemon invocation that I was not able to resolve. Instead, you can use the --runtime flag when running your container (i.e. docker run -d --runtime nvidia container_name).

@qmeeus, sorry, I went down the wrong path. My host OS is SUSE12SP2; the openSUSE Leap 42.3 packages should be more suitable for SUSE12SP2, so I am trying to follow these steps 👍

# on openSUSE Leap 42.3 you can use my fork of the libnvidia-container repo to also build that component natively:
git clone https://github.com/dev-zero/libnvidia-container.git -b opensuse-support libnvidia-container-opensuse
make -C libnvidia-container-opensuse docker-opensuse_leap:42.3
sudo zypper install libnvidia-container-opensuse/*.rpm

However, when I try to build the libnvidia-container:opensuse:42.3 image, I run into a lot of problems:

/bin/bash -c 'export META_NOECHO=echo && make distclean && make CUDA_DIR=$(ls -d /usr/local/cuda-*/) -j"$(nproc)"' returned a non-zero code: 2

@qmeeus Did you install it from source code or from RPMs?
I really need your help.

@dev-zero I have followed your README.md:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
zypper ar https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo nvidia-docker
zypper ref

Retrieving repository 'nvidia-docker' metadata -----------------------------------------------------------------------------------------------[/]
Download (curl) error for 'https://nvidia.github.io/nvidia-docker/opensuse42.3/nvidia-docker.repo/repodata/repomd.xml':
Error code: Curl error 60
Error message: SSL certificate problem: unable to get local issuer certificate

Abort, retry, ignore? [a/r/i/...? shows all options] (a): i
Abort, retry, ignore? [a/r/i/...? shows all options] (a): i
Retrieving repository 'nvidia-docker' metadata ...........................................................................................[error]
Repository 'nvidia-docker' is invalid.
[nvidia-docker|https://nvidia.github.io/nvidia-docker/opensuse42.3/nvidia-docker.repo] Valid metadata not found at specified URL

Does the openSUSE nvidia repo exist?

@Gitwangshuo I installed using rpm, but I am running Leap 15.0, so it was pretty straightforward in my case. I see your last comment; did you manage to build libnvidia-container in the end? If your only problem is with nvidia-docker, then a quick workaround is to not install it and to run with the --runtime flag.

@qmeeus Yes, I want to install it by rpm, but the SUSE remote repo does not work for me:
'https://nvidia.github.io/nvidia-docker/opensuse42.3/nvidia-docker.repo'
I am not sure about it; maybe it is blocked in China.
One more thing I want to ask: how did you get the Leap 15.0 RPMs?

@Gitwangshuo via the first line of the script, i.e. git clone https://github.com/dev-zero/nvidia-container-runtime.git -b opensuse-support nvidia-container-runtime-opensuse

After building, they are located in ./nvidia-container-runtime-opensuse/dist/opensuse_leap15.0

Not sure whether they are compatible with SLES, though.

First, I'd like to thank you all for making nvidia-docker available on openSUSE. I've come a very long way in the process of getting this working, but at the very end I'm hitting a failure.

OS: OpenSUSE Leap 15
Docker: 18.09.0_ce
NVIDIA Card: Quadro M620
NVIDIA Driver: 410.93
CUDA: 10.0

I installed Docker using an "Experimental" package found on software.opensuse.org:
http://download.opensuse.org/repositories/devel:/CaaSP:/Head:/ControllerNode/openSUSE_Leap_15.0/x86_64/docker-18.09.0_ce-lp150.4.2.x86_64.rpm
I've been able to run a number of containers without issue.

I installed the NVIDIA card drivers via the runfile: NVIDIA-Linux-x86_64-410.93.run
Also, without issue. This card is the only GPU in the system.

In addition to this post, I've been using the following as a guide:
https://marmelab.com/blog/2018/03/21/using-nvidia-gpu-within-docker-container.html
Mainly regarding the persistence daemon and the udev rule. The persistence daemon is active on the system, but my first attempts did not include it, so I'm not sure whether it's necessary for this activity (please advise...).

For CUDA, I downloaded the runfile:
cuda_10.0.130_410.48_linux.run
When running it, I only allowed it to install the toolkit and the samples. Running nvidia-smi yields the following:

| NVIDIA-SMI 410.93       Driver Version: 410.93       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M620         On   | 00000000:01:00.0  On |                  N/A |
| N/A   48C    P0    N/A /  N/A |   1468MiB /  1968MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

Per information I gleaned along the way, I've come to believe that doesn't mean that CUDA is actually working, so per the previously mentioned blog I've compiled the samples and got Result = PASS.

To this point, I've got a working graphics driver, working docker, and working CUDA. Per the instructions above from @dev-zero, I've cloned the git repos, but because I'm using a newer version of docker than the code accounts for, I've had to make a couple of tweaks. In nvidia-container-runtime-opensuse/runtime/Makefile (validating the runc fingerprint via "docker info"):

34c34,37
< opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 17.09.1)
---
> opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 18.09.0)
>
> 18.09.0-%-runc:
>       echo "69663f0bd4b60df09991c08812a60108003fa340"

And in nvidia-docker-opensuse/Makefile:

< opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 17.09.1_ce)
---
> opensuse_leap15.0: $(addsuffix -opensuse_leap15.0, 18.09.0_ce)
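The runc fingerprint hard-coded above can be cross-checked against the daemon's own report; a sketch that filters it out of sample `docker info` output (the sample line is an assumption about that release's output format):

```shell
# Extract the runc commit from sample `docker info` output
printf 'runc version: 69663f0bd4b60df09991c08812a60108003fa340\n' \
    | awk '/runc version/ {print $3}'
# -> 69663f0bd4b60df09991c08812a60108003fa340
```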

After making these modifications, the compilation processes complete successfully, as does the install process. After reloading the daemon, restarting the docker process takes noticeably longer than usual, and I get the following errors from the service:

eSubConnStateChange: 0xc42095e9f0, CONNECTING" module=grpc
Jan 28 12:39:50 zeus dockerd[19765]: time="2019-01-28T12:39:50.713328360-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713333172-08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713478303-08:00" level=warning msg="grpc: addrConn.createTransport failed to connect to {unix:///run/containerd/containerd.sock 0  <nil>}. Err :connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\". Reconnecting..." module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713536953-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42095e9f0, TRANSIENT_FAILURE" module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713687894-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420176690, TRANSIENT_FAILURE" module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713682952-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713792660-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc420176690, CONNECTING" module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713891656-08:00" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42095e9f0, CONNECTING" module=grpc
Jan 28 12:40:10 zeus dockerd[19765]: time="2019-01-28T12:40:10.713965683-08:00" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout\"" module=libcontainerd namespace=moby

When I try to run a container with the nvidia runtime I get the following error:

docker: Error response from daemon: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout": unavailable.

I noticed a reference to similar results above with regard to containerd, but haven't been able to use that info as a path to resolution.
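One low-level check that may help with this class of error (a generic sketch, not specific to nvidia-docker): verify that the socket dockerd is being told to dial actually exists and is a socket.

```shell
# Check whether the containerd socket from the error message exists.
# The path comes from the log lines above; [ -S ... ] tests "is a socket".
sock=/run/containerd/containerd.sock
if [ -S "$sock" ]; then
  echo "containerd socket present"
else
  echo "containerd socket missing or not a socket: $sock"
fi
```

If it is missing, containerd itself isn't running (or is listening elsewhere), which would match the dial timeouts in the log.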

Any assistance is greatly appreciated! I feel like I'm really close...

Any thoughts? Anyone...

Any reason why you
1) install docker from an experimental package?
2) install CUDA without using the rpm provided by NVIDIA?

The first step before installing nvidia-docker is to have docker and CUDA installed. Both are available for openSUSE Leap 15...
For nvidia-docker, follow the instructions above.

Answering your questions:

  1. I wanted the latest version of docker, and the only way to get it in packaged form is from that experimental package. Think of it like a PPA on Ubuntu -- not really experimental, but not directly supported by openSUSE. The reason for upgrading in the first place was to get a newer version that supported some docker-compose functionality. If you look at nvidia-docker-opensuse/Makefile you'll see that the supported version is 17.09.1, which is quite old, so anything newer requires tweaks.
  2. If I remember correctly, the rpm file installed incorrect GPU drivers that caused conflicts with my proper drivers. I'm running a Quadro and I believe it was installing a GeForce driver.

To your point, without nvidia-docker, my docker install runs fantastic! I've had zero issues. With the CUDA install I followed some other directions and compiled the examples to check to make sure CUDA was working properly, and according to what I was following, CUDA passes as working properly.

I believe my issue to be very similar to what @amilamanoj ran into on 8/16/2018 (above), but haven't found a resolution as of yet.

I've just done some more digging and came up with a resolution, but also a question. As with the issue @amilamanoj ran into, my issue was also containerd related. By changing the ExecStart line in /usr/lib/systemd/system/docker.service.d/nvidia-docker.conf from:
ExecStart=/usr/bin/dockerd --containerd /run/containerd/containerd.sock $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
to:
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
docker now starts correctly and runs both runc and nvidia runtimes properly.
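For what it's worth, the same effect can be had without editing the packaged file, via a systemd drop-in. This is only a sketch, not something from the thread; the override.conf path is the standard location that `sudo systemctl edit docker` would create:

```
# /etc/systemd/system/docker.service.d/override.conf  (hypothetical drop-in)
# Drops the --containerd flag without touching nvidia-docker.conf directly.
[Service]
# An empty ExecStart= first clears the previously defined ExecStart line.
ExecStart=
ExecStart=/usr/bin/dockerd $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
```

Follow up with `sudo systemctl daemon-reload` and a docker restart; a drop-in also survives package upgrades that overwrite nvidia-docker.conf.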

My question is: apparently I can't run docker with containerd as configured by the nvidia-docker2 package, so should I be concerned? I'm not sure why NVIDIA would prefer to have docker use containerd.

Any thoughts/advice are appreciated!

@sjordahl Thanks for your investigation. I made the same changes as you and was able to get docker working.

```
linux-x1:/home/gaal/X5 # cat /etc/os-release
NAME="openSUSE Tumbleweed"
VERSION="20190412"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20190412"
PRETTY_NAME="openSUSE Tumbleweed"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:tumbleweed:20190412"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"
```

```
linux-x1:/home/gaal/X5 # docker version
Client:
 Version:           18.09.3
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        774a1f4eee66
 Built:             Fri Mar 22 12:00:00 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.09.3
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       774a1f4eee66
  Built:            Fri Mar 22 12:00:00 2019
  OS/Arch:          linux/amd64
  Experimental:     false
linux-x1:/home/gaal/X5 # docker info   
Containers: 6
 Running: 0
 Paused: 0
 Stopped: 6
Images: 74
Server Version: 18.09.3
Storage Driver: btrfs
 Build Version: Btrfs v4.20.1 
 Library Version: 102
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: nvidia oci runc
Default Runtime: oci
Init Binary: docker-init
containerd version: e6b3f5632f50dbc4e9cb6288d911bf4f5e95b18e
runc version: 6635b4f0c6af3810594d2770f662f34ddc15b40d
init version: v0.1.4_catatonit (expected: fec3683b971d9c3ef73f284f176672c44b448662)
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 5.0.7-1-default
Operating System: openSUSE Tumbleweed
OSType: linux
Architecture: x86_64
CPUs: 12
Total Memory: 15.12GiB
Name: linux-x1
ID: ENLF:HOBB:SHEW:YJT3:NQBP:VF4U:3VT5:75QA:DWQJ:4O3T:US2H:EAMP
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support
linux-x1:/home/gaal/X5 # docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Tue Apr 16 12:17:24 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P8    N/A /  N/A |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

Hi,
I forked @dev-zero's opensuse-support branches and merged NVIDIA v1.0.2.
Thanks @dev-zero and all.

It works on openSUSE 15.0 and SLES 12SP3 (both with Docker 18.09.1_ce).

  1. https://github.com/hashio/libnvidia-container
    Please see BUILD.md.
  2. https://github.com/hashio/nvidia-container-runtime
    $ sudo make opensuse_leap15.0 or sles12sp3
  3. https://github.com/hashio/nvidia-docker
    $ sudo make opensuse_leap15.0 or sles12sp3
  4. install all rpms
    $ mkdir rpms
    $ find -name "*.rpm" -exec cp {} rpms \;
    $ cd rpms
    $ rpm -ivh *.rpm
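Step 4 can be rehearsed in a scratch directory first; everything below uses throwaway sandbox paths, not the real build tree (where `find` runs from the checkout root):

```shell
# Sandbox rehearsal of the "collect all rpms" step, isolated via mktemp.
workdir=$(mktemp -d)
mkdir -p "$workdir/build/x86_64" "$workdir/rpms"
touch "$workdir/build/x86_64/libnvidia-container-1.0.2.rpm"   # fake artifact
cd "$workdir"
find build -name "*.rpm" -exec cp {} rpms \;
ls rpms
# → libnvidia-container-1.0.2.rpm
# On the real tree, follow with: cd rpms && sudo rpm -ivh *.rpm
```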

I've used it with containerd, but SUSE's containerd package doesn't ship a systemd unit.
So I added one as nvidia-docker/contrib/suse/containerd.service.
It's not included in the nvidia-docker rpm; you need to copy and activate it yourself.

  1. run docker
    systemctl start containerd.service
    systemctl start docker.service

For SLES12SP3 Build

It also works on openSUSE 15.1 with some modifications.
Docker version: 18.09.1_ce
CUDA: 10.1
NVIDIA driver: 430.14

Many thanks!

@hashio I've looked at the changes in your libnvidia-container fork and somehow fail to see how you can build SLES RPMs with those changes (except when building it without Docker).

Anyway, all three packages are now rebased to their respective latest master and tested with Leap 15.1.

@hashio since I don't have a SLE machine with GPUs and no time for a trial setup of SLE I'll have to leave this part of rebasing to you, sorry :-/

@dev-zero rebased from your branch.
SLES12SP3 can't build now.
Another problem has happened: zypper couldn't connect to the SLES repository inside Docker.
I can't resolve it yet,
but openSUSE 15.0 is fine, many thanks :)

I solved the other SLES problem, but SLES12SP3 couldn't work with CUDA 10.1 in Docker.
I upgraded to SLES12SP4, installed CUDA 10.1, and built with the SLES12SP3 Docker image; then it worked fine.
