After system reboot the following command reports an error:
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
In the same terminal the following works:
$ sudo systemctl restart docker
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
To reproduce: reboot the system, then run the commands described above.
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.2 LTS
Release: 18.04
Codename: bionic
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0408 08:10:27.472786 16537 nvc.c:281] initializing library context (version=1.0.2, build=ff40da533db929bf515aca59ba4c701a65a35e6b)
I0408 08:10:27.472915 16537 nvc.c:255] using root /
I0408 08:10:27.472929 16537 nvc.c:256] using ldcache /etc/ld.so.cache
I0408 08:10:27.472942 16537 nvc.c:257] using unprivileged user 1001:1001
W0408 08:10:27.476127 16538 nvc.c:186] failed to set inheritable capabilities
W0408 08:10:27.476262 16538 nvc.c:187] skipping kernel modules load due to failure
I0408 08:10:27.477221 16539 driver.c:133] starting driver service
I0408 08:10:27.971684 16537 nvc_info.c:434] requesting driver information with ''
I0408 08:10:27.972293 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.418.39
I0408 08:10:27.972741 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.39
I0408 08:10:27.972872 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.39
I0408 08:10:27.973040 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.418.39
I0408 08:10:27.973195 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.39
I0408 08:10:27.973301 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.39
I0408 08:10:27.973445 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.418.39
I0408 08:10:27.973591 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.39
I0408 08:10:27.973694 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.39
I0408 08:10:27.973801 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.418.39
I0408 08:10:27.973946 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.39
I0408 08:10:27.974047 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.418.39
I0408 08:10:27.974195 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.39
I0408 08:10:27.974313 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.39
I0408 08:10:27.974422 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.39
I0408 08:10:27.974572 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.418.39
I0408 08:10:27.975277 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.418.39
I0408 08:10:27.975675 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.39
I0408 08:10:27.975781 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.39
I0408 08:10:27.975888 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.39
I0408 08:10:27.976002 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.39
W0408 08:10:27.976070 16537 nvc_info.c:303] missing compat32 library libnvidia-ml.so
W0408 08:10:27.976087 16537 nvc_info.c:303] missing compat32 library libnvidia-cfg.so
W0408 08:10:27.976112 16537 nvc_info.c:303] missing compat32 library libcuda.so
W0408 08:10:27.976128 16537 nvc_info.c:303] missing compat32 library libnvidia-opencl.so
W0408 08:10:27.976144 16537 nvc_info.c:303] missing compat32 library libnvidia-ptxjitcompiler.so
W0408 08:10:27.976167 16537 nvc_info.c:303] missing compat32 library libnvidia-fatbinaryloader.so
W0408 08:10:27.976191 16537 nvc_info.c:303] missing compat32 library libnvidia-compiler.so
W0408 08:10:27.976216 16537 nvc_info.c:303] missing compat32 library libvdpau_nvidia.so
W0408 08:10:27.976235 16537 nvc_info.c:303] missing compat32 library libnvidia-encode.so
W0408 08:10:27.976260 16537 nvc_info.c:303] missing compat32 library libnvidia-opticalflow.so
W0408 08:10:27.976284 16537 nvc_info.c:303] missing compat32 library libnvcuvid.so
W0408 08:10:27.976304 16537 nvc_info.c:303] missing compat32 library libnvidia-eglcore.so
W0408 08:10:27.976323 16537 nvc_info.c:303] missing compat32 library libnvidia-glcore.so
W0408 08:10:27.976342 16537 nvc_info.c:303] missing compat32 library libnvidia-tls.so
W0408 08:10:27.976364 16537 nvc_info.c:303] missing compat32 library libnvidia-glsi.so
W0408 08:10:27.976387 16537 nvc_info.c:303] missing compat32 library libnvidia-fbc.so
W0408 08:10:27.976412 16537 nvc_info.c:303] missing compat32 library libnvidia-ifr.so
W0408 08:10:27.976435 16537 nvc_info.c:303] missing compat32 library libGLX_nvidia.so
W0408 08:10:27.976460 16537 nvc_info.c:303] missing compat32 library libEGL_nvidia.so
W0408 08:10:27.976483 16537 nvc_info.c:303] missing compat32 library libGLESv2_nvidia.so
W0408 08:10:27.976506 16537 nvc_info.c:303] missing compat32 library libGLESv1_CM_nvidia.so
I0408 08:10:27.977153 16537 nvc_info.c:229] selecting /usr/bin/nvidia-smi
I0408 08:10:27.977212 16537 nvc_info.c:229] selecting /usr/bin/nvidia-debugdump
I0408 08:10:27.977272 16537 nvc_info.c:229] selecting /usr/bin/nvidia-persistenced
I0408 08:10:27.977333 16537 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-control
I0408 08:10:27.977389 16537 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-server
I0408 08:10:27.977466 16537 nvc_info.c:366] listing device /dev/nvidiactl
I0408 08:10:27.977483 16537 nvc_info.c:366] listing device /dev/nvidia-uvm
I0408 08:10:27.977503 16537 nvc_info.c:366] listing device /dev/nvidia-uvm-tools
I0408 08:10:27.977519 16537 nvc_info.c:366] listing device /dev/nvidia-modeset
W0408 08:10:27.977590 16537 nvc_info.c:274] missing ipc /var/run/nvidia-persistenced/socket
W0408 08:10:27.977635 16537 nvc_info.c:274] missing ipc /tmp/nvidia-mps
I0408 08:10:27.977653 16537 nvc_info.c:490] requesting device information with ''
I0408 08:10:27.984629 16537 nvc_info.c:520] listing device /dev/nvidia0 (GPU-db006224-734b-0e9d-8342-81b6f1c1bfd9 at 00000000:06:00.0)
NVRM version: 418.39
CUDA version: 10.1
Device Index: 0
Device Minor: 0
Model: Quadro P600
Brand: Quadro
GPU UUID: GPU-db006224-734b-0e9d-8342-81b6f1c1bfd9
Bus Location: 00000000:06:00.0
Architecture: 6.1
I0408 08:10:27.984744 16537 nvc.c:318] shutting down library context
I0408 08:10:27.985829 16539 driver.c:192] terminating driver service
I0408 08:10:28.231199 16537 driver.c:233] driver service terminated successfully
uname -a
Linux XXXXXX 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
docker version
Client:
Version: 18.09.4
API version: 1.39
Go version: go1.10.8
Git commit: d14af54266
Built: Wed Mar 27 18:35:44 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 18.09.4
API version: 1.39 (minimum version 1.12)
Go version: go1.10.8
Git commit: d14af54
Built: Wed Mar 27 18:01:48 2019
OS/Arch: linux/amd64
Experimental: false
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-============================-===================-===================-=============================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-container-tools 1.0.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.2-1 amd64 NVIDIA container runtime library
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-container-runtime 2.0.0+docker18.09.4 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hoo 1.4.0-1 amd64 NVIDIA container runtime hook
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.0.3+docker18.09.4 all nvidia-docker CLI wrapper
un nvidia-legacy-340xx-vdpau-dr <none> <none> (no description available)
un nvidia-prime <none> <none> (no description available)
un nvidia-vdpau-driver <none> <none> (no description available)
nvidia-container-cli -V
version: 1.0.2
build date: 2019-03-26T03:58+00:00
build revision: ff40da533db929bf515aca59ba4c701a65a35e6b
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Sorry for the delay.
It seems your Docker daemon isn't set up properly. Are you able to reproduce this behavior reliably?
What is the content of /etc/docker/daemon.json?
Hi there,
The bug reproduces on every reboot. I wonder whether there is a conflict in the daemons' initialization order.
$ cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
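As a sanity check (a sketch only; `/tmp/daemon.json` stands in for the real `/etc/docker/daemon.json`), a malformed daemon.json would silently prevent Docker from registering custom runtimes, so validating the JSON syntax is a quick first step:

```shell
# Write a copy of the daemon.json shown above to a scratch path,
# then validate it with Python's built-in JSON parser.
cat > /tmp/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
if python3 -m json.tool /tmp/daemon.json > /dev/null; then
  echo "daemon.json: valid JSON"
fi
```

The file above parses cleanly, so the config itself looks fine; after a reboot, `docker info` can confirm whether the running daemon actually picked up the `nvidia` runtime.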
Hmm, do you have the contents of the docker systemd unit file?
cat /etc/systemd/system/sockets.target.wants/docker.socket
[Unit]
Description=Docker Socket for the API
PartOf=docker.service
[Socket]
ListenStream=/var/run/docker.sock
SocketMode=0660
SocketUser=root
SocketGroup=docker
[Install]
WantedBy=sockets.target
cat /etc/systemd/system/multi-user.target.wants/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always
# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3
# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this option.
TasksMax=infinity
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
[Install]
WantedBy=multi-user.target
Sorry for the long delay.
Configuring the nvidia runtime in your systemd unit file should solve your problem.
See an example here: https://github.com/NVIDIA/nvidia-container-runtime#systemd-drop-in-file
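For reference, the drop-in from that README amounts to roughly the following (paths assume a standard apt install of docker-ce and nvidia-container-runtime; adjust to your system):

```shell
# Create a systemd drop-in that overrides dockerd's start command,
# registering the nvidia runtime on the command line instead of (or in
# addition to) daemon.json. The empty ExecStart= clears the packaged
# command before adding the replacement.
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/override.conf <<EOF
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd --host=fd:// --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker
```

Because the runtime is baked into the unit file, it no longer depends on the daemon re-reading daemon.json at the right moment during boot.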
Same issue here. Answer above from @RenaudWasTaken doesn't solve it for me.
@LuisAyuso How about you?
Hi,
I applied the fix from https://github.com/NVIDIA/nvidia-container-runtime#systemd-drop-in-file,
but after restarting the system, Docker still does not find the nvidia runtime.
$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
When restarting Docker manually, everything seems to work fine:
$ sudo systemctl restart docker
$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
Tue May 7 10:48:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P600 Off | 00000000:06:00.0 Off | N/A |
| 30% 43C P0 N/A / N/A | 0MiB / 1999MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Therefore, the issue remains open.
I reinstalled docker and nvidia-docker (purging everything first, as in: link) and also removed snap version of docker (sudo snap remove docker) that was installed on my system. Works fine now, even after reboot.
I reinstalled the services as well (I had to work out the right package versions with `apt-cache madison`), but it is working now.
I came to the conclusion that the error was caused by a snap installation of Docker competing for the service at system boot. I had no idea that Docker could be installed from snap, or how it made its way onto the system. Nevertheless, the issue is gone and nvidia containers work correctly.
thanks @RenaudWasTaken and @krolikowskib for your help.
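The snap/apt conflict described above can be checked with a sketch like this (assumes a Debian/Ubuntu system; `docker-ce` is the usual apt package name):

```shell
# Check whether Docker is installed both as a snap and as a deb package;
# two daemons competing for the socket at boot can cause exactly this race.
if snap list docker >/dev/null 2>&1; then
  echo "snap docker installed -- consider: sudo snap remove docker"
else
  echo "no snap docker found"
fi
dpkg -l docker-ce 2>/dev/null | grep '^ii' || echo "docker-ce deb not installed"
```

If both show up, removing one of the two (here, the snap) leaves a single daemon to own `/var/run/docker.sock` at boot.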