nvidia-docker: "Unknown runtime specified nvidia" after system reboot; works after restarting the Docker daemon.

Created on 8 Apr 2019 · 9 comments · Source: NVIDIA/nvidia-docker

1. Issue or feature description

After a system reboot, the following command reports an error:

$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.

In the same terminal, the following works:

$ sudo systemctl restart docker
$ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi

2. Steps to reproduce the issue

Reboot the system.
Run the commands described above.

3. Information

  • Linux Distribution
lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:    18.04
Codename:   bionic
  • [ ] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --

I0408 08:10:27.472786 16537 nvc.c:281] initializing library context (version=1.0.2, build=ff40da533db929bf515aca59ba4c701a65a35e6b)
I0408 08:10:27.472915 16537 nvc.c:255] using root /
I0408 08:10:27.472929 16537 nvc.c:256] using ldcache /etc/ld.so.cache
I0408 08:10:27.472942 16537 nvc.c:257] using unprivileged user 1001:1001
W0408 08:10:27.476127 16538 nvc.c:186] failed to set inheritable capabilities
W0408 08:10:27.476262 16538 nvc.c:187] skipping kernel modules load due to failure
I0408 08:10:27.477221 16539 driver.c:133] starting driver service
I0408 08:10:27.971684 16537 nvc_info.c:434] requesting driver information with ''
I0408 08:10:27.972293 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.418.39
I0408 08:10:27.972741 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.418.39
I0408 08:10:27.972872 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.39
I0408 08:10:27.973040 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.418.39
I0408 08:10:27.973195 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.39
I0408 08:10:27.973301 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.39
I0408 08:10:27.973445 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.418.39
I0408 08:10:27.973591 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.418.39
I0408 08:10:27.973694 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.418.39
I0408 08:10:27.973801 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.418.39
I0408 08:10:27.973946 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.39
I0408 08:10:27.974047 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.418.39
I0408 08:10:27.974195 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.418.39
I0408 08:10:27.974313 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.39
I0408 08:10:27.974422 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.39
I0408 08:10:27.974572 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.418.39
I0408 08:10:27.975277 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.418.39
I0408 08:10:27.975675 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.418.39
I0408 08:10:27.975781 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.418.39
I0408 08:10:27.975888 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.418.39
I0408 08:10:27.976002 16537 nvc_info.c:148] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.418.39
W0408 08:10:27.976070 16537 nvc_info.c:303] missing compat32 library libnvidia-ml.so
W0408 08:10:27.976087 16537 nvc_info.c:303] missing compat32 library libnvidia-cfg.so
W0408 08:10:27.976112 16537 nvc_info.c:303] missing compat32 library libcuda.so
W0408 08:10:27.976128 16537 nvc_info.c:303] missing compat32 library libnvidia-opencl.so
W0408 08:10:27.976144 16537 nvc_info.c:303] missing compat32 library libnvidia-ptxjitcompiler.so
W0408 08:10:27.976167 16537 nvc_info.c:303] missing compat32 library libnvidia-fatbinaryloader.so
W0408 08:10:27.976191 16537 nvc_info.c:303] missing compat32 library libnvidia-compiler.so
W0408 08:10:27.976216 16537 nvc_info.c:303] missing compat32 library libvdpau_nvidia.so
W0408 08:10:27.976235 16537 nvc_info.c:303] missing compat32 library libnvidia-encode.so
W0408 08:10:27.976260 16537 nvc_info.c:303] missing compat32 library libnvidia-opticalflow.so
W0408 08:10:27.976284 16537 nvc_info.c:303] missing compat32 library libnvcuvid.so
W0408 08:10:27.976304 16537 nvc_info.c:303] missing compat32 library libnvidia-eglcore.so
W0408 08:10:27.976323 16537 nvc_info.c:303] missing compat32 library libnvidia-glcore.so
W0408 08:10:27.976342 16537 nvc_info.c:303] missing compat32 library libnvidia-tls.so
W0408 08:10:27.976364 16537 nvc_info.c:303] missing compat32 library libnvidia-glsi.so
W0408 08:10:27.976387 16537 nvc_info.c:303] missing compat32 library libnvidia-fbc.so
W0408 08:10:27.976412 16537 nvc_info.c:303] missing compat32 library libnvidia-ifr.so
W0408 08:10:27.976435 16537 nvc_info.c:303] missing compat32 library libGLX_nvidia.so
W0408 08:10:27.976460 16537 nvc_info.c:303] missing compat32 library libEGL_nvidia.so
W0408 08:10:27.976483 16537 nvc_info.c:303] missing compat32 library libGLESv2_nvidia.so
W0408 08:10:27.976506 16537 nvc_info.c:303] missing compat32 library libGLESv1_CM_nvidia.so
I0408 08:10:27.977153 16537 nvc_info.c:229] selecting /usr/bin/nvidia-smi
I0408 08:10:27.977212 16537 nvc_info.c:229] selecting /usr/bin/nvidia-debugdump
I0408 08:10:27.977272 16537 nvc_info.c:229] selecting /usr/bin/nvidia-persistenced
I0408 08:10:27.977333 16537 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-control
I0408 08:10:27.977389 16537 nvc_info.c:229] selecting /usr/bin/nvidia-cuda-mps-server
I0408 08:10:27.977466 16537 nvc_info.c:366] listing device /dev/nvidiactl
I0408 08:10:27.977483 16537 nvc_info.c:366] listing device /dev/nvidia-uvm
I0408 08:10:27.977503 16537 nvc_info.c:366] listing device /dev/nvidia-uvm-tools
I0408 08:10:27.977519 16537 nvc_info.c:366] listing device /dev/nvidia-modeset
W0408 08:10:27.977590 16537 nvc_info.c:274] missing ipc /var/run/nvidia-persistenced/socket
W0408 08:10:27.977635 16537 nvc_info.c:274] missing ipc /tmp/nvidia-mps
I0408 08:10:27.977653 16537 nvc_info.c:490] requesting device information with ''
I0408 08:10:27.984629 16537 nvc_info.c:520] listing device /dev/nvidia0 (GPU-db006224-734b-0e9d-8342-81b6f1c1bfd9 at 00000000:06:00.0)
NVRM version:   418.39
CUDA version:   10.1

Device Index:   0
Device Minor:   0
Model:          Quadro P600
Brand:          Quadro
GPU UUID:       GPU-db006224-734b-0e9d-8342-81b6f1c1bfd9
Bus Location:   00000000:06:00.0
Architecture:   6.1
I0408 08:10:27.984744 16537 nvc.c:318] shutting down library context
I0408 08:10:27.985829 16539 driver.c:192] terminating driver service
I0408 08:10:28.231199 16537 driver.c:233] driver service terminated successfully
  • [ ] Kernel version from uname -a
Linux XXXXXX 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • [ ] Docker version from docker version
Client:
 Version:           18.09.4
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        d14af54266
 Built:             Wed Mar 27 18:35:44 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.4
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       d14af54
  Built:            Wed Mar 27 18:01:48 2019
  OS/Arch:          linux/amd64
  Experimental:     false
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' _or_ rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                         Version             Architecture        Description
+++-============================-===================-===================-=============================================================
un  libgldispatch0-nvidia        <none>              <none>              (no description available)
ii  libnvidia-container-tools    1.0.2-1             amd64               NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64   1.0.2-1             amd64               NVIDIA container runtime library
un  nvidia-304                   <none>              <none>              (no description available)
un  nvidia-340                   <none>              <none>              (no description available)
un  nvidia-384                   <none>              <none>              (no description available)
un  nvidia-common                <none>              <none>              (no description available)
ii  nvidia-container-runtime     2.0.0+docker18.09.4 amd64               NVIDIA container runtime
ii  nvidia-container-runtime-hoo 1.4.0-1             amd64               NVIDIA container runtime hook
un  nvidia-docker                <none>              <none>              (no description available)
ii  nvidia-docker2               2.0.3+docker18.09.4 all                 nvidia-docker CLI wrapper
un  nvidia-legacy-340xx-vdpau-dr <none>              <none>              (no description available)
un  nvidia-prime                 <none>              <none>              (no description available)
un  nvidia-vdpau-driver          <none>              <none>              (no description available)
  • [x] NVIDIA container library version from nvidia-container-cli -V
version: 1.0.2
build date: 2019-03-26T03:58+00:00
build revision: ff40da533db929bf515aca59ba4c701a65a35e6b
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

All 9 comments

Sorry for the delay,
it seems like your Docker daemon isn't set up properly. Are you able to reproduce this behavior reliably?
What is the content of /etc/docker/daemon.json?

Hi there,

The bug reproduces on every reboot. I am wondering if there is a conflict in the daemons' initialization order.

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
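
As a quick check of the initialization-order theory, you can compare what the running daemon has actually registered before and after the manual restart. This sketch uses only standard docker and journalctl commands:

$ # List the runtimes the live daemon knows about; "nvidia" should appear here
$ docker info --format '{{json .Runtimes}}'
$ # Inspect how dockerd came up on the current boot
$ journalctl -b -u docker.service --no-pager | head -n 50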

Hmm, do you have the contents of the docker systemd unit file?

$ cat /etc/systemd/system/sockets.target.wants/docker.socket
[Unit]
Description=Docker Socket for the API
PartOf=docker.service

[Socket]
ListenStream=/var/run/docker.sock
SocketMode=0660
SocketUser=root
SocketGroup=docker

[Install]
WantedBy=sockets.target
$ cat /etc/systemd/system/multi-user.target.wants/docker.service
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
BindsTo=containerd.service
After=network-online.target firewalld.service containerd.service
Wants=network-online.target
Requires=docker.socket

[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ExecReload=/bin/kill -s HUP $MAINPID
TimeoutSec=0
RestartSec=2
Restart=always

# Note that StartLimit* options were moved from "Service" to "Unit" in systemd 229.
# Both the old, and new location are accepted by systemd 229 and up, so using the old location
# to make them work for either version of systemd.
StartLimitBurst=3

# Note that StartLimitInterval was renamed to StartLimitIntervalSec in systemd 230.
# Both the old, and new name are accepted by systemd 230 and up, so using the old name to make
# this option work for either version of systemd.
StartLimitInterval=60s

# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity

# Comment TasksMax if your systemd version does not supports it.
# Only systemd 226 and above support this option.
TasksMax=infinity

# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes

# kill only the docker process, not all processes in the cgroup
KillMode=process

[Install]
WantedBy=multi-user.target

Sorry for the long delay.
Configuring the runtime in your unit file should solve your problem.
See an example here: https://github.com/NVIDIA/nvidia-container-runtime#systemd-drop-in-file
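
For reference, a minimal sketch of such a drop-in, assuming the ExecStart shown in the unit file above and the usual runtime binary location /usr/bin/nvidia-container-runtime (verify yours with: which nvidia-container-runtime):

$ sudo mkdir -p /etc/systemd/system/docker.service.d
$ sudo tee /etc/systemd/system/docker.service.d/override.conf <<'EOF'
[Service]
# The empty ExecStart= clears the original command before replacing it
ExecStart=
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --add-runtime=nvidia=/usr/bin/nvidia-container-runtime
EOF
$ sudo systemctl daemon-reload
$ sudo systemctl restart docker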

Same issue here. Answer above from @RenaudWasTaken doesn't solve it for me.
@LuisAyuso How about you?

Hi,
I did use the fix in https://github.com/NVIDIA/nvidia-container-runtime#systemd-drop-in-file

After restarting the system, Docker still does not find the nvidia runtime.

$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.

When restarting Docker manually, everything seems to work OK.

$ sudo systemctl restart docker

$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
Tue May  7 10:48:46 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P600         Off  | 00000000:06:00.0 Off |                  N/A |
| 30%   43C    P0    N/A /  N/A |      0MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Therefore, the issue remains open.

I reinstalled docker and nvidia-docker (purging everything first, as in: link) and also removed the snap version of docker (sudo snap remove docker) that was installed on my system. It works fine now, even after reboot.
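
For anyone following along, the reinstall amounts to something like the sketch below. The NVIDIA package names are taken from the dpkg listing earlier in this issue; docker-ce/docker-ce-cli are the usual apt package names for Docker CE 18.09, and the exact set on your machine may differ:

$ sudo snap remove docker          # remove the competing snap install
$ sudo apt-get purge nvidia-docker2 nvidia-container-runtime libnvidia-container-tools libnvidia-container1
$ sudo apt-get purge docker-ce docker-ce-cli
$ sudo apt-get install docker-ce nvidia-docker2
$ sudo systemctl restart docker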

I reinstalled the services as well; I had to pin the right package versions with apt-cache madison, but it is working now.
I came to the conclusion that the error was produced by a Docker snap installation that competed for the service at system boot. I had no idea that Docker could be installed from snap, or how it made its way onto the system. Nevertheless, the issue is no longer there and the NVIDIA containers work correctly.
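
If you suspect the same snap-vs-apt conflict, a few standard commands make it visible (snap.docker.dockerd.service is, as far as I know, the service name the docker snap registers):

$ snap list | grep docker                        # is a docker snap installed at all?
$ systemctl list-unit-files | grep -i docker     # look for both docker.service and snap.docker.dockerd.service
$ ps -ef | grep '[d]ockerd'                      # which dockerd binary is actually running?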

thanks @RenaudWasTaken and @krolikowskib for your help.
