Nvidia-Docker stopped working.
I had a JupyterHub instance running with nvidia-docker support, and it worked quite well.
Today I logged into the host system and ran sudo apt-get update and sudo apt-get upgrade, and since then nvidia-docker no longer works. That said, I can't recall whether the upgrade actually changed anything, so it might not be the root cause.
The system runs Debian.
sudo docker run --rm nvidia/cuda:8.0-devel nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=8.0 --pid=25807 /var/lib/docker/overlay2/8127e7486398ec495fc98de2cee1f18e769ee97f43211ccbc455a058d3b3923a/merged]\\\\nnvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\\\n\\\"\"": unknown.
$ uname -a
Linux donna 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux
$ docker version
Client:
Version: 17.12.0-ce
API version: 1.35
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:11:19 2017
OS/Arch: linux/amd64
Server:
Engine:
Version: 17.12.0-ce
API version: 1.35 (minimum version 1.12)
Go version: go1.9.2
Git commit: c97c6d6
Built: Wed Dec 27 20:09:54 2017
OS/Arch: linux/amd64
Experimental: false
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82 Driver Version: 375.82 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 0000:41:00.0 Off | N/A |
| 0% 23C P0 55W / 250W | 0MiB / 11170MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-container-cli -V
version: 1.0.0
build date: 2018-01-11T00:29+00:00
build revision: 4a618459e8ba522d834bb2b4c665847fae8ce0ad
build compiler: x86_64-linux-gnu-gcc-6 6.3.0 20170516
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Sorry for the trouble; it seems I had the wrong sources list installed. To everyone running Debian and hitting this issue: make sure you install the repository list for your distribution from here: https://nvidia.github.io/nvidia-docker/
You were probably using the Ubuntu packages instead of the Debian ones.
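For anyone who wants to check this quickly, here is a minimal sketch that compares the running distribution with what the installed repository list mentions. It assumes the list was saved as /etc/apt/sources.list.d/nvidia-docker.list and that its entries carry the distribution name (for example debian9 or ubuntu16.04) in the URL path; both are assumptions about your setup. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch only: expected_distribution and list_matches_distribution are
# hypothetical helper names, not part of nvidia-docker.

# Print ID followed by VERSION_ID from an os-release file, e.g. "debian9".
# On testing/unstable VERSION_ID may be unset, in which case only ID is printed.
expected_distribution() {
    os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Succeed if at least one non-comment line of the list file
# contains "/<distribution>/" in its URL path.
list_matches_distribution() {
    list_file="$1"
    distribution="$2"
    grep -v '^[[:space:]]*#' "$list_file" | grep -qF "/${distribution}/"
}

# Usage (the list path is an assumption, adjust it to your system):
#   list_matches_distribution /etc/apt/sources.list.d/nvidia-docker.list \
#       "$(expected_distribution)" \
#       || echo "repository list does not mention this distribution"
```

If the check fails on a Debian host, the list probably came from a different distribution's directory and should be replaced with the matching one from the page linked above.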
Could there be another cause?
I have exactly the same issue on the same platform (Debian stretch), but I installed from the right repository.
@khallaghi I believe so. I first got hit by #677, then this one.
My system is not Debian stretch, though, but a mix of testing and unstable.
My workaround was to create /sbin/ldconfig.real as a symlink to /sbin/ldconfig.
Ran into the same problem, thanks @sleveque.
sudo docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: open failed: /sbin/ldconfig.real: no such file or directory\\n\""": unknown.
ERRO[0000] error waiting for container: context canceled
Solution:
ln -s /sbin/ldconfig /sbin/ldconfig.real
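A slightly more defensive variant of that one-liner, as a minimal sketch: it only creates the link when /sbin/ldconfig.real is absent and /sbin/ldconfig is actually executable, so running it twice does no harm. The function name ensure_ldconfig_real is made up for this example, and the two paths are parameters only so it can be tried out in a scratch directory first.

```shell
#!/bin/sh
# Sketch only: ensure_ldconfig_real is a hypothetical helper name.
# Creates <link> as a symlink pointing at <target>, unless <link> already exists.
ensure_ldconfig_real() {
    target="${1:-/sbin/ldconfig}"
    link="${2:-/sbin/ldconfig.real}"

    # -e follows symlinks; -L also catches a dangling link from an earlier attempt.
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "$link already exists, leaving it alone"
        return 0
    fi

    # Refuse to create a link to something that cannot be executed.
    if [ ! -x "$target" ]; then
        echo "$target is missing or not executable, not linking" >&2
        return 1
    fi

    ln -s "$target" "$link"
}

# Usage: save the function to a file and run it as root with the default paths,
# e.g. by appending a call to ensure_ldconfig_real at the end and using sudo sh.
```

Note that this only works around the missing path named in the error message; if the wrong repository list is installed, switching to the right one is the cleaner fix.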