Cannot run any containers with --gpus. The command below results in the following error:
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"process_linux.go:432:
running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: ,
stderr: nvidia-container-cli: detection error: driver error: failed to process request\\\\n\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Run docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I1105 09:36:05.700922 25049 nvc.c:281] initializing library context (version=1.0.6, build=0000000000000000000000000000000000000000)
I1105 09:36:05.701033 25049 nvc.c:255] using root /
I1105 09:36:05.701054 25049 nvc.c:256] using ldcache /etc/ld.so.cache
I1105 09:36:05.701068 25049 nvc.c:257] using unprivileged user 65534:65534
I1105 09:36:05.704115 25050 nvc.c:191] loading kernel module nvidia
I1105 09:36:05.704692 25050 nvc.c:203] loading kernel module nvidia_uvm
I1105 09:36:05.705045 25050 nvc.c:211] loading kernel module nvidia_modeset
I1105 09:36:05.705877 25051 driver.c:133] starting driver service
I1105 09:36:05.763137 25049 nvc_info.c:437] requesting driver information with ''
nvidia-container-cli: detection error: driver error: failed to process request
I1105 09:36:05.763273 25049 nvc.c:318] shutting down library context
W1105 09:36:05.773442 25049 driver.c:220] terminating driver service (forced)
I1105 09:36:05.855677 25049 driver.c:233] driver service terminated with signal 9
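Most of the debug output above is informational. Here is a minimal sketch for picking out the lines that matter, assuming the format shown above: log lines start with I (info) or W (warning) followed by a date, time and PID, while the CLI's own error message has no such prefix. The name nvc_log_problems is made up for illustration and is not part of any NVIDIA tool.

```shell
# Hypothetical helper: read nvidia-container-cli debug output on stdin and
# print everything that is not an informational (I-prefixed) log line.
nvc_log_problems() {
    # Info lines look like: I1105 09:36:05.700922 25049 nvc.c:281] ...
    # The second grep drops the "-- WARNING, the following logs ..." banner.
    grep -Ev '^I[0-9]{4} [0-9:.]+ +[0-9]+ ' \
        | grep -v '^-- WARNING, the following logs'
}
```

If the output has been saved to a file (say nvc-debug.log, a hypothetical name), `nvc_log_problems < nvc-debug.log` should leave only the "detection error" line and the warning about the driver service being terminated.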
[X] Kernel version from uname -a
5.3.0-20-generic #21+system76~1572310493~18.04~b3805b2-Ubuntu
[X] NVIDIA packages version from dpkg -l '*nvidia*' _or_ rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=========================================-=========================-=========================-========================================================================================
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-common <none> <none> (no description available)
ii libnvidia-common-435 435.21-1pop1~1571925584~1 all Shared files used by the NVIDIA libraries
rc libnvidia-compute-418:amd64 430.34-1pop1~1563200531~1 amd64 Transitional package for libnvidia-compute-430
rc libnvidia-compute-430:amd64 435.21-1pop1~1567200870~1 amd64 Transitional package for libnvidia-compute-435
ii libnvidia-compute-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA libcompute package
ii libnvidia-compute-435:i386 435.21-1pop1~1571925584~1 i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.0.6-1pop1~1571281295~18 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.6-1pop1~1571281295~18 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-435:i386 435.21-1pop1~1571925584~1 i386 NVIDIA Video Decoding runtime libraries
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-435:amd64 435.21-1pop1~1571925584~1 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-435:i386 435.21-1pop1~1571925584~1 i386 NVENC Video Encoding runtime library
un libnvidia-fbc1 <none> <none> (no description available)
ii libnvidia-fbc1-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-435:i386 435.21-1pop1~1571925584~1 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl <none> <none> (no description available)
ii libnvidia-gl-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-435:i386 435.21-1pop1~1571925584~1 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ifr1 <none> <none> (no description available)
ii libnvidia-ifr1-435:amd64 435.21-1pop1~1571925584~1 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii libnvidia-ifr1-435:i386 435.21-1pop1~1571925584~1 i386 NVIDIA OpenGL-based Inband Frame Readback runtime library
un libnvidia-ml1 <none> <none> (no description available)
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
ii nvidia-compute-utils-435 435.21-1pop1~1571925584~1 amd64 NVIDIA compute utilities
ii nvidia-container-runtime 3.1.4-0pop1~1569270714~18 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook <none> <none> (no description available)
ii nvidia-container-toolkit 1.0.5-0pop1~1569270707~18 amd64 NVIDIA container runtime hook
ii nvidia-dkms-435 435.21-1pop1~1571925584~1 amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
ii nvidia-driver-435 435.21-1pop1~1571925584~1 amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
ii nvidia-kernel-common-435 435.21-1pop1~1571925584~1 amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
ii nvidia-kernel-source-435 435.21-1pop1~1571925584~1 amd64 NVIDIA kernel source package
un nvidia-legacy-340xx-vdpau-driver <none> <none> (no description available)
un nvidia-opencl-icd <none> <none> (no description available)
un nvidia-persistenced <none> <none> (no description available)
un nvidia-prime <none> <none> (no description available)
ii nvidia-settings 390.77-0ubuntu0.18.04.1 amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-435 435.21-1pop1~1571925584~1 amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver <none> <none> (no description available)
ii system76-driver-nvidia 19.04.20~1572357231~18.04 all Latest nvidia driver for System76 computers
ii xserver-xorg-video-nvidia-435 435.21-1pop1~1571925584~1 amd64 NVIDIA binary Xorg driver
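Most entries in the list above are "un" (not installed) or "rc" (removed, config files remain), which makes the version mix hard to see. Here is a small sketch that keeps only fully installed packages, assuming standard `dpkg -l` output with the status in the first column. The function name installed_pkgs is hypothetical.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print
# "name version" for packages that are fully installed (status "ii").
installed_pkgs() {
    awk '$1 == "ii" { print $2, $3 }'
}
```

Used as `dpkg -l '*nvidia*' | installed_pkgs`, this should make it easier to compare the driver packages (435.21 here) against the container packages (libnvidia-container 1.0.6, nvidia-container-toolkit 1.0.5).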
nvidia-container-cli -V
build date: 2019-10-17T03:23+00:00
build revision: 0000000000000000000000000000000000000000
build compiler: x86_64-linux-gnu-gcc-7 7.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -I/usr/include/tirpc -g -O2 -fdebug-prefix-map=/build/libnvidia-container-Q2bRCn/libnvidia-container-1.0.6=. -fstack-protector-strong -Wformat -Werror=format-security -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-Bsymbolic-functions -Wl,-z,relro
-- WARNING, the following logs are for debugging purposes only --
I1105 09:17:34.548451 19203 nvc.c:281] initializing library context (version=1.0.6, build=0000000000000000000000000000000000000000)
I1105 09:17:34.548533 19203 nvc.c:255] using root /
I1105 09:17:34.548545 19203 nvc.c:256] using ldcache /etc/ld.so.cache
I1105 09:17:34.548556 19203 nvc.c:257] using unprivileged user 65534:65534
I1105 09:17:34.550684 19209 nvc.c:191] loading kernel module nvidia
I1105 09:17:34.551097 19209 nvc.c:203] loading kernel module nvidia_uvm
I1105 09:17:34.551337 19209 nvc.c:211] loading kernel module nvidia_modeset
I1105 09:17:34.551901 19210 driver.c:133] starting driver service
I1105 09:17:34.612315 19203 nvc_container.c:364] configuring container with 'compute utility supervised'
I1105 09:17:34.612713 19203 nvc_container.c:212] selecting /var/lib/docker/overlay2/a69837ca5f4f81a5d68b974b6e48b03ab82e17fa7d24c2875e9df8bb58040ddd/merged/usr/local/cuda-10.0/compat/libcuda.so.410.104
I1105 09:17:34.612832 19203 nvc_container.c:212] selecting /var/lib/docker/overlay2/a69837ca5f4f81a5d68b974b6e48b03ab82e17fa7d24c2875e9df8bb58040ddd/merged/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.104
I1105 09:17:34.612903 19203 nvc_container.c:212] selecting /var/lib/docker/overlay2/a69837ca5f4f81a5d68b974b6e48b03ab82e17fa7d24c2875e9df8bb58040ddd/merged/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.104
I1105 09:17:34.613167 19203 nvc_container.c:384] setting pid to 19190
I1105 09:17:34.613190 19203 nvc_container.c:385] setting rootfs to /var/lib/docker/overlay2/a69837ca5f4f81a5d68b974b6e48b03ab82e17fa7d24c2875e9df8bb58040ddd/merged
I1105 09:17:34.613206 19203 nvc_container.c:386] setting owner to 0:0
I1105 09:17:34.613234 19203 nvc_container.c:387] setting bins directory to /usr/bin
I1105 09:17:34.613251 19203 nvc_container.c:388] setting libs directory to /usr/lib/x86_64-linux-gnu
I1105 09:17:34.613268 19203 nvc_container.c:389] setting libs32 directory to /usr/lib/i386-linux-gnu
I1105 09:17:34.613284 19203 nvc_container.c:390] setting cudart directory to /usr/local/cuda
I1105 09:17:34.613300 19203 nvc_container.c:391] setting ldconfig to @/sbin/ldconfig.real (host relative)
I1105 09:17:34.613316 19203 nvc_container.c:392] setting mount namespace to /proc/19190/ns/mnt
I1105 09:17:34.613332 19203 nvc_container.c:394] setting devices cgroup to /sys/fs/cgroup/devices/docker/b64f8ae6075dd67d1927dfe37b42112a266a7288372633b594ca14b7c56eda3a
I1105 09:17:34.613356 19203 nvc_info.c:437] requesting driver information with ''
I1105 09:17:34.613470 19203 nvc.c:318] shutting down library context
W1105 09:17:34.623628 19203 driver.c:220] terminating driver service (forced)
I1105 09:17:34.708486 19203 driver.c:233] driver service terminated with signal 9
docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Fixed by downgrading to 430.50 and using the official NVIDIA container repo rather than the one in Pop! OS. For vanilla Ubuntu this is straightforward, but for Pop! OS you need to add and pin both the Graphics Driver PPA and the nvidia-docker repo, since the 430 driver in Pop is just an alias for 435.
sudo add-apt-repository ppa:graphics-drivers/ppa
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
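The `distribution` line above sources /etc/os-release and concatenates ID and VERSION_ID. On Pop! OS this probably yields something like pop18.04 rather than ubuntu18.04 (an assumption, based on Pop! OS setting ID=pop), so it may be worth printing the value before it goes into the URL. Whether NVIDIA publishes a list for that value is not confirmed here. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: compute the same $ID$VERSION_ID string as the
# command above, from an os-release file given as the first argument
# (defaults to /etc/os-release). Runs in a subshell so that ID and
# VERSION_ID do not leak into the calling shell.
distribution_from() {
    ( . "${1:-/etc/os-release}" && echo "$ID$VERSION_ID" )
}
```

Running `distribution_from` with no argument prints what the curl command would use. If the result is not a distribution the repo serves, setting `distribution` by hand is one option, though that is a guess and not something this thread confirms.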
Create /etc/apt/preferences.d/nvidia-docker-pin-1002 containing:
Package: *
Pin: origin nvidia.github.io
Pin-Priority: 1002
and /etc/apt/preferences.d/nvidia-ppa-pin-1002 containing:
Package: *
Pin: release o=LP-PPA-graphics-drivers
Pin-Priority: 1002
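A pin priority above 1000 lets apt pick a version from the pinned origin even when that means a downgrade, which is what the move from 435 to 430 needs. Here is a sketch that writes both pin files, assuming the exact contents given above. The function name write_nvidia_pins is made up, and the target directory is a parameter so the files can be reviewed in a scratch directory before they are copied into place.

```shell
# Hypothetical helper: write the two pin files described above into a
# directory (first argument). The real location is /etc/apt/preferences.d,
# which needs root to write to.
write_nvidia_pins() {
    dir="${1:-/etc/apt/preferences.d}"
    printf 'Package: *\nPin: origin nvidia.github.io\nPin-Priority: 1002\n' \
        > "$dir/nvidia-docker-pin-1002"
    printf 'Package: *\nPin: release o=LP-PPA-graphics-drivers\nPin-Priority: 1002\n' \
        > "$dir/nvidia-ppa-pin-1002"
}
```

For example, `write_nvidia_pins /tmp/pins` followed by `sudo cp /tmp/pins/* /etc/apt/preferences.d/`. Afterwards, `apt-cache policy nvidia-container-toolkit` should show which origin wins.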
sudo apt install nvidia-driver-430
# Restart your machine here, then continue:
sudo apt install nvidia-container-toolkit
sudo systemctl restart docker
Once this is fixed in Pop! OS you can remove the repo and PPA (don't forget to remove the pinning).
Closing since this seems to be specific to Pop! OS.
Exact same issue on Pop! OS with NVIDIA driver 440.59, and previously with 440.44:
ii nvidia-driver-440 440.59-1pop1~1584480240~1 amd64
ii nvidia-container-toolkit 1.0.5-0pop1~1569270707~18 amd64
The solution seems to be to use the NVIDIA repo for nvidia-container-toolkit, not the Pop one. The issue is in Pop's packaged nvidia-container-toolkit, as suggested in:
https://github.com/pop-os/nvidia-graphics-drivers/issues/31
Follow the steps in:
https://github.com/pop-os/nvidia-container-toolkit/issues/1
but only reinstall the toolkit:
sudo apt install nvidia-container-toolkit
The driver from Pop works fine, so there is no need to reinstall it.
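To check whether the reinstall actually replaced the Pop package, one rough heuristic is the version string: the Pop builds listed in this thread carry "pop" in it (1.0.5-0pop1~1569270707~18). This sketch assumes that naming holds; the function name is_pop_build is hypothetical.

```shell
# Hypothetical helper: succeed if a package version string looks like a
# Pop! OS build, i.e. it contains "pop" as in 1.0.5-0pop1~1569270707~18.
is_pop_build() {
    case "$1" in
        *pop*) return 0 ;;
        *)     return 1 ;;
    esac
}
```

Example: `is_pop_build "$(dpkg-query -W -f='${Version}' nvidia-container-toolkit)" && echo "still the Pop package"`.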
Summary of the steps from the most helpful comment above:
1. Add the repo and PPA
2. Then pin the repo and PPA by creating /etc/apt/preferences.d/nvidia-docker-pin-1002 and /etc/apt/preferences.d/nvidia-ppa-pin-1002
3. Install the driver
4. Restart your machine
5. Install nvidia-container-toolkit and restart docker