Missing libnvidia-ml.so and libcublas.so.9.0 libraries in docker container.
My system is Ubuntu 18.10 and I tried with nvidia drivers 390, 396 and 410.
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
This also holds for the TensorFlow docker images. When I run the cuda image in interactive mode and try to import TensorFlow via python, it says that libcublas.so.9.0 is not found, although I can see it in the /usr/local/cuda/lib64 directory.
Everything works fine on the host machine, though.
uname -a
Linux box 4.18.0-10-generic #11-Ubuntu SMP Thu Oct 11 15:13:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
dmesg
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Fri Nov 2 11:09:45 2018
Driver Version : 410.73
CUDA Version : 10.0
Attached GPUs : 1
GPU 00000000:65:00.0
Product Name : GeForce GTX 1080 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-14bfddbd-9230-c05e-fa52-d468af601fc4
Minor Number : 0
VBIOS Version : 86.02.39.00.2E
MultiGPU Board : No
Board ID : 0x6500
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x65
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 00000000:65:00.0
Sub System Id : 0x147019DA
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 3000 KB/s
Rx Throughput : 2000 KB/s
Fan Speed : 0 %
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 11177 MiB
Used : 751 MiB
Free : 10426 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 3 %
Memory : 1 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 35 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 60.67 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 1480 MHz
SM : 1480 MHz
Memory : 5508 MHz
Video : 1265 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5505 MHz
Video : 1620 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1454
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 40 MiB
Process ID : 1533
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 80 MiB
Process ID : 2450
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 363 MiB
Process ID : 2631
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 142 MiB
Process ID : 3068
Type : G
Name : /opt/google/chrome/chrome --type=gpu-process --field-trial-handle=15466691898050642703,2714747135580672923,131072 --enable-crash-reporter=b6227030-26a9-487c-b99f-efddda704fbf, --gpu-preferences=KAAAAAAAAACAAABAAQAAAAAAAAAAAGAAAAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAA --enable-crash-reporter=b6227030-26a9-487c-b99f-efddda704fbf, --service-request-channel-token=405587616121577545
Used GPU Memory : 121 MiB
docker version
Client:
Version: 18.06.1-ce
API version: 1.38
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:24:51 2018
OS/Arch: linux/amd64
Experimental: false
Server:
Engine:
Version: 18.06.1-ce
API version: 1.38 (minimum version 1.12)
Go version: go1.10.3
Git commit: e68fc7a
Built: Tue Aug 21 17:23:15 2018
OS/Arch: linux/amd64
Experimental: false
dpkg -l '*nvidia*' _or_ rpm -qa '*nvidia*'
un libgldispatch0-nvidia <none> <none> (no description available)
ii libnvidia-cfg1-410:amd64 410.73-0ubuntu0~gp amd64 NVIDIA binary OpenGL/GLX configuration library
un libnvidia-cfg1-any <none> <none> (no description available)
un libnvidia-common <none> <none> (no description available)
ii libnvidia-common-410 410.73-0ubuntu0~gp all Shared files used by the NVIDIA libraries
rc libnvidia-compute-390:amd6 390.87-0ubuntu1 amd64 NVIDIA libcompute package
rc libnvidia-compute-390:i386 390.87-0ubuntu1 i386 NVIDIA libcompute package
rc libnvidia-compute-396:amd6 396.54-0ubuntu0~gp amd64 NVIDIA libcompute package
rc libnvidia-compute-396:i386 396.54-0ubuntu0~gp i386 NVIDIA libcompute package
ii libnvidia-compute-410:amd6 410.73-0ubuntu0~gp amd64 NVIDIA libcompute package
ii libnvidia-compute-410:i386 410.73-0ubuntu0~gp i386 NVIDIA libcompute package
ii libnvidia-container-tools 1.0.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.0-1 amd64 NVIDIA container runtime library
un libnvidia-decode <none> <none> (no description available)
ii libnvidia-decode-410:amd64 410.73-0ubuntu0~gp amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-410:i386 410.73-0ubuntu0~gp i386 NVIDIA Video Decoding runtime libraries
un libnvidia-encode <none> <none> (no description available)
ii libnvidia-encode-410:amd64 410.73-0ubuntu0~gp amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-410:i386 410.73-0ubuntu0~gp i386 NVENC Video Encoding runtime library
un libnvidia-fbc1 <none> <none> (no description available)
ii libnvidia-fbc1-410:amd64 410.73-0ubuntu0~gp amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-410:i386 410.73-0ubuntu0~gp i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
un libnvidia-gl <none> <none> (no description available)
ii libnvidia-gl-410:amd64 410.73-0ubuntu0~gp amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-410:i386 410.73-0ubuntu0~gp i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un libnvidia-ifr1 <none> <none> (no description available)
ii libnvidia-ifr1-410:amd64 410.73-0ubuntu0~gp amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii libnvidia-ifr1-410:i386 410.73-0ubuntu0~gp i386 NVIDIA OpenGL-based Inband Frame Readback runtime library
un nvidia-304 <none> <none> (no description available)
un nvidia-340 <none> <none> (no description available)
un nvidia-384 <none> <none> (no description available)
un nvidia-390 <none> <none> (no description available)
un nvidia-common <none> <none> (no description available)
rc nvidia-compute-utils-390 390.87-0ubuntu1 amd64 NVIDIA compute utilities
rc nvidia-compute-utils-396 396.54-0ubuntu0~gp amd64 NVIDIA compute utilities
ii nvidia-compute-utils-410 410.73-0ubuntu0~gp amd64 NVIDIA compute utilities
ii nvidia-container-runtime 2.0.0+docker18.06. amd64 NVIDIA container runtime
ii nvidia-container-runtime-h 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-cuda-dev 9.1.85-4ubuntu1 amd64 NVIDIA CUDA development files
ii nvidia-cuda-doc 9.1.85-4ubuntu1 all NVIDIA CUDA and OpenCL documentation
ii nvidia-cuda-gdb 9.1.85-4ubuntu1 amd64 NVIDIA CUDA Debugger (GDB)
ii nvidia-cuda-toolkit 9.1.85-4ubuntu1 amd64 NVIDIA CUDA development toolkit
rc nvidia-dkms-390 390.87-0ubuntu1 amd64 NVIDIA DKMS package
rc nvidia-dkms-396 396.54-0ubuntu0~gp amd64 NVIDIA DKMS package
ii nvidia-dkms-410 410.73-0ubuntu0~gp amd64 NVIDIA DKMS package
un nvidia-dkms-kernel <none> <none> (no description available)
un nvidia-docker <none> <none> (no description available)
ii nvidia-docker2 2.0.3+docker18.06. all nvidia-docker CLI wrapper
un nvidia-driver <none> <none> (no description available)
ii nvidia-driver-410 410.73-0ubuntu0~gp amd64 NVIDIA driver metapackage
un nvidia-driver-binary <none> <none> (no description available)
un nvidia-kernel-common <none> <none> (no description available)
rc nvidia-kernel-common-390 390.87-0ubuntu1 amd64 Shared files used with the kernel module
rc nvidia-kernel-common-396 396.54-0ubuntu0~gp amd64 Shared files used with the kernel module
ii nvidia-kernel-common-410 410.73-0ubuntu0~gp amd64 Shared files used with the kernel module
un nvidia-kernel-source <none> <none> (no description available)
un nvidia-kernel-source-390 <none> <none> (no description available)
un nvidia-kernel-source-396 <none> <none> (no description available)
ii nvidia-kernel-source-410 410.73-0ubuntu0~gp amd64 NVIDIA kernel source package
un nvidia-legacy-304xx-vdpau- <none> <none> (no description available)
un nvidia-legacy-340xx-vdpau- <none> <none> (no description available)
un nvidia-libopencl1 <none> <none> (no description available)
un nvidia-libopencl1-dev <none> <none> (no description available)
ii nvidia-opencl-dev:amd64 9.1.85-4ubuntu1 amd64 NVIDIA OpenCL development files
un nvidia-opencl-icd <none> <none> (no description available)
ii nvidia-openjdk-8-jre 9.1.85-4ubuntu1 amd64 NVIDIA provided OpenJDK Java runtime, using Hotspot JIT
un nvidia-persistenced <none> <none> (no description available)
ii nvidia-prime 0.8.10 all Tools to enable NVIDIA's Prime
ii nvidia-profiler 9.1.85-4ubuntu1 amd64 NVIDIA Profiler for CUDA and OpenCL
ii nvidia-settings 410.73-0ubuntu0~gp amd64 Tool for configuring the NVIDIA graphics driver
un nvidia-settings-binary <none> <none> (no description available)
un nvidia-smi <none> <none> (no description available)
un nvidia-utils <none> <none> (no description available)
ii nvidia-utils-410 410.73-0ubuntu0~gp amd64 NVIDIA driver support binaries
un nvidia-vdpau-driver <none> <none> (no description available)
ii nvidia-visual-profiler 9.1.85-4ubuntu1 amd64 NVIDIA Visual Profiler for CUDA and OpenCL
ii xserver-xorg-video-nvidia- 410.73-0ubuntu0~gp amd64 NVIDIA binary Xorg driver
nvidia-container-cli -V
version: 1.0.0
build date: 2018-09-20T20:19+00:00
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: x86_64-linux-gnu-gcc-7 7.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Do I have to install any other libraries apart from the nvidia drivers on the host machine?
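For anyone hitting the same wall, here is a quick way to check whether the loader cache inside the container actually knows about the driver libraries (a minimal sketch using the image from above; the library paths are typical locations and may vary by image and driver version):

```
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base bash -c '
  # is the driver library present in the ld.so cache?
  ldconfig -p | grep -E "libnvidia-ml|libcublas"
  # the files themselves are usually bind-mounted even when the cache is stale
  ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* /usr/local/cuda/lib64/libcublas* 2>/dev/null
'
```

If the files show up in ls but not in ldconfig -p, the cache baked into the image is stale, which matches the symptoms discussed below.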
Bumping up. I am having exactly the same problem, also on Ubuntu 18.10 and driver version 390.87.
I have similar symptoms but I can run nvidia-smi after executing ldconfig inside the container. I'm using driver version 410.73.
>docker run --runtime=nvidia --rm -it nvidia/cuda:9.0-base bash
root@9b2ab11c3ff9:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@9b2ab11c3ff9:/# ldconfig
root@9b2ab11c3ff9:/# nvidia-smi
Fri Nov 2 15:35:25 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.73 Driver Version: 410.73 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 Off | N/A |
| N/A 49C P8 N/A / N/A | 289MiB / 4040MiB | 14% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
@symmsaur Can you import TensorFlow in the container after running ldconfig?
Mmm, given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
https://github.com/NVIDIA/libnvidia-container/commit/deccb2801502675bd283c6936861814dbca99ecd
You would need to wait for the next release of the library.
> Mmm, given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
> NVIDIA/libnvidia-container@deccb28
> You would need to wait for the next release of the library.
Thanks for the information. I will wait for the next release, or compile libnvidia-container myself if it isn't resolved by then. Running ldconfig manually helped though!
Many thanks to @symmsaur...
Best
The same for me.
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base ldconfig && nvidia-smi
works; without ldconfig it fails with the same error.
Running ldconfig inside the container fixes the failures to resolve .so libraries (they then resolve to the host-provided libraries): TensorFlow (image nvcr.io/nvidia/tensorflow:18.09-py3) imports and runs fine after that.
I can confirm that the above-mentioned commit fixes the problem: I re-compiled the latest master branch and replaced the library in my system path; now the NGC TensorFlow image works out of the box on Ubuntu 18.10 with nvidia driver 415.25.
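For reference, rebuilding the library from master looks roughly like this (a sketch only; the exact make targets, build dependencies, and install paths depend on your distribution, so treat them as assumptions):

```
git clone https://github.com/NVIDIA/libnvidia-container.git
cd libnvidia-container
make                # native build; needs gcc, make, and the library's dev dependencies
sudo make install   # replaces the installed libnvidia-container in the system path
sudo ldconfig       # refresh the host's loader cache afterwards
```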
> The same for me.
> docker run --runtime=nvidia --rm nvidia/cuda:9.0-base ldconfig && nvidia-smi
@ddurnev you probably ran nvidia-smi on the host (compare the Processes list inside and outside the container)
@lccro Yes, you're right: this runs only the first part, ldconfig, inside the container; the correct command is something like:
docker run --runtime=nvidia --rm nvidia/cuda:9.0-base /bin/bash -c "ldconfig && nvidia-smi"
Still, nvidia-smi works without ldconfig only after the patch for libnvidia-container is applied.
I have the same problem on Fedora 29, with nvidia 415 driver and nvidia-docker 2.0.3
$ sudo docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
473ede7ed136: Pull complete
c46b5fa4d940: Pull complete
93ae3df89c92: Pull complete
6b1eed27cade: Pull complete
cb5511f09cc0: Pull complete
4173c1e5c714: Pull complete
Digest: sha256:7ba25f8ec32821f4225a73d6cd3df5ccf70ecc9622724f64c61b123f2bde5b90
Status: Downloaded newer image for nvidia/cuda:10.0-base
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
But on the host it works well:
$ nvidia-smi
Thu Jan 3 14:11:28 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25 Driver Version: 415.25 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:09:00.0 On | N/A |
| 28% 30C P8 8W / 180W | 341MiB / 8116MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1328 G /usr/libexec/Xorg 40MiB |
| 0 1510 G /usr/bin/gnome-shell 48MiB |
| 0 1806 G /usr/libexec/Xorg 126MiB |
| 0 1922 G /usr/bin/gnome-shell 122MiB |
+-----------------------------------------------------------------------------+
Additional information about the nvidia card:
$ whereis nvidia-smi
nvidia-smi: /usr/bin/nvidia-smi /usr/share/man/man1/nvidia-smi.1.gz
$ nvidia-installer -v |grep version
nvidia-installer: version 415.25
$ uname -a
Linux localhost.localdomain 4.19.13-300.fc29.x86_64 #1 SMP Sat Dec 29 22:54:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ lspci |grep -E "VGA|3D"
09:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
More information about nvidia-docker:
$ nvidia-docker version
NVIDIA Docker: 2.0.3
I have followed this guide for the nvidia driver installation process.
@botalaszlo I have the same problem on Fedora 29 after dnf update today.
Running ldconfig in the container does make it work; the .so file is found in /usr/local/cuda-9.2/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
Does this work for you?
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base bash -c "ldconfig; nvidia-smi"
@andyneff Perfect! This works fine.
$ docker run --runtime=nvidia --rm nvidia/cuda:10.0-base bash -c "ldconfig; nvidia-smi"
Fri Jan 4 16:43:54 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25 Driver Version: 415.25 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 00000000:09:00.0 On | N/A |
| 35% 49C P8 8W / 180W | 254MiB / 8116MiB | 1% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Maybe the documentation should be updated with this note :)
@botalaszlo It's not a documentation bug; you shouldn't have to run ldconfig. This is a bug, and manually running ldconfig is just a workaround for the ld.so cache not being right in the image.
I just found out today the hard way that this bug affects more than just nvidia stuff.
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base ldconfig -p
0 libs found in cache `/etc/ld.so.cache'
This breaks anything in Python that uses find_library (a lot of things), if not everything that relies on the ld cache.
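A minimal sketch of that knock-on effect, assuming an image that ships python3 (the plain CUDA base image may not; substitute any Python-capable image for the placeholder). On Linux, ctypes.util.find_library works by parsing `ldconfig -p` output (falling back to gcc/ld only if a toolchain is installed), so an empty cache makes it return None:

```
docker run --runtime=nvidia --rm <image-with-python3> bash -c \
  'python3 -c "from ctypes.util import find_library; print(find_library(\"cuda\"))"'
# -> None while /etc/ld.so.cache is empty
docker run --runtime=nvidia --rm <image-with-python3> bash -c \
  'ldconfig; python3 -c "from ctypes.util import find_library; print(find_library(\"cuda\"))"'
# -> libcuda.so.1 once the cache has been rebuilt
```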
@flx42 Any idea when the next release will be?
This should be fixed with the latest version of the libnvidia-container packages.
Closing, feel free to reopen if the bug persists.
Tested on Fedora 29, updated
After the update, confirmed fixed! Thanks @RenaudWasTaken
I'm still experiencing this bug; running ldconfig makes nvidia-smi work.
How can I make it work without running ldconfig first?
Just supplying more (possibly useless) info. Still working on Fedora:
docker run --runtime=nvidia --rm nvidia/cuda@sha256:3cba5c5a8f37ba05b2710071907bd8da22ad1dc828025687b2435b1308a138ff nvidia-smi #that's today's digest id for tag 10.0-base
@edoardogiacomello What is your current version of ld?
> @edoardogiacomello What is your current version of ld?
on the host I got: GNU ld (GNU Binutils for Ubuntu) 2.30
inside the docker container: GNU ld (GNU Binutils for Ubuntu) 2.26.1
Yep, same issue here with the latest version:
docker run --gpus all nvidia/cuda:10.1-base nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
docker run --gpus all nvidia/cuda:10.1-base ldconfig && nvidia-smi
Fri Aug 30 14:11:27 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 Off | 00000000:01:00.0 On | N/A |
| N/A 48C P8 5W / N/A | 223MiB / 7973MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1942 G /usr/lib/xorg/Xorg 18MiB |
| 0 2057 G /usr/bin/gnome-shell 57MiB |
| 0 2936 G /usr/lib/xorg/Xorg 69MiB |
| 0 3073 G /usr/bin/gnome-shell 76MiB |
+-----------------------------------------------------------------------------+
So yeah, still broken:
docker --version
Docker version 19.03.1, build 74b1e89
> So yeah, still broken:
Yes -- I'm also stumbling over this bug on Debian testing:
~$ docker run --gpus all nvidia/cuda nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
vs.
~$ docker run --gpus all nvidia/cuda ldconfig && nvidia-smi
Wed Sep 18 17:13:19 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.40 Driver Version: 430.40 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... On | 00000000:01:00.0 Off | N/A |
| 33% 29C P8 6W / 180W | 1MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
~$ docker version
Client: Docker Engine - Community
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:29:29 2019
OS/Arch: linux/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.12)
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:28:05 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
libnvidia-container1:amd64/buster 1.0.5-1 uptodate
The ldconfig workaround doesn't look acceptable...
How can we finally fix this long-standing issue?
Hi "nvidia",
Can you provide an eta on this?
It is really painful to run ldconfig for each command (and to enable root access/passwordless sudo in the container).
Many thanks and kind regards.
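Until there's a proper fix, a small shell wrapper can at least hide the boilerplate (a sketch; the function name is made up, and the container runs as root by default, so ldconfig needs no sudo inside):

```
# wrap `docker run` so that ldconfig runs before the actual command
nvrun() {
  local image="$1"; shift
  docker run --gpus all --rm "$image" bash -c 'ldconfig; exec "$@"' -- "$@"
}

# usage:
nvrun nvidia/cuda:10.1-base nvidia-smi
```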
Works for me on Ubuntu 18.04 and Debian 10.
Here's a run from scratch on Debian 10 (looks the same for me on Ubuntu as well), without ever manually doing ldconfig. I'm using nvidia-container-toolkit and have removed the old nvidia-docker2:
~/ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Unable to find image 'nvidia/cuda:9.0-base' locally
9.0-base: Pulling from nvidia/cuda
f7277927d38a: Pull complete
8d3eac894db4: Pull complete
edf72af6d627: Pull complete
3e4f86211d23: Pull complete
d6e9603ff777: Pull complete
9454aa7cddfc: Pull complete
a296dc1cdef1: Pull complete
Digest: sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44
Status: Downloaded newer image for nvidia/cuda:9.0-base
Thu Oct 3 17:44:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 710 Off | 00000000:65:00.0 N/A | N/A |
| 50% 39C P0 N/A / N/A | 0MiB / 2001MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I recommend uninstalling and re-installing the driver and packages. It's possible your host system is in a strange state and it's impacting something in your setup.
@glennie
Sorry, we don't see this issue on our end. Without a better understanding of what problem you're specifically facing, we can't offer an ETA on a fix.
@glennie @mash-graz @Brainiarc7 do any of you get the same result if you use nvjmayo's exact same sha?
docker run --runtime=nvidia --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
@andyneff
Your command line produces this error message on my machine:
local@bonsai:~$ docker run --runtime=nvidia --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
using the --gpus option instead produces:
local@bonsai:~$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
and manually adding ldconfig finally works again:
local@bonsai:~$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 ldconfig && nvidia-smi
Mon Oct 7 19:10:11 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 107... On | 00000000:01:00.0 Off | N/A |
| 33% 26C P8 6W / 180W | 1MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But I should perhaps mention that I do not use the nvidia driver on this machine for the actual video output. I prefer to utilize the onboard Intel chip for this purpose, because otherwise I'm not able to share the graphics card via PCIe passthrough to qemu-kvm instances, and I mostly need the nvidia card only for CUDA-based GPGPU work. The setup could therefore differ slightly from other installations.
Hello,
> ~/ docker run --runtime=nvidia --rm nvidia/cuda:9.0-base nvidia-smi
Maybe I'm missing something here... Why are you (@nvjmayo) using the --runtime option?
I used --gpus all (as I've got docker 19.03.2).
Using the sha256 specified by @andyneff with --gpus all I still have the same issue:
[glennie@hestia ~]$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
But, it works when I use ldconfig before:
[glennie@hestia ~]$ docker run --gpus all --rm nvidia/cuda@sha256:1883759ad42016faba1e063d6d86d5875cecf21c420a5c1c20c27c41e46dae44 bash -c 'ldconfig && nvidia-smi'
Mon Oct 7 19:12:49 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50 Driver Version: 430.50 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce MX130 Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 N/A / N/A | 0MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Kind regards,
> Maybe I'm missing something here... Why are you (@nvjmayo) using the --runtime option?
My mistake, I have multiple runtimes installed for a bunch of different environments (both for docker and podman). I should have pasted the canonical form. Sorry for the confusion.
> But, it works when I use ldconfig before:
I'll ask the team to bump up the priority on fixing this. It's a question of at what stage to run the container hooks. Automatically running ldconfig when needed is something we're looking into. When to do it, what mechanism to use, and whether we should stop a running container are all open questions for an implementation.
The best way to work around the issue right now is to run ldconfig in the container whenever you upgrade your host driver. Admittedly inconvenient.
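One way to do that without typing it each run is to bake it into the image's entrypoint, so ldconfig runs at container start, after the runtime hook has already mounted the driver libraries (a sketch; the image tag and name are illustrative):

```
cat > Dockerfile <<'EOF'
FROM nvidia/cuda:10.0-base
# rebuild the loader cache, then exec whatever command was requested
ENTRYPOINT ["/bin/bash", "-c", "ldconfig && exec \"$@\"", "--"]
EOF
docker build -t cuda-ldconfig .
docker run --gpus all --rm cuda-ldconfig nvidia-smi
```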
Hello!
Can you give us some more information?
uname -a
ldconfig --version
Thanks!
> Can you give us some more information?
> uname -a
~$ uname -a
Linux bonsai 5.2.0-3-amd64 #1 SMP Debian 5.2.17-1 (2019-09-26) x86_64 GNU/Linux
> ldconfig --version
~$ sudo ldconfig --version
ldconfig (Debian GLIBC 2.29-2) 2.29
i hope, that helps!
BTW: I'm using Debian testing as a rolling-release solution, which isn't uncommon for GPGPU/ML work, because the software in the stable Debian branch is usually too outdated for the requirements and fast progress in this field.
Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why, but this helped in my case.
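For convenience, the same edit as a one-liner (paths exactly as above; back up the file first, since this tweaks the runtime's config):

```
sudo cp /etc/nvidia-container-runtime/config.toml{,.bak}
sudo sed -i 's|"@/sbin/ldconfig"|"/sbin/ldconfig"|' /etc/nvidia-container-runtime/config.toml
```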
Is the reason for, or the actual meaning of, this "@" syntax used in https://gitlab.com/nvidia/container-toolkit/toolkit/blob/master/config/config.toml.debian documented or explained anywhere?
> Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why, but this helped in my case.
Thank you, works for me. I am using Debian Testing (same as @mash-graz).
Is there gonna be an official update with a fix from Nvidia?
Thank you @lyon667. It worked for me as well after wasting many hours of my time. Why does this work @RenaudWasTaken?
> Is the reason for, or the actual meaning of, this "@" syntax used in https://gitlab.com/nvidia/container-toolkit/toolkit/blob/master/config/config.toml.debian documented or explained anywhere?
I did not find any documentation, but it seems to be processed here in nvc_ldcache_update in libnvidia-container. From that code, the leading "@" apparently makes libnvidia-container run the ldconfig binary from the host rather than the one inside the container, which would explain why dropping it changes the behaviour on Debian.
> Can you try replacing "@/sbin/ldconfig" with "/sbin/ldconfig" in /etc/nvidia-container-runtime/config.toml? Not sure why, but this helped in my case.
this solved it for me.
> this solved it for me.
Yes! -- this manual removal of the @-sign works for me as well.
I also don't understand why this particular issue still isn't fixed in the released nvidia-docker packages and still affects Debian installations.