Nvidia-docker: Couldn't find libnvidia-ml.so library in your system

Created on 14 Nov 2018  ·  8 Comments  ·  Source: NVIDIA/nvidia-docker

1. Issue or feature description

Sorry if this is a duplicate, but I have been following circular links that mark the same issue as a duplicate without actually finding a solution. A warning: I'm new to this, so the more I patch the more worried I get that I have made a mess and will have to start again from scratch. Right now I have the nvidia-docker and container RPMs installed corresponding to my Docker version 1.13, so the setup is pretty clean and pristine from the RPM installation.

There appears to be a conflict between the nvidia-installed Docker daemon flags and the default RHEL/CentOS daemon.json: the daemon refuses to start when the same directive is specified both as a command-line flag and in daemon.json. It looks like it could be very simple to fix; a way to compare the two sources is sketched after the excerpt below.

In short:

systemctl status docker
.....

/etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration
...
runtimes: (from flag: [oci], from file: map[nvidia:map[path:/usr/bin/nvidia-container-runtime runtimeArgs:[]]])
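
For reference, here is a minimal sketch of how to see the two conflicting sources side by side (assuming the stock Red Hat unit file; paths may differ on your system):

# Flags baked into the distribution's systemd unit file
# (Red Hat's fork passes --add-runtime and --default-runtime here):
grep ExecStart /usr/lib/systemd/system/docker.service

# Runtime definition installed by nvidia-docker2 into daemon.json;
# dockerd aborts when "runtimes" comes from both places:
cat /etc/docker/daemon.json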

2. Steps to reproduce the issue

sudo systemctl restart docker

3. Information to attach (optional if deemed irrelevant)

I tried clearing daemon.json to contain only {}.
Docker then runs fine, but with the default runtime (oci), and it cannot start NVIDIA GPU images; a quick check is sketched after the status output below.

systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Wed 2018-11-14 14:06:24 GMT; 22min ago
Docs: http://docs.docker.com
Main PID: 2822 (dockerd-current)
Tasks: 18 (limit: 8192)
Memory: 78.2M
CGroup: /system.slice/docker.service
└─2822 /usr/bin/dockerd-current --add-runtime oci=/usr/libexec/docker/docker-runc-current --default-runtime=oci --authorization-plugin=rhel-push-plugin --containerd /run/containerd.>
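
A quick way to confirm which runtimes the daemon actually registered after the change (docker info lists the registered runtimes and the default one):

# After clearing daemon.json, only the runtimes from the unit file's
# flags should show up, with oci as the default:
docker info | grep -i runtime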

All 8 comments

Looks like you are running Red Hat's fork of docker, so you should follow these instructions instead:
https://github.com/NVIDIA/nvidia-docker#centos-7-docker-rhel-7475-docker
You won't install the nvidia-docker2 package, so there will be no daemon.json conflict.
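
At the time, the linked instructions amounted to roughly the following (a sketch; check the README for the exact, current steps):

# Add the nvidia-container-runtime repository for your distribution:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo

# Install only the prestart hook (no nvidia-docker2, no daemon.json edits):
sudo yum install -y nvidia-container-runtime-hook

# Test; no --runtime flag is needed because the OCI hook injects the GPU:
docker run --rm nvidia/cuda:9.0-base nvidia-smi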

Thanks. I carefully removed what I believe is associated with nvidia-docker2:
nvidia-docker2-2.0.3-1.docker1.13.1.noarch
nvidia-container-runtime-hook-1.3.0-1.x86_64
nvidia-container-runtime-2.0.0-1.docker1.13.1.x86_64

Then I ran the second procedure in your link... up to:

yum install -y nvidia-container-runtime-hook

This went OK and the dockerd service starts fine, but running the test gives errors (was I wrong to remove the container RPMs?):

docker run --rm nvidia/cuda:9.0-base nvidia-smi
container_linux.go:247: starting container process caused "process_linux.go:339: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=4517 /var/lib/docker/overlay2/6196a356e3cb283ade76732913a84ce62583e3b0dd4020bcf60042cfbc5b249e/merged]\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\n\""
/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused "process_linux.go:339: running prestart hook 1 caused \"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=4517 /var/lib/docker/overlay2/6196a356e3cb283ade76732913a84ce62583e3b0dd4020bcf60042cfbc5b249e/merged]\nnvidia-container-cli: ldcache error: process /sbin/ldconfig failed with error code: 1\n\"".
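
One way to chase the ldconfig failure is to run the container CLI that the hook invokes by hand, outside of Docker (a sketch using the same binary named in the error above):

# Ask the library to enumerate the driver; --debug sends its log to stderr
sudo nvidia-container-cli --load-kmods --debug=/dev/stderr info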

Hello!

Do you mind giving a bit more information so we can help you debug this:

  • [ ] Kernel version from uname -a
  • [ ] Any relevant kernel output lines from dmesg
  • [ ] Driver information from nvidia-smi -a
  • [ ] Docker version from docker version
  • [ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
  • [ ] NVIDIA container library version from nvidia-container-cli -V
  • [ ] NVIDIA container library logs (see troubleshooting)
  • [X] Docker command, image and tag used: docker run --rm nvidia/cuda:9.0-base nvidia-smi

Hi, thanks for your response. Note this is the situation now, after following the previous instructions.

uname -a:

Linux ccs1 4.18.17-300.fc29.x86_64 #1 SMP Mon Nov 5 17:56:16 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
I am running Fedora 29; where needed, I substituted the corresponding CentOS values for the environment variables.

Possibly relevant output from dmesg:

[ 15.783041] nvidia: loading out-of-tree module taints kernel.
[ 15.783053] nvidia: module license 'NVIDIA' taints kernel.
[ 15.783054] Disabling lock debugging due to kernel taint
[ 15.803925] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 15.816497] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[ 15.817219] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 16.020085] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 410.72 Wed Oct 17 20:08:45 CDT 2018 (using threaded interrupts)
[ 16.074132] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 237
[ 16.112723] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 410.72 Wed Oct 17 20:07:15 CDT 2018
[ 16.125118] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 16.125121] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0

nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Sun Nov 25 16:16:52 2018
Driver Version : 410.72
CUDA Version : 10.0

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : GeForce GTX 1050 Ti
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-f281ece7-f156-c934-7dd5-2d33d9339d43
Minor Number : 0
VBIOS Version : 86.07.39.00.30
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1C8210DE
Bus Id : 00000000:01:00.0
Sub System Id : 0xA45419DA
GPU Link Info
PCIe Generation
Max : 1
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 265000 KB/s
Fan Speed : 45 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 4039 MiB
Used : 152 MiB
Free : 3887 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 6 MiB
Free : 250 MiB
Compute Mode : Default
Utilization
Gpu : 7 %
Memory : 5 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 29 C
GPU Shutdown Temp : 102 C
GPU Slowdown Temp : 99 C
GPU Max Operating Temp : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : N/A
Power Limit : 75.00 W
Default Power Limit : 75.00 W
Enforced Power Limit : 75.00 W
Min Power Limit : 52.50 W
Max Power Limit : 75.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1923 MHz
SM : 1923 MHz
Memory : 3504 MHz
Video : 1708 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
Process ID : 1064
Type : G
Name : /usr/libexec/Xorg
Used GPU Memory : 80 MiB
Process ID : 1657
Type : G
Name : /usr/bin/kwin_x11
Used GPU Memory : 22 MiB
Process ID : 1664
Type : G
Name : /usr/bin/krunner
Used GPU Memory : 1 MiB
Process ID : 1667
Type : G
Name : /usr/bin/plasmashell
Used GPU Memory : 44 MiB

docker version

Client:
Version: 1.13.1
API version: 1.26
Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
Go version: go1.11beta2
Git commit: accfe55-unsupported
Built: Wed Jul 25 18:54:07 2018
OS/Arch: linux/amd64

Server:
Version: 1.13.1
API version: 1.26 (minimum version 1.12)
Package version: docker-1.13.1-62.git9cb56fd.fc29.x86_64
Go version: go1.11beta2
Git commit: accfe55-unsupported
Built: Wed Jul 25 18:54:07 2018
OS/Arch: linux/amd64
Experimental: false

rpm -qa '*nvidia*'

nvidia-xconfig-410.72-1.fc27.x86_64
nvidia-driver-NvFBCOpenGL-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-devel-410.72-1.fc27.x86_64
nvidia-container-runtime-hook-1.4.0-2.x86_64
kmod-nvidia-4.19.3-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-cuda-libs-410.72-1.fc27.x86_64
nvidia-driver-libs-410.72-1.fc27.x86_64
akmod-nvidia-410.72-1.fc27.x86_64
nvidia-settings-410.72-1.fc27.x86_64
nvidia-libXNVCtrl-410.72-1.fc27.x86_64
nvidia-driver-NVML-410.72-1.fc27.x86_64
libnvidia-container-tools-1.0.0-1.x86_64
nvidia-driver-devel-410.72-1.fc27.x86_64
nvidia-driver-cuda-410.72-1.fc27.x86_64
kmod-nvidia-4.18.17-300.fc29.x86_64-410.72-1.fc29.x86_64
nvidia-driver-410.72-1.fc27.x86_64
nvidia-persistenced-410.72-1.fc27.x86_64
nvidia-modprobe-410.72-1.fc27.x86_64
libnvidia-container1-1.0.0-1.x86_64

nvidia-container-cli -V

version: 1.0.0
build date: 2018-09-20T20:25+0000
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-28)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

docker run --rm nvidia/cuda:9.0-base nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

I tried looking for libnvidia-ml.so:
sudo find / -name libnvidia-ml.so
/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/var/lib/docker/overlay2/53747a6cff62ecaa574033dc954eaf3b2877ad372dafa63687808ba82b4d259a/diff/usr/local/cuda-10.0/targets/x86_64-linux/lib/stubs/libnvidia-ml.so

Adding the stubs dir to the end of my PATH/LD_LIBRARY_PATH doesn't help, but in any case I'm not sure it would help in the environment outside the container.
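
For what it's worth, the stubs under /usr/local/cuda are link-time placeholders rather than the real driver library; the real one ships with the driver as libnvidia-ml.so.1. A quick check of whether the host's linker cache can resolve it:

# If this prints nothing, the driver libraries are not in the ld cache:
ldconfig -p | grep libnvidia-ml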

I'm guessing the problem relates to CUDA 10 being installed on the host while the Docker image uses 9.0, which ultimately was my reason for using a container in the first place: I didn't want the trouble of downgrading CUDA for TensorFlow, so I wanted to use the ready-made NGC container.

Hello!

Sorry for the late reply.
CUDA is backwards compatible, so the host/image version mismatch is not the problem. Here it seems like we aren't finding libnvidia-ml.so on your system.

Given the symptoms, you are probably stumbling into the issue that was fixed by this commit:
NVIDIA/libnvidia-container@deccb28

You would need to wait for the next release of the library (around mid-January) or build the library by hand.
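
Building by hand would look roughly like this (a sketch; the exact targets and prerequisites are in the libnvidia-container README):

# Build and install the library from source at a revision with the fix:
git clone https://github.com/NVIDIA/libnvidia-container.git
cd libnvidia-container
make && sudo make install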

1. docker run -it --rm --runtime=nvidia tensorflow/tensorflow:latest-gpu-py3 bash
2. ldconfig
3. nvidia-smi
// then open a new console and run:
4. docker ps
// find your container id, then save the container to disk:
5. docker commit 4f0d5870605f tensorflow/tensorflow:gpu_fixed
// then you can use the new image you saved; in this container you can execute 'nvidia-smi' and use tensorflow-gpu
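
The same workaround also works without committing a new image; a sketch that rebuilds the container's linker cache at start-up:

docker run -it --rm --runtime=nvidia tensorflow/tensorflow:latest-gpu-py3 \
  bash -c "ldconfig && nvidia-smi"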

This should be fixed with the latest version of the libnvidia-container packages.
Closing, feel free to reopen if the bug persists.
