I have already gone through the existing issues with similar errors but wasn't able to fix the problem in my case.
I have nvidia-docker 2 installed on my host. I have been using the host for training some deep learning models and everything seems to be working fine, so I believe the issue lies somewhere with docker/nvidia-docker.
Please let me know if any information beyond what is already attached is required.
Thanks!
Running
docker run --runtime=nvidia --rm nvidia/cuda:9.2-base nvidia-smi
results in
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.2 --pid=18037 /var/lib/docker/overlay2/e670a6ba48c5a5b17cbdf79d3425aa37697bf6d7dd6f5303fcfa723b6aa431ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
On my host:
nvidia-smi
Thu Dec 13 01:15:38 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:04:00.0 Off | 0 |
| N/A 56C P0 61W / 149W | 7364MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:05:00.0 Off | 0 |
| N/A 44C P0 73W / 149W | 6220MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 6381MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 On | 00000000:81:00.0 Off | 0 |
| N/A 35C P0 51W / 250W | 1651MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 On | 00000000:82:00.0 Off | 0 |
| N/A 23C P8 10W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7162 C ...airseq/nbla8/deps/miniconda3/bin/python 2630MiB |
| 0 7453 C ...airseq/nbla9/deps/miniconda3/bin/python 2954MiB |
sudo nvidia-container-cli -k -d /dev/tty info results in:
-- WARNING, the following logs are for debugging purposes only --
I1212 18:16:59.412952 30205 nvc.c:281] initializing library context (version=1.0.0, build=881c88e2e5bb682c9bb14e68bd165cfb64563bb1)
I1212 18:16:59.413022 30205 nvc.c:255] using root /
I1212 18:16:59.413034 30205 nvc.c:256] using ldcache /etc/ld.so.cache
I1212 18:16:59.413044 30205 nvc.c:257] using unprivileged user 65534:65534
I1212 18:16:59.415662 30206 nvc.c:191] loading kernel module nvidia
I1212 18:16:59.416115 30206 nvc.c:203] loading kernel module nvidia_uvm
I1212 18:16:59.416397 30206 nvc.c:211] loading kernel module nvidia_modeset
I1212 18:16:59.416859 30207 driver.c:133] starting driver service
E1212 18:16:59.417155 30207 driver.c:197] could not start driver service: load library failed: libcuda.so.1: cannot open shared object file: permission denied
I1212 18:16:59.417322 30205 driver.c:233] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request
ldconfig -p | grep cuda results in:
libicudata.so.55 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.55
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6) => /usr/lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /usr/lib/i386-linux-gnu/libcuda.so
Seeing the exact same issue with Ubuntu 16.04.5 and driver 410.79 over here. Mine is an Azure P40 GPU virtual machine.
Hmm, what are the permissions on libcuda.so?
Thanks a lot!
@RenaudWasTaken thanks for taking this up.
On running: ls -l /usr/lib/i386-linux-gnu/ | grep cuda
lrwxrwxrwx 1 root root 12 Aug 22 01:19 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Aug 22 01:19 libcuda.so.1 -> libcuda.so.396.54
-rw-r--r-- 1 root root 12851992 Aug 15 13:01 libcuda.so.396.54
Output for: ls -l /usr/lib/x86_64-linux-gnu/ | grep cuda
lrwxrwxrwx 1 root root 12 Aug 22 01:19 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Aug 22 01:19 libcuda.so.1 -> libcuda.so.396.54
-rw-r--r-- 1 root root 14074232 Aug 15 13:17 libcuda.so.396.54
lrwxrwxrwx 1 root root 18 Mar 27 2018 libicudata.so.55 -> libicudata.so.55.1
-rw-r--r-- 1 root root 25913104 Mar 27 2018 libicudata.so.55.1
Let me know if any other information on my end would help.
Hmm, this is pretty surprising...
Can you run a CUDA program (e.g., the CUDA samples) outside of a container?
Yes, I have trained a few PyTorch models using the GPUs, so it seems to be working fine outside the container.
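For reference, a minimal check that can be run outside the container to confirm the driver is usable (a rough sketch; it assumes PyTorch is installed in the host environment, and the interpreter path may differ):
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
If this prints True along with the expected GPU count, the host-side driver setup is at least functional for CUDA workloads.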
After further inspection, while my issue exhibited similar symptoms to reachtarunhere's issue, I believe they are separate problems.
My issue was that, when installing the nvidia-410 package from the NVIDIA apt repository, the CUDA dependency package containing libcuda.so.1 was unavailable to nodes in our Azure region (possibly due to a node in the NVIDIA CDN that was missing packages?), so the driver installation ended up missing necessary files.
We've since moved to downloading the .deb files directly and installing the driver from those, so we no longer rely on the NVIDIA CDN having all the appropriate packages. This has resolved the issue for our team.
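As a rough sketch of that approach (the file name and URL below are placeholders, not real package names; substitute the actual driver .deb for your version downloaded from NVIDIA):
wget https://<nvidia-download-mirror>/nvidia-410_<version>_amd64.deb
sudo dpkg -i nvidia-410_<version>_amd64.deb
sudo apt-get install -f
dpkg -i installs the locally downloaded package, and apt-get install -f then fixes up any remaining dependencies, so the driver files themselves no longer depend on the NVIDIA repository being reachable at install time.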
Hello, sorry for the slow replies during the holiday season.
What is your Docker version, and what is your host running on?
The output for docker --version
Docker version 18.09.0, build 4d60db4
Output for lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
Thanks!
This looks very much like a driver installation issue that only manifests itself when you try to use the driver with different permissions (i.e., in a container setting).
Reinstalling the driver (uninstalling it and then installing a new version) should fix your permission issues; sorry I didn't diagnose this earlier.
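On Ubuntu 16.04 that could look roughly like this (a sketch only; the package name depends on the driver version you want, e.g. nvidia-396 or nvidia-410, and purging removes the currently installed driver):
sudo apt-get purge 'nvidia-*'
sudo apt-get update
sudo apt-get install nvidia-410
sudo reboot
A reboot afterwards makes sure the freshly installed kernel modules are the ones actually loaded.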
Feel free to re-open if it doesn't fix your issue!
Anyone still have the same issue?