I have already gone through the existing issues with similar errors but wasn't able to fix the problem in my case.
I have nvidia-docker 2 installed on my host. I have been using the host for training some deep learning models and everything seems to be working fine, so I believe the issue lies somewhere with docker/nvidia-docker.
Please let me know if any information beyond what is already attached is required.
Thanks!
Running
docker run --runtime=nvidia --rm nvidia/cuda:9.2-base nvidia-smi
results in
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.2 --pid=18037 /var/lib/docker/overlay2/e670a6ba48c5a5b17cbdf79d3425aa37697bf6d7dd6f5303fcfa723b6aa431ca/merged]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
On my host:
nvidia-smi
Thu Dec 13 01:15:38 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.54 Driver Version: 396.54 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:04:00.0 Off | 0 |
| N/A 56C P0 61W / 149W | 7364MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:05:00.0 Off | 0 |
| N/A 44C P0 73W / 149W | 6220MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P40 On | 00000000:06:00.0 Off | 0 |
| N/A 34C P0 50W / 250W | 6381MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P40 On | 00000000:81:00.0 Off | 0 |
| N/A 35C P0 51W / 250W | 1651MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla P40 On | 00000000:82:00.0 Off | 0 |
| N/A 23C P8 10W / 250W | 10MiB / 22919MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 7162 C ...airseq/nbla8/deps/miniconda3/bin/python 2630MiB |
| 0 7453 C ...airseq/nbla9/deps/miniconda3/bin/python 2954MiB |
sudo nvidia-container-cli -k -d /dev/tty info results in:
-- WARNING, the following logs are for debugging purposes only --
I1212 18:16:59.412952 30205 nvc.c:281] initializing library context (version=1.0.0, build=881c88e2e5bb682c9bb14e68bd165cfb64563bb1)
I1212 18:16:59.413022 30205 nvc.c:255] using root /
I1212 18:16:59.413034 30205 nvc.c:256] using ldcache /etc/ld.so.cache
I1212 18:16:59.413044 30205 nvc.c:257] using unprivileged user 65534:65534
I1212 18:16:59.415662 30206 nvc.c:191] loading kernel module nvidia
I1212 18:16:59.416115 30206 nvc.c:203] loading kernel module nvidia_uvm
I1212 18:16:59.416397 30206 nvc.c:211] loading kernel module nvidia_modeset
I1212 18:16:59.416859 30207 driver.c:133] starting driver service
E1212 18:16:59.417155 30207 driver.c:197] could not start driver service: load library failed: libcuda.so.1: cannot open shared object file: permission denied
I1212 18:16:59.417322 30205 driver.c:233] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request
ldconfig -p | grep cuda results in:
libicudata.so.55 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libicudata.so.55
libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
libcuda.so.1 (libc6) => /usr/lib/i386-linux-gnu/libcuda.so.1
libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so
libcuda.so (libc6) => /usr/lib/i386-linux-gnu/libcuda.so
Seeing the exact same issue with Ubuntu 16.04.5 and driver 410.79 over here. Mine is an Azure P40 GPU virtual machine.
Hmm, what are the permissions on libcuda.so?
Thanks a lot!
@RenaudWasTaken thanks for taking this up.
On running: ls -l /usr/lib/i386-linux-gnu/ | grep cuda
lrwxrwxrwx 1 root root 12 Aug 22 01:19 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Aug 22 01:19 libcuda.so.1 -> libcuda.so.396.54
-rw-r--r-- 1 root root 12851992 Aug 15 13:01 libcuda.so.396.54
Output for: ls -l /usr/lib/x86_64-linux-gnu/ | grep cuda
lrwxrwxrwx 1 root root 12 Aug 22 01:19 libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Aug 22 01:19 libcuda.so.1 -> libcuda.so.396.54
-rw-r--r-- 1 root root 14074232 Aug 15 13:17 libcuda.so.396.54
lrwxrwxrwx 1 root root 18 Mar 27 2018 libicudata.so.55 -> libicudata.so.55.1
-rw-r--r-- 1 root root 25913104 Mar 27 2018 libicudata.so.55.1
Let me know if any other information on my end would help.
Hmm, this is pretty surprising...
Can you run a CUDA program (e.g., the CUDA samples) outside of a container?
Yes, I have trained a few PyTorch models using the GPUs, so it seems to be working fine outside the container.
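For reference, a minimal check that can be run outside the container to confirm the driver is usable (a rough sketch; it assumes PyTorch is installed in the host environment, and the interpreter path may differ):
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
If this prints True along with the expected GPU count, the host-side driver setup is at least functional for CUDA workloads.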
After further inspection, while my issue exhibited similar symptoms to reachtarunhere's issue, I believe they are separate problems.
My issue was that, when installing the nvidia-410 package from the NVIDIA apt repository, the CUDA dependency package containing libcuda.so.1 was unavailable to nodes in our Azure region (possibly due to a node in the NVIDIA CDN that was missing packages?), so the driver installation ended up missing necessary files.
We've since moved to downloading the .deb files directly and installing the driver from those, so we no longer rely on the NVIDIA CDN having all the appropriate packages. This has resolved the issue for our team.
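As a rough sketch of that approach (the file name and URL below are placeholders, not real package names; substitute the actual driver .deb for your version downloaded from NVIDIA):
wget https://<nvidia-download-mirror>/nvidia-410_<version>_amd64.deb
sudo dpkg -i nvidia-410_<version>_amd64.deb
sudo apt-get install -f
dpkg -i installs the locally downloaded package, and apt-get install -f then fixes up any remaining dependencies, so the driver files themselves no longer depend on the NVIDIA repository being reachable at install time.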
Hello, sorry for the slow replies during the holiday season.
What is your Docker version, and what is your host running on?
The output for docker --version
Docker version 18.09.0, build 4d60db4
Output for lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
Thanks!
This looks very much like a driver installation issue that only manifests itself when you try to use the driver with different permissions (i.e., in a container setting).
Reinstalling the driver (uninstalling it and then installing a new version) should fix your permission issues; sorry I didn't diagnose this earlier.
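On Ubuntu 16.04 that could look roughly like this (a sketch only; the package name depends on the driver version you want, e.g. nvidia-396 or nvidia-410, and purging removes the currently installed driver):
sudo apt-get purge 'nvidia-*'
sudo apt-get update
sudo apt-get install nvidia-410
sudo reboot
A reboot afterwards makes sure the freshly installed kernel modules are the ones actually loaded.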
Feel free to re-open if it doesn't fix your issue!
Anyone still have the same issue?