Hi,
I'm getting a failure on trying to load libcuda.so.1, but my understanding is that CUDA doesn't have to be installed on the host machine, right?
I'm on Fedora 26, with bumblebee installed but optirun shouldn't be needed, right?
[sztamas@nomad ~]$ cat /proc/acpi/bbswitch
0000:01:00.0 ON
[sztamas@nomad ~]$ nvidia-smi
Sat Nov 18 22:59:48 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 Off | 00000000:01:00.0 Off | N/A |
| N/A 37C P0 N/A / N/A | 0MiB / 4041MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[sztamas@nomad ~]$ docker run --runtime=nvidia --rm nvidia/cuda find / -name nvidia-smi
container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=19311 /var/lib/docker/overlay2/248624b66696970549d54634da9cba9a6c6041b9d5d587b2d2cfa6c698a70a7e/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\""
docker: Error response from daemon: oci runtime error: container_linux.go:265: starting container process caused "process_linux.go:368: container init caused \"process_linux.go:351: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=19311 /var/lib/docker/overlay2/248624b66696970549d54634da9cba9a6c6041b9d5d587b2d2cfa6c698a70a7e/merged]\\\\nnvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory\\\\n\\\"\"".
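The useful part of the error above is buried in nested escaping. As a rough sketch, the nvidia-container-cli message can be pulled out with grep; `extract_nvc_error` is a hypothetical helper name, and the pattern assumes the message ends at the first backslash, as it does in the log above.

```shell
#!/bin/sh
# Extract the "nvidia-container-cli: ..." message from a docker error read on stdin.
# extract_nvc_error is a hypothetical helper, not part of docker or nvidia-docker.
extract_nvc_error() {
    # "nvidia-container-cli: " (with the colon) skips the binary path in the
    # exec command; [^\\]* stops at the first backslash of the escaped "\n".
    grep -o 'nvidia-container-cli: [^\\]*'
}

# Usage on a real host (docker writes the error to stderr):
#   docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi 2>&1 | extract_nvc_error
```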
[sztamas@nomad ~]$ nvidia-container-cli --debug=/dev/stdout list --compute
-- WARNING, the following logs are for debugging purposes only --
I1118 21:05:30.464748 19428 nvc.c:250] initializing library context (version=1.0.0, build=ec15c7233bd2de821ad5127cb0de6b52d9d2083c)
I1118 21:05:30.464846 19428 nvc.c:225] using ldcache /etc/ld.so.cache
I1118 21:05:30.464857 19428 nvc.c:226] using unprivileged user 1000:1000
nvidia-container-cli: initialization error: load library failed: libcuda.so.1: cannot open shared object file: no such file or directory
[sztamas@nomad ~]$ ldconfig -p | grep cuda
libicudata.so.57 (libc6,x86-64) => /lib64/libicudata.so.57
libcuda.so.1 (libc6) => /lib/libcuda.so.1
libcuda.so (libc6) => /lib/libcuda.so
[sztamas@nomad ~]$ uname -a
Linux nomad 4.13.12-200.fc26.x86_64 #1 SMP Wed Nov 8 16:47:26 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[sztamas@nomad ~]$ cat /etc/fedora-release
Fedora release 26 (Twenty Six)
[sztamas@nomad ~]$ docker -v
Docker version 17.09.0-ce, build afdb6d4
[sztamas@nomad ~]$ dnf list installed | grep nvidia-docker2
nvidia-docker2.noarch 2.0.1-1.docker17.09.0.ce @nvidia-docker
Any ideas what could be wrong?
Many Thanks.
> I'm getting a failure on trying to load libcuda.so.1, but my understanding is that CUDA doesn't have to be installed on the host machine, right?
That's right, but libcuda.so.1 comes from the driver, not the CUDA toolkit. (not to be confused with libcudart).
Please share the output of:
ldconfig -p | grep cuda
Thanks for the quick answer!
It was in the original post already, maybe I've included too much :)
[sztamas@nomad ~]$ ldconfig -p | grep cuda
libicudata.so.57 (libc6,x86-64) => /lib64/libicudata.so.57
libcuda.so.1 (libc6) => /lib/libcuda.so.1
libcuda.so (libc6) => /lib/libcuda.so
Eh, you have libcuda.so.1, but only the 32-bit version (the entries are tagged libc6, not libc6,x86-64). How did you install the NVIDIA driver?
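The check on the ldconfig output can be sketched as a small filter. This is only an illustration: `classify_libcuda` is a hypothetical helper name, and it assumes an x86_64 host, where ldconfig tags 64-bit entries with `x86-64` and leaves 32-bit entries as plain `libc6`.

```shell
#!/bin/sh
# Classify the libcuda.so.1 entries found in `ldconfig -p` output read from stdin.
# Prints one of: "64-bit", "32-bit only", "missing".
# classify_libcuda is a hypothetical helper, not part of any NVIDIA tool.
classify_libcuda() {
    awk '
        # Match the driver library name exactly; this skips libcudart.so.*
        # and libicudata.so.*, which a plain "grep cuda" also returns.
        $1 == "libcuda.so.1" {
            found = 1
            # Assumption: x86_64 host, where 64-bit entries carry "x86-64".
            if ($2 ~ /x86-64/) has64 = 1
        }
        END {
            if (has64)      print "64-bit"
            else if (found) print "32-bit only"
            else            print "missing"
        }
    '
}

# Usage on a real host:
#   ldconfig -p | classify_libcuda
```

Matching the first field exactly, rather than grepping for "cuda", keeps the CUDA runtime (libcudart) from being mistaken for the driver library.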
I followed the Fedora Bumblebee Wiki page https://fedoraproject.org/wiki/Bumblebee
# dnf -y --nogpgcheck install http://install.linux.ncsu.edu/pub/yum/itecs/public/bumblebee/fedora$(rpm -E %fedora)/noarch/bumblebee-release-1.2-1.noarch.rpm
Used the managed NVidia repo:
# dnf -y --nogpgcheck install http://install.linux.ncsu.edu/pub/yum/itecs/public/bumblebee-nonfree/fedora$(rpm -E %fedora)/noarch/bumblebee-nonfree-release-1.2-1.noarch.rpm
and
# dnf install bumblebee-nvidia bbswitch-dkms VirtualGL.x86_64 VirtualGL.i686 primus.x86_64 primus.i686 kernel-devel
Do you have another libcuda.so.1 installed somewhere else on your system?
I would recommend using the official repository and installing the cuda-drivers package:
http://developer.download.nvidia.com/compute/cuda/repos/fedora25/x86_64/
It's for Fedora 25 though, so the second best choice is to use our installer scripts:
http://www.nvidia.com/object/unix.html
Check also this (old) issue for bumblebee/bbswitch:
https://github.com/NVIDIA/nvidia-docker/issues/16
For Fedora 25+, just use the negativo repository and save yourself a lot of hair tearing getting nvidia and cuda to work in Fedora 25+ on optimus systems.
https://negativo17.org/nvidia-driver-improvements-for-fedora-25/
As a bonus, it comes with a gcc compatibility package so you can use an older compiler to build CUDA code.
Not an issue with nvidia-docker.
This is due to your broken nvidia driver installation. For Ubuntu 16.04 release, you can check the fix here: https://zhuanlan.zhihu.com/p/37519492
@flx42: the current ubuntu 16.04 package nvidia-384 384.130-0ubuntu0.16.04.1 does not provide the libcuda.so.1 (https://packages.ubuntu.com/xenial-updates/amd64/nvidia-384/filelist).
It is provided by libcuda1-384 (https://packages.ubuntu.com/xenial-updates/amd64/libcuda1-384/filelist), which is only a recommended dependency of nvidia-384.
This means that when installing with apt-get install -y --no-install-recommends nvidia-384, libcuda.so.1 is missing, resulting in the initial error.
I'm not sure whether libcuda.so.1 is the full CUDA library or just the driver part that you talked about in your first comment on this issue. I don't know how to improve the situation: is it an Ubuntu packaging bug? Could nvidia-docker somehow pull in libcuda1-384 as a dependency in this scenario (driver installed from the Ubuntu package)? I'm not very optimistic about that possibility.
In the meantime, this issue could be documented (either in the main README or in the wiki FAQ) with something like:
On Ubuntu 16.04: when installing the NVIDIA driver with the Ubuntu package nvidia-384, you need to install its recommended dependency libcuda1-384 too, otherwise you will get this error:
$ docker run -it --rm --runtime=nvidia nvidia/cuda:8.0-runtime sh
docker: Error response from daemon: OCI runtime create failed: container_linux.go:348: starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods --debug=/var/log/nvidia-container-runtime-hook.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=8.0 --pid=5801 /var/lib/docker/aufs/mnt/1b3b6a154308bda7863a5438b68f68d3c0827cb20f67cb2394a2563de9b791a6]\\\\nnvidia-container-cli: initialization error: driver error: failed to process request\\\\n\\\"\"": unknown.
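The workaround described above amounts to naming the libcuda1 package explicitly so that it is installed even when recommended packages are skipped. A minimal sketch follows; it only prints the command for review instead of running it. `print_driver_install_cmd` is a hypothetical helper, and the `libcuda1-<version>` naming is an assumption generalized from the nvidia-384 / libcuda1-384 pair named in this thread.

```shell
#!/bin/sh
# Print (do not run) an apt-get command that installs the Ubuntu 16.04 driver
# package together with its libcuda1 companion, even with recommends disabled.
# print_driver_install_cmd is a hypothetical helper; the libcuda1-<version>
# package name pattern is an assumption based on the packages in this thread.
print_driver_install_cmd() {
    version="$1"
    case "$version" in
        ''|*[!0-9]*)
            # Reject empty or non-numeric versions.
            echo "usage: print_driver_install_cmd <driver-version>" >&2
            return 1
            ;;
    esac
    echo "sudo apt-get install -y --no-install-recommends nvidia-${version} libcuda1-${version}"
}

# Example:
#   print_driver_install_cmd 384
```

Check the printed package names against your release before running the command, since the packaging changed in Ubuntu 18.04.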
For ubuntu 18.04 they changed the packaging and it seems to be OK:
https://packages.ubuntu.com/bionic/nvidia-driver-390 depends on https://packages.ubuntu.com/bionic/amd64/libnvidia-compute-390 which ships libcuda.so.1
We assume that the CUDA driver is properly installed on the host machine before installing nvidia-docker.
You should test your system with a small CUDA sample outside of a container; if you don't have libcuda.so, it will fail. And if it fails outside of containers, it will probably fail with containers.
There are many ways to install our drivers, so we want to keep that to the official CUDA documentation.
@flx42 I'm confused: the initial comment said:
> my understanding is that CUDA doesn't have to be installed on the host machine, right?
...and you confirmed it in your first reply.
Now you say there is an implicit requirement that CUDA is properly installed on the host machine.
My confusion probably comes from CUDA vs CUDA toolkit, but in any case the nvidia-docker README doesn't talk about any host requirement regarding CUDA: it only requires installing the NVIDIA driver.
If NVIDIA considers the libcuda.so.1 to be part of the NVIDIA driver, then it's an ubuntu packaging error. It probably won't be fixed as they can't break such things on 16.04, and they seem to do the right thing on ubuntu 18.04; but this should probably be documented somewhere.
You suggest this should be part of the CUDA documentation, but:
nvidia-docker.
Sorry, I meant "the CUDA driver is properly installed on the host machine". I edited my answer.
But I confirm that libcuda.so.1 is part of the NVIDIA driver. And it's not a packaging error from Ubuntu; it makes sense to want to install a subset of the NVIDIA driver if you know you aren't going to use everything.
> AFAIK, it only talks about NVIDIA drivers provided by the CUDA apt repository or by the .run script, not the Ubuntu package (which is understandable)
The ubuntu package and the CUDA repo package are very similar, only the version might be different.
I had installed nvidia-415 driver.
I checked and libcuda.so.1 was missing
ldconfig -p | grep libcuda
libcudart.so.9.0 (libc6,x86-64) => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so.9.0
libcudart.so (libc6,x86-64) => /usr/local/cuda-9.0/targets/x86_64-linux/lib/libcudart.so
I fixed the problem by installing libcuda from apt-get
sudo apt-get install libcuda1-415