When running a container which relies on Vulkan, I receive the following error: The NVIDIA driver was unable to open 'libnvidia-glvkspirv.so.460.32.03'. This library is required at run time. See related issue: https://github.com/adamrehn/pixel-streaming-linux/issues/43
I have a containerised UE4 application. When I run this on the 'AWS Deep Learning AMI (Ubuntu 18.04)' on a g4dn.xlarge EC2 instance, it runs fine. The NVIDIA Container Toolkit correctly passes the driver into the container. However, when I use the new Docker Compose CLI to launch the container, the driver doesn't appear to be passed in correctly. Other aspects of the NVIDIA driver work fine; for example, the application can initialise a CUDA context.
When I run docker compose convert, I can confirm that the CloudFormation stack requests the exact same instance type and AMI as I set up manually. On my testing instance, when I look in /usr/lib/x86_64-linux-gnu/ on the host, the driver is there. When I look in /usr/lib/x86_64-linux-gnu/ inside the container on my testing instance, the driver is there too. Unfortunately, I am not sure how to check this when the container is run through CloudFormation and the ECS agent.
Expected behaviour: the application runs correctly, as it does on a manually created instance and container.
Observed behaviour: the application crashes with the following error: The NVIDIA driver was unable to open 'libnvidia-glvkspirv.so.460.32.03'. This library is required at run time.
Hi @luc122c, is it possible for you to share the steps to reproduce the error on our end?
Hi @shubham2892. To debug the issue, I'm using the following command to search for the Vulkan drivers: ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk. I'm testing on the following two setups:
The NVIDIA System Management Interface (nvidia-smi) command line utility shows the following output for the two machines:
$ nvidia-smi
Wed Apr 7 16:59:41 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro K1200 On | 00000000:01:00.0 Off | N/A |
| 39% 34C P8 1W / 35W | 13MiB / 4043MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 891 G /usr/lib/xorg/Xorg 8MiB |
| 0 N/A N/A 1134 G /usr/bin/gnome-shell 1MiB |
+-----------------------------------------------------------------------------+
$ nvidia-smi
Wed Apr 7 15:59:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Now, when I run the command on the host machines, I get the following output:
$ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
libnvidia-glvkspirv.so.460.32.03
$ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
libnvidia-glvkspirv.so.450.80.02
So the Vulkan driver is available in both cases. The next step is to see if they are available within containers on these machines. To do this, consider the following docker compose file:
version: '3.8'
services:
  TestServer:
    image: adamrehn/ue4-runtime:18.04-cudagl10.2
    command: "sh -c 'ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk'"
    environment:
      - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          memory: 2Gb
          cpus: "1"
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
This will execute the command in a container from the image adamrehn/ue4-runtime:18.04-cudagl10.2, which is the one I'm using in my application. I'm using the new Docker Compose CLI to run this on both machines. When I do, I get the following output:
Attaching to data_TestServer_1
TestServer_1 | libnvidia-glvkspirv.so.460.32.03
data_TestServer_1 exited with code 0
Attaching to TestServer_1
TestServer_1 | libnvidia-glvkspirv.so.450.80.02
TestServer_1 exited with code 0
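The ls | grep check used above on the hosts and in the containers can be wrapped in a small helper that also signals failure through its exit status, which is handy when the check runs unattended. This is only a minimal sketch: the function name find_glvkspirv is hypothetical, and the default directory is the Ubuntu path used in this report.

```shell
# Hypothetical helper (not part of the original report): list the NVIDIA
# Vulkan SPIR-V libraries found in a directory.
# Prints one file name per line; returns 1 when none are present.
find_glvkspirv() {
    local dir="${1:-/usr/lib/x86_64-linux-gnu}"
    local found=1
    local path
    for path in "$dir"/libnvidia-glvkspirv.so.*; do
        # An unmatched glob stays literal, so confirm the file exists.
        [ -e "$path" ] || continue
        basename "$path"
        found=0
    done
    return "$found"
}
```

For example, running find_glvkspirv || echo "Vulkan SPIR-V library missing" as the container command would make a missing library visible in the logs even when nothing else is printed.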
The next step is to use the Docker Compose ECS integration to launch this on AWS ECS. After configuring docker to use an ECS context, I run docker compose convert. I see the following lines in the Launch Configuration section for the EC2 instance:
ImageId: ami-082e298f790f88621
InstanceType: g4dn.xlarge
This tells me that ECS is launching an instance of amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs, which is an Amazon Linux 2 image. The drivers will therefore not be located in /usr/lib/x86_64-linux-gnu/ on the host machine, although the runtime image I'm using in Docker is still Ubuntu. When I boot up an instance with this AMI and run the docker compose file above, I get an error. When I remove the grep and simply list the directory contents, I can see that the following NVIDIA libraries have been loaded into the container:
TestServer_1 | libnvidia-cfg.so.1
TestServer_1 | libnvidia-cfg.so.460.32.03
TestServer_1 | libnvidia-compiler.so.460.32.03
TestServer_1 | libnvidia-eglcore.so.460.32.03
TestServer_1 | libnvidia-encode.so.1
TestServer_1 | libnvidia-encode.so.460.32.03
TestServer_1 | libnvidia-fatbinaryloader.so.440.118.02
TestServer_1 | libnvidia-fbc.so.1
TestServer_1 | libnvidia-fbc.so.460.32.03
TestServer_1 | libnvidia-glcore.so.460.32.03
TestServer_1 | libnvidia-glsi.so.460.32.03
TestServer_1 | libnvidia-ifr.so.1
TestServer_1 | libnvidia-ifr.so.460.32.03
TestServer_1 | libnvidia-ml.so.1
TestServer_1 | libnvidia-ml.so.460.32.03
TestServer_1 | libnvidia-opencl.so.1
TestServer_1 | libnvidia-opencl.so.460.32.03
TestServer_1 | libnvidia-ptxjitcompiler.so.1
TestServer_1 | libnvidia-ptxjitcompiler.so.440.118.02
TestServer_1 | libnvidia-ptxjitcompiler.so.460.32.03
TestServer_1 | libnvidia-tls.so.460.32.03
The Vulkan driver is missing. I don't know Amazon Linux 2 well enough to know if this is an issue with the AMI, or the NVIDIA container toolkit, or ECS or the ECS Agent. I hope I have explained my issue sufficiently. If you can suggest any further troubleshooting steps, I'd be more than happy to try them. If this is not the right place and you think the problem is somewhere else, please kindly point me in the right direction. Many thanks.
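A listing like the one above can also be checked mechanically rather than by eye. The following is a rough sketch under stated assumptions: the function name report_missing_libs is hypothetical, and the set of library base names to require is up to the caller (only libnvidia-glvkspirv is known from this report to be needed for Vulkan).

```shell
# Hypothetical helper: reads library file names on stdin, optionally
# prefixed with a "service | " column as in docker compose output, and
# prints each required base name (given as arguments) that has no
# matching file. Returns 1 if anything is missing.
report_missing_libs() {
    local listing
    local name
    local status=0
    # Strip everything up to the last "|" so only the file name remains.
    listing=$(sed 's/^.*| *//')
    for name in "$@"; do
        if ! printf '%s\n' "$listing" | grep -q "^${name}\.so"; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}
```

Piping the container output into report_missing_libs libnvidia-glcore libnvidia-glvkspirv would, for the listing above, be expected to report only libnvidia-glvkspirv.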
Hi @luc122c,
I launched an instance of amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs and ran nvidia-container-cli -k -d /dev/tty info to get the nvidia-container information:
[root@ip-172-31-39-89 bin]# nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0408 21:41:11.105640 21255 nvc.c:276] initializing library context (version=1.0.0)
I0408 21:41:11.105706 21255 nvc.c:250] using root /
I0408 21:41:11.105721 21255 nvc.c:251] using ldcache /etc/ld.so.cache
I0408 21:41:11.105732 21255 nvc.c:252] using unprivileged user 65534:65534
I0408 21:41:11.106893 21256 nvc.c:186] loading kernel module nvidia
I0408 21:41:11.107067 21256 nvc.c:198] loading kernel module nvidia_uvm
I0408 21:41:11.107191 21256 nvc.c:206] loading kernel module nvidia_modeset
I0408 21:41:11.107447 21257 driver.c:133] starting driver service
I0408 21:41:11.661803 21255 nvc_info.c:433] requesting driver information with ''
I0408 21:41:11.662119 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-tls.so.460.32.03
I0408 21:41:11.662173 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.460.32.03
I0408 21:41:11.662225 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-opencl.so.460.32.03
I0408 21:41:11.662259 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ml.so.460.32.03
I0408 21:41:11.662307 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ifr.so.460.32.03
I0408 21:41:11.662352 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-glsi.so.460.32.03
I0408 21:41:11.662387 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-glcore.so.460.32.03
I0408 21:41:11.662420 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-fbc.so.460.32.03
I0408 21:41:11.662466 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-encode.so.460.32.03
I0408 21:41:11.662511 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-eglcore.so.460.32.03
I0408 21:41:11.662545 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-compiler.so.460.32.03
I0408 21:41:11.662578 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-cfg.so.460.32.03
I0408 21:41:11.662645 21255 nvc_info.c:147] selecting /usr/lib64/libnvcuvid.so.460.32.03
I0408 21:41:11.662744 21255 nvc_info.c:147] selecting /usr/lib64/libcuda.so.460.32.03
I0408 21:41:11.662814 21255 nvc_info.c:147] selecting /usr/lib64/libGLX_nvidia.so.460.32.03
I0408 21:41:11.662848 21255 nvc_info.c:147] selecting /usr/lib64/libGLESv2_nvidia.so.460.32.03
I0408 21:41:11.662880 21255 nvc_info.c:147] selecting /usr/lib64/libGLESv1_CM_nvidia.so.460.32.03
I0408 21:41:11.662920 21255 nvc_info.c:147] selecting /usr/lib64/libEGL_nvidia.so.460.32.03
W0408 21:41:11.662945 21255 nvc_info.c:298] missing library libnvidia-fatbinaryloader.so
From this output, I see that libnvidia-glvkspirv.so.460.32.03 is not among the selected libraries, so it is not being loaded. I am checking whether this issue comes from the ECS AMI side.
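To make the same check repeatable, the "selecting" lines of the debug log can be reduced to a plain list of library names. This is a sketch only: the function name selected_libs is hypothetical, and it assumes the debug output has been captured somewhere readable (for example by passing a file path to -d instead of /dev/tty) and keeps the format shown above.

```shell
# Hypothetical helper: reads nvidia-container-cli debug output on stdin
# and prints the base name of every library the tool selected, sorted.
selected_libs() {
    # Keep only the path that follows "] selecting ", then drop the
    # directory part and sort the result.
    sed -n 's/^.*\] selecting \(.*\)$/\1/p' | sed 's|.*/||' | sort
}
```

With the log saved to a file such as nvc.log (a hypothetical name), selected_libs < nvc.log | grep glvkspirv prints nothing when the Vulkan SPIR-V library was not picked up.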
Hi @luc122c, the ECS-optimized AMIs ship with an old version of libnvidia-container (v1.0.0), which does not load the Vulkan driver. As a workaround, you can install the latest version of nvidia-container-cli from the packages for RHEL-based distributions, using the user data of the instance.
I updated nvidia-container-cli using the following steps on an amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs instance (ami-03b6cac08fe853ac8) with the g4dn.xlarge instance type in us-west-2:
DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.repo | \
sudo tee /etc/yum.repos.d/libnvidia-container.repo
yum remove libnvidia-container nvidia-container-cli
yum install libnvidia-container1
yum install libnvidia-container-tools --disablerepo=\* --enablerepo=libnvidia-container
yum install nvidia-container-runtime-hook
docker run -it --env NVIDIA_DRIVER_CAPABILITIES=all --gpus=all adamrehn/ue4-runtime:18.04-cudagl10.2 sh
$ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
libnvidia-glvkspirv.so.460.32.03
Closing this issue, as this is not a bug in the ECS Agent. Please feel free to re-open the issue if you have more questions.
Hi @mythri-garaga, thank you very much for your investigation and for providing a workaround! I'll update my docker compose file.
This may not be your area, but do you know whether this will be fixed in an upcoming AMI release? Is there somewhere I can open a ticket or subscribe to updates?
@luc122c We are tracking this internally, and we will update this issue when there is a libnvidia-container package update on the ECS-optimized AMIs.