Amazon-ecs-agent: Vulkan drivers not accessible to ECS container on EC2

Created on 25 Mar 2021 · 7 Comments · Source: aws/amazon-ecs-agent

Summary

When running a container which relies on Vulkan, I receive the following error: The NVIDIA driver was unable to open 'libnvidia-glvkspirv.so.460.32.03'. This library is required at run time. See related issue: https://github.com/adamrehn/pixel-streaming-linux/issues/43

Description

I have a containerised UE4 application. When I run this on the 'AWS Deep Learning AMI (Ubuntu 18.04)' on a g4dn.xlarge EC2 instance, it runs fine: the NVIDIA Container Toolkit correctly passes the driver into the container. However, when I use the new Docker Compose CLI to launch the container, the driver doesn't appear to be passed in correctly. Other aspects of the NVIDIA driver work fine; for example, the application can initialise a CUDA context.

When I run docker compose convert, I can confirm that the CloudFormation stack requests the exact same instance type and AMI as I set up manually. On my testing instance, when I list /usr/lib/x86_64-linux-gnu/ on the host, the driver is there. When I list /usr/lib/x86_64-linux-gnu/ inside the container on my testing instance, the driver is also there. Unfortunately, I am not sure how I can check this when the container is run through CloudFormation and the ECS agent.

Expected Behavior

Application runs correctly, as on a manually created instance and container.

Observed Behavior

Application crashes with the following error: The NVIDIA driver was unable to open 'libnvidia-glvkspirv.so.460.32.03'. This library is required at run time.

Environment Details

  • Region: eu-west-2
  • Instance type: g4dn.xlarge
  • AMI: AWS Deep Learning AMI (Ubuntu 18.04) (ami-038c8cb7a0abc78b0)
workaround available

All 7 comments

Hi @luc122c, is it possible for you to share the steps to reproduce the error on our end?

Hi @shubham2892. To debug the issue, I'm using the following command to search for the Vulkan drivers: ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk. I'm testing on the following two setups:

  • Local dev machine: HP Z240 with NVIDIA Quadro K1200 running Ubuntu 20.04
  • EC2 Instance: g4dn.xlarge with Deep Learning AMI (Ubuntu 18.04) Version 42.1 (ami-021bee75c672c1e00)

The NVIDIA System Management Interface (nvidia-smi) command line utility shows the following output for the two machines:

Local machine

$ nvidia-smi
Wed Apr  7 16:59:41 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro K1200        On   | 00000000:01:00.0 Off |                  N/A |
| 39%   34C    P8     1W /  35W |     13MiB /  4043MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       891      G   /usr/lib/xorg/Xorg                  8MiB |
|    0   N/A  N/A      1134      G   /usr/bin/gnome-shell                1MiB |
+-----------------------------------------------------------------------------+

EC2 Instance

$ nvidia-smi
Wed Apr  7 15:59:46 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Now, when I run the command on the host machines, I get the following output:

Local machine

$ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
libnvidia-glvkspirv.so.460.32.03

EC2 Instance

$ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
libnvidia-glvkspirv.so.450.80.02

So the Vulkan driver is available on the host in both cases. The next step is to see if it is available within containers on these machines. To do this, consider the following docker compose file:

version: '3.8'
services:
  TestServer:
    image: adamrehn/ue4-runtime:18.04-cudagl10.2
    command: "sh -c 'ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk'"
    environment:
        - NVIDIA_DRIVER_CAPABILITIES=all
    deploy:
      resources:
        reservations:
          memory: 2Gb
          cpus: "1"
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

This will execute the command in a container from the image adamrehn/ue4-runtime:18.04-cudagl10.2, which is the one I'm using in my application. I'm using the new Docker Compose CLI to run this on both machines. When I do, I get the following output:

Local machine

Attaching to data_TestServer_1
TestServer_1  | libnvidia-glvkspirv.so.460.32.03
data_TestServer_1 exited with code 0

EC2 Instance

Attaching to TestServer_1
TestServer_1  | libnvidia-glvkspirv.so.450.80.02
TestServer_1 exited with code 0

The next step is to use the Docker Compose ECS integration to launch this on AWS ECS. After configuring docker to use an ECS context, I run docker compose convert. I see the following lines in the Launch Configuration section for the EC2 instance:

ImageId: ami-082e298f790f88621
InstanceType: g4dn.xlarge

This tells me that ECS is launching an instance of amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs, which is an Amazon Linux 2 image. The drivers will therefore not be located in /usr/lib/x86_64-linux-gnu/ on the host machine, although the runtime image I'm using in Docker is still Ubuntu. When I boot up an instance with this AMI and run the docker compose file above, I get an error. When I remove the grep and simply list the drivers, I can see that the following NVIDIA drivers have been loaded into the container:

TestServer_1  | libnvidia-cfg.so.1
TestServer_1  | libnvidia-cfg.so.460.32.03
TestServer_1  | libnvidia-compiler.so.460.32.03
TestServer_1  | libnvidia-eglcore.so.460.32.03
TestServer_1  | libnvidia-encode.so.1
TestServer_1  | libnvidia-encode.so.460.32.03
TestServer_1  | libnvidia-fatbinaryloader.so.440.118.02
TestServer_1  | libnvidia-fbc.so.1
TestServer_1  | libnvidia-fbc.so.460.32.03
TestServer_1  | libnvidia-glcore.so.460.32.03
TestServer_1  | libnvidia-glsi.so.460.32.03
TestServer_1  | libnvidia-ifr.so.1
TestServer_1  | libnvidia-ifr.so.460.32.03
TestServer_1  | libnvidia-ml.so.1
TestServer_1  | libnvidia-ml.so.460.32.03
TestServer_1  | libnvidia-opencl.so.1
TestServer_1  | libnvidia-opencl.so.460.32.03
TestServer_1  | libnvidia-ptxjitcompiler.so.1
TestServer_1  | libnvidia-ptxjitcompiler.so.440.118.02
TestServer_1  | libnvidia-ptxjitcompiler.so.460.32.03
TestServer_1  | libnvidia-tls.so.460.32.03

The Vulkan driver is missing. I don't know Amazon Linux 2 well enough to know if this is an issue with the AMI, or the NVIDIA container toolkit, or ECS or the ECS Agent. I hope I have explained my issue sufficiently. If you can suggest any further troubleshooting steps, I'd be more than happy to try them. If this is not the right place and you think the problem is somewhere else, please kindly point me in the right direction. Many thanks.
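
The manual check used throughout this comment (listing the library directory and looking for libnvidia-glvkspirv) can be wrapped in a small script so that the result is explicit. This is only a minimal sketch: the function name check_vulkan_lib and its optional directory argument are hypothetical, and the default path is the Ubuntu library directory mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA or ECS tooling): report whether
# a libnvidia-glvkspirv.so.<version> file is present in a directory.
# Default directory: the Ubuntu multiarch path used in this thread.
check_vulkan_lib() {
    dir="${1:-/usr/lib/x86_64-linux-gnu}"
    for f in "$dir"/libnvidia-glvkspirv.so.*; do
        # If the glob matches nothing, the pattern stays literal and -e fails.
        if [ -e "$f" ]; then
            echo "found: ${f##*/}"
            return 0
        fi
    done
    echo "missing: libnvidia-glvkspirv.so.* not in $dir"
    return 1
}
```

Calling check_vulkan_lib with no argument inside the container checks the default path; the non-zero return code on a missing library makes the container exit with an error, which is easier to spot in ECS task status than an empty listing.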

Hi @luc122c,

I launched an instance of amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs and ran nvidia-container-cli -k -d /dev/tty info to get the nvidia-container info:

[root@ip-172-31-39-89 bin]# nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0408 21:41:11.105640 21255 nvc.c:276] initializing library context (version=1.0.0)
I0408 21:41:11.105706 21255 nvc.c:250] using root /
I0408 21:41:11.105721 21255 nvc.c:251] using ldcache /etc/ld.so.cache
I0408 21:41:11.105732 21255 nvc.c:252] using unprivileged user 65534:65534
I0408 21:41:11.106893 21256 nvc.c:186] loading kernel module nvidia
I0408 21:41:11.107067 21256 nvc.c:198] loading kernel module nvidia_uvm
I0408 21:41:11.107191 21256 nvc.c:206] loading kernel module nvidia_modeset
I0408 21:41:11.107447 21257 driver.c:133] starting driver service
I0408 21:41:11.661803 21255 nvc_info.c:433] requesting driver information with ''
I0408 21:41:11.662119 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-tls.so.460.32.03
I0408 21:41:11.662173 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.460.32.03
I0408 21:41:11.662225 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-opencl.so.460.32.03
I0408 21:41:11.662259 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ml.so.460.32.03
I0408 21:41:11.662307 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-ifr.so.460.32.03
I0408 21:41:11.662352 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-glsi.so.460.32.03
I0408 21:41:11.662387 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-glcore.so.460.32.03
I0408 21:41:11.662420 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-fbc.so.460.32.03
I0408 21:41:11.662466 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-encode.so.460.32.03
I0408 21:41:11.662511 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-eglcore.so.460.32.03
I0408 21:41:11.662545 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-compiler.so.460.32.03
I0408 21:41:11.662578 21255 nvc_info.c:147] selecting /usr/lib64/libnvidia-cfg.so.460.32.03
I0408 21:41:11.662645 21255 nvc_info.c:147] selecting /usr/lib64/libnvcuvid.so.460.32.03
I0408 21:41:11.662744 21255 nvc_info.c:147] selecting /usr/lib64/libcuda.so.460.32.03
I0408 21:41:11.662814 21255 nvc_info.c:147] selecting /usr/lib64/libGLX_nvidia.so.460.32.03
I0408 21:41:11.662848 21255 nvc_info.c:147] selecting /usr/lib64/libGLESv2_nvidia.so.460.32.03
I0408 21:41:11.662880 21255 nvc_info.c:147] selecting /usr/lib64/libGLESv1_CM_nvidia.so.460.32.03
I0408 21:41:11.662920 21255 nvc_info.c:147] selecting /usr/lib64/libEGL_nvidia.so.460.32.03
W0408 21:41:11.662945 21255 nvc_info.c:298] missing library libnvidia-fatbinaryloader.so

The log shows that libnvidia-glvkspirv.so.460.32.03 is not being loaded. I am checking whether this issue comes from the ECS AMI side.
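
Scanning a debug log like the one above by eye is error-prone, so the "selecting" lines can be filtered with standard tools. The sketch below assumes the log has been saved to a file or is piped on stdin, and that its lines follow the format shown above; the function names selected_libs and log_selects are hypothetical.

```shell
#!/bin/sh
# Print the file name of every library that the nvidia-container-cli debug
# log reports as "selecting". Reads the log on stdin, one name per line.
selected_libs() {
    sed -n 's|.*\] selecting .*/\(lib[^/]*\)$|\1|p'
}

# Succeed if the log on stdin selects a library whose name starts with $1,
# for example: log_selects libnvidia-glvkspirv < saved-debug.log
log_selects() {
    selected_libs | grep "^$1\.so" >/dev/null
}
```

With the log above as input, log_selects libcuda would succeed and log_selects libnvidia-glvkspirv would fail, which matches the observation that the Vulkan library is never selected.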

Hi @luc122c, the ECS-optimized AMIs ship with an old version of libnvidia-container (v1.0.0), which does not load the Vulkan driver. The workaround is to install the latest version of nvidia-container-cli from the repository for RHEL-based distributions, for example through the instance's user data.

Tested the nvidia-container-cli update using following steps:

  1. Launched an amzn2-ami-ecs-gpu-hvm-2.0.20210331-x86_64-ebs instance (ami-03b6cac08fe853ac8) with g4dn.xlarge instance type in us-west-2
  2. Downloaded the libnvidia-container.repo yum repository
    DIST=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/libnvidia-container/$DIST/libnvidia-container.repo | \
      sudo tee /etc/yum.repos.d/libnvidia-container.repo
  3. Removed the old version of nvidia-container-cli to avoid a conflict
    yum remove libnvidia-container
  4. Installed the following packages. This installs version 1.3.3 of nvidia-container-cli
    yum install libnvidia-container1
    yum install libnvidia-container-tools --disablerepo=\* --enablerepo=libnvidia-container
    yum install nvidia-container-runtime-hook
  5. Ran an Unreal Engine container
    docker run -it --env NVIDIA_DRIVER_CAPABILITIES=all --gpus=all adamrehn/ue4-runtime:18.04-cudagl10.2 sh
  6. Listed the drivers from inside the container
    $ ls /usr/lib/x86_64-linux-gnu/ | grep libnvidia-glvk
    libnvidia-glvkspirv.so.460.32.03
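
Since the suggested workaround is to apply these steps through user data, steps 2 to 4 could be collected into one script. The following is an untested sketch, not an official ECS or NVIDIA script. The function names, the DRY_RUN switch and the OS_RELEASE_FILE override are hypothetical additions for illustration, the -y flags are added because user data runs non-interactively, and curl -o replaces the sudo tee pipe on the assumption that user data runs as root.

```shell
#!/bin/bash
# Sketch of a user-data script for the workaround above (assumptions noted
# in the accompanying text). Nothing runs until the function is called.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

upgrade_libnvidia_container() {
    # OS_RELEASE_FILE is a hypothetical override used only for testing.
    os_release="${OS_RELEASE_FILE:-/etc/os-release}"
    # Same distribution string as step 2: ID followed by VERSION_ID.
    dist=$(. "$os_release"; echo "$ID$VERSION_ID")
    repo_url="https://nvidia.github.io/libnvidia-container/$dist/libnvidia-container.repo"

    # Step 2: download the yum repository definition.
    run curl -s -L -o /etc/yum.repos.d/libnvidia-container.repo "$repo_url"
    # Step 3: remove the old package to avoid a conflict.
    run yum remove -y libnvidia-container
    # Step 4: install the newer packages.
    run yum install -y libnvidia-container1
    run yum install -y libnvidia-container-tools --disablerepo='*' --enablerepo=libnvidia-container
    run yum install -y nvidia-container-runtime-hook
}
```

In real user data the script would end with a call to upgrade_libnvidia_container. Setting DRY_RUN=1 first prints the commands without running them, which is a cheap way to review the sequence before baking it into a launch configuration.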

Closing this issue as this is not a bug in the ECS Agent. Please feel free to re-open the issue if you have more questions.

Hi @mythri-garaga, thank you very much for your investigation and for providing a workaround! I'll update my docker compose file.

This may not be your area but do you know if this will be fixed in an upcoming AMI release? Is there somewhere I can open a ticket or subscribe to updates?

@luc122c We are tracking this internally and we will update here when there is a libnvidia-container package update on ECS optimized AMIs.
