Amazon-ecs-agent: [feature request] Support nvidia-docker on GPU instance

Created on 3 Jan 2017 · 9 Comments · Source: aws/amazon-ecs-agent

Recently we have been manually running containers with nvidia-docker on GPU instances. Since nvidia-docker should be a thin wrapper around docker, it would be great if it could be supported natively by ECS.

chenliu0831

👍5

Most helpful comment

@richardpen Thanks for responding. If I understand correctly, nvidia-docker can make use of the GPU driver on the host, whereas #433 and Amazon's sample require the driver to be built into the container. So IMHO this issue isn't strictly the same as #433, but those are definitely workarounds.

chenliu0831 on 8 Feb 2017

👍9

All 9 comments

@chenliu0831 Thanks for reaching out to us. This seems to be the same as #433; please track that issue for updates. You can also find instructions on running GPU workloads on ECS with CloudFormation in the post "Orchestrating GPU-Accelerated Workloads on Amazon ECS".

Thanks,
Richard

richardpen on 3 Jan 2017

I second this. There are indeed ways to work around this. However, giving the end user the option to launch an image using nvidia-docker (for GPU instances) would save a lot of development time and reduce the potential for errors, particularly when drivers inevitably change.
Thank you.

milesgranger on 14 Jun 2017

👍5

Can we just alias nvidia-docker docker in the cloud_init / boot hook using user data?

AKSHAYUBHAT on 1 Oct 2017

> Can we just alias nvidia-docker docker in the cloud_init / boot hook using user data?

That won't have the effect of making ECS support nvidia-docker. ECS integrates with the Docker Remote API, while nvidia-docker is a wrapper around the Docker CLI.

samuelkarp on 2 Oct 2017

@samuelkarp You are right.
However, would just mounting the volume with the NVIDIA drivers (on a custom AMI that has nvidia-docker and the ECS agent installed), as described here, work?

http://docs.aws.amazon.com/batch/latest/userguide/example-job-definitions.html#example-test-gpu

{
    "containerProperties": {
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "readOnly": false,
            "containerPath": "/usr/local/nvidia"
        }],
        "image": "nvidia/cuda",
        "vcpus": 2,
        "command": ["nvidia-smi"],
        "volumes": [{
            "host": {"sourcePath": "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"},
            "name": "nvidia"
        }],
        "memory": 2000,
        "privileged": true,
        "ulimits": []
    },
    "type": "container",
    "jobDefinitionName": "nvidia-smi"
}

My assumption was that all nvidia-docker did was add a volume entry for the drivers. Since ECS and Batch share the same architecture, the solution for the latter should work for the former as well.
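To make the Batch-to-ECS mapping concrete, here is a minimal Python sketch that rewrites the Batch job definition above as an ECS task definition dictionary. It only reshapes the data. Whether the resulting task actually sees the GPU still depends on the assumption stated in this comment. The container name "main" is a hypothetical placeholder, and the vCPU conversion uses the ECS convention of 1024 CPU units per vCPU.

```python
import json

# Batch job definition copied from the comment above.
BATCH_JOB_DEFINITION = {
    "containerProperties": {
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "readOnly": False,
            "containerPath": "/usr/local/nvidia",
        }],
        "image": "nvidia/cuda",
        "vcpus": 2,
        "command": ["nvidia-smi"],
        "volumes": [{
            "host": {"sourcePath": "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"},
            "name": "nvidia",
        }],
        "memory": 2000,
        "privileged": True,
        "ulimits": [],
    },
    "type": "container",
    "jobDefinitionName": "nvidia-smi",
}

CPU_UNITS_PER_VCPU = 1024  # ECS expresses CPU in units; 1 vCPU = 1024 units.


def batch_to_ecs_task_definition(job_def, container_name="main"):
    """Reshape a Batch job definition into an ECS task definition dict.

    container_name is a hypothetical placeholder; ECS requires some name.
    """
    props = job_def["containerProperties"]
    container = {
        "name": container_name,
        "image": props["image"],
        "cpu": props["vcpus"] * CPU_UNITS_PER_VCPU,
        "memory": props["memory"],  # MiB in both Batch and ECS
        "command": props["command"],
        "privileged": props["privileged"],
        "mountPoints": props["mountPoints"],  # same keys in both services
        "essential": True,
    }
    return {
        "family": job_def["jobDefinitionName"],
        "containerDefinitions": [container],
        "volumes": props["volumes"],  # host volume with the driver files
    }


if __name__ == "__main__":
    print(json.dumps(batch_to_ecs_task_definition(BATCH_JOB_DEFINITION), indent=2))
```

The resulting dictionary has the shape that boto3's ECS `register_task_definition` call accepts as keyword arguments, although that call is not exercised here.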

AKSHAYUBHAT on 2 Oct 2017

You can find the public AMI "ami-b1c106cb" in the us-east-1 region (I updated the AMI since the last one had an outdated ecs-agent). Note that you will have to "update" the ECS agent through the AWS web console (it takes only a single click).

I was able to update the AWS Deep Learning AMI using the instructions in the AWS Batch documentation, with a couple of changes, and got it to work with the ECS agent.
http://docs.aws.amazon.com/batch/latest/userguide/batch-gpu-ami.html
The changes are listed here, followed by the resulting script:

Add ec2-user to the docker group (this might not be required).
Change nvidia/cuda to nvidia/cuda:8.0, since the default nvidia/cuda image uses CUDA 9.0, which is not compatible with the current drivers for p2 instances.

Ignore or remove the EFS mount lines below. You need to replace the {} in the ECS_CLUSTER line with the name of your ECS cluster.

Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Install nfs-utils
cloud-init-per once yum_update yum update -y
cloud-init-per once install_nfs_utils yum install -y nfs-utils

# Create /efs folder
cloud-init-per once mkdir_efs mkdir /efs

# Mount /efs
cloud-init-per once mount_efs echo -e '{}:/ /efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 0 0' >> /etc/fstab
mount -a

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Set any ECS agent configuration options
echo "ECS_CLUSTER={}" >> /etc/ecs/ecs.config
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875
--==BOUNDARY==--
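
If you would rather generate this user data than edit the {} by hand, the following Python sketch builds an equivalent multipart document with the standard library's email package. It is an illustration only: the EFS lines are left out, as suggested above, and the cluster name "my-cluster" in the usage line is a hypothetical placeholder.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(cluster_name: str) -> str:
    """Return multipart user data with the ECS cluster name filled in."""
    if not cluster_name:
        raise ValueError("cluster_name must not be empty")

    # Boot hook part: the EFS lines from the original are omitted on purpose.
    boothook = "cloud-init-per once yum_update yum update -y\n"

    # Shell script part: same commands as in the user data above.
    script = (
        "#!/bin/bash\n"
        "# Set any ECS agent configuration options\n"
        f'echo "ECS_CLUSTER={cluster_name}" >> /etc/ecs/ecs.config\n'
        "sudo nvidia-smi -pm 1\n"
        "sudo nvidia-smi --auto-boost-default=0\n"
        "sudo nvidia-smi -ac 2505,875\n"
    )

    message = MIMEMultipart("mixed", boundary="==BOUNDARY==")
    message.attach(MIMEText(boothook, "cloud-boothook", "us-ascii"))
    message.attach(MIMEText(script, "x-shellscript", "us-ascii"))
    return message.as_string()


if __name__ == "__main__":
    # "my-cluster" is a placeholder; use the name of your own ECS cluster.
    print(build_user_data("my-cluster"))
```

The generated headers differ slightly from the hand-written version (each part carries its own MIME-Version and Content-Transfer-Encoding header), but the part types and boundary are the same.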

After this, I ensured that the nvidia-driver volume and privileged mode were correctly added to the task

e.g.

{
    "containerProperties": {
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "readOnly": false,
            "containerPath": "/usr/local/nvidia"
        }],
        ...
        "volumes": [{
            "host": {"sourcePath": "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"},
            "name": "nvidia"
        }],
        ...
        "privileged": true
    },
    "type": "container"
}

After running a job I manually verified that GPU was being used by nvidia-smi.
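
The comment does not say exactly how the check was done. One way to script a similar check is sketched below in Python: it calls nvidia-smi with its CSV query options and parses the result. The helper names and the "in use" rule (non-zero utilization or memory) are assumptions made for illustration, and the parser does not handle fields that nvidia-smi reports as unsupported.

```python
import subprocess

# nvidia-smi's CSV query mode: one line per GPU, no header, no unit suffixes.
QUERY_COMMAND = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text):
    """Parse 'index, utilization, memory' lines into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, utilization, memory = (field.strip() for field in line.split(","))
        rows.append({
            "index": int(index),
            "utilization_pct": int(utilization),
            "memory_used_mib": int(memory),
        })
    return rows


def gpu_in_use(rows):
    """Assumed rule: a GPU counts as in use if it is busy or holds memory."""
    return any(r["utilization_pct"] > 0 or r["memory_used_mib"] > 0 for r in rows)


def query_gpus():
    """Run nvidia-smi on the host and return the parsed rows."""
    result = subprocess.run(QUERY_COMMAND, check=True, capture_output=True, text=True)
    return parse_gpu_usage(result.stdout)


if __name__ == "__main__":
    # Requires an instance with the NVIDIA driver installed.
    gpus = query_gpus()
    print(gpus)
    print("GPU in use:", gpu_in_use(gpus))
```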

AKSHAYUBHAT on 3 Oct 2017

Edited the instructions above with a new AMI that has the ECS agent pre-updated.

AKSHAYUBHAT on 5 Oct 2017

@AKSHAYUBHAT that's great. By the way, I'm using these flags to avoid privileged mode (device support was recently added to ECS; see the device flag issues above):

-v /var/lib/nvidia-docker/volumes/nvidia_driver/latest:/usr/local/nvidia \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm
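
For use in an ECS task definition rather than on the docker command line, the flags above can be translated into the corresponding task definition fields. The Python sketch below does that translation under some assumptions: it only understands `-v` and `--device` with `host:container` values, the volume name prefix "nvidia" is a hypothetical placeholder, and device permissions are left at the service default.

```python
import json

# Flags copied from the comment above, split into (flag, value) pairs.
DOCKER_FLAGS = [
    "-v", "/var/lib/nvidia-docker/volumes/nvidia_driver/latest:/usr/local/nvidia",
    "--device", "/dev/nvidiactl:/dev/nvidiactl",
    "--device", "/dev/nvidia0:/dev/nvidia0",
    "--device", "/dev/nvidia-uvm:/dev/nvidia-uvm",
]


def flags_to_ecs(flags, volume_prefix="nvidia"):
    """Translate -v/--device flags into ECS task definition fragments.

    Returns (container_fragment, task_volumes). volume_prefix is a
    hypothetical naming choice; ECS only needs the names to match.
    """
    devices, mount_points, volumes = [], [], []
    pairs = iter(flags)
    for flag in pairs:
        host_path, container_path = next(pairs).split(":", 1)
        if flag == "--device":
            devices.append({"hostPath": host_path, "containerPath": container_path})
        elif flag == "-v":
            name = f"{volume_prefix}-{len(volumes)}"
            volumes.append({"name": name, "host": {"sourcePath": host_path}})
            mount_points.append({
                "sourceVolume": name,
                "containerPath": container_path,
                "readOnly": False,
            })
        else:
            raise ValueError(f"unsupported flag: {flag}")
    container_fragment = {
        "linuxParameters": {"devices": devices},
        "mountPoints": mount_points,
    }
    return container_fragment, volumes


if __name__ == "__main__":
    fragment, task_volumes = flags_to_ecs(DOCKER_FLAGS)
    print(json.dumps({"container": fragment, "volumes": task_volumes}, indent=2))
```

The first fragment belongs inside a container definition and the volume list at the task level. No "privileged" key appears in either.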

chenliu0831 on 5 Jan 2018
