Recently we have been manually running containers with nvidia-docker on GPU instances. Since nvidia-docker is meant to be a thin wrapper around docker, it would be great if ECS supported it natively.
@chenliu0831 Thanks for reaching out to us. This seems to be the same as #433, so please track updates there. You can also find instructions on running GPU workloads on ECS with CloudFormation in the post "Orchestrating GPU-Accelerated Workloads on Amazon ECS".
Thanks,
Richard
@richardpen Thanks for responding. If I understand correctly, nvidia-docker can use the GPU driver on the host, whereas #433 and Amazon's sample require the driver to be built into the container. So IMHO this issue isn't strictly the same as #433, though those are definitely workarounds.
I second this. There are indeed ways to work around it. However, giving the end user the option to launch an image with nvidia-docker (on GPU instances) would save a lot of development time and reduce the potential for errors, particularly when drivers inevitably change.
Thank you.
Can we just alias nvidia-docker docker in the cloud_init / boot hook using user data?
> Can we just alias nvidia-docker docker in the cloud_init / boot hook using user data?
That won't have the effect of making ECS support nvidia-docker. ECS integrates with the Docker Remote API, while nvidia-docker is a wrapper around the Docker CLI.
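To make the distinction concrete, here is a minimal sketch of the kind of request body a Remote API client sends when it creates a container. It assumes the driver volume path and device nodes quoted elsewhere in this thread; the helper names (`gpu_host_config`, `create_container_body`) are made up for illustration, and this is not what the ECS agent actually sends. The point is only that the volume and device settings would have to appear in the API payload itself, since a CLI alias never gets a chance to add them.

```python
import json

# Assumed values, copied from the flags quoted elsewhere in this thread.
DRIVER_VOLUME = "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"
GPU_DEVICES = ("/dev/nvidiactl", "/dev/nvidia0", "/dev/nvidia-uvm")


def gpu_host_config(driver_volume=DRIVER_VOLUME, devices=GPU_DEVICES):
    """Build the HostConfig fragment of a create-container request (hypothetical helper)."""
    return {
        # Same effect as `-v <driver_volume>:/usr/local/nvidia` on the CLI.
        "Binds": [f"{driver_volume}:/usr/local/nvidia"],
        # Same effect as one `--device <path>:<path>` flag per device node.
        "Devices": [
            {"PathOnHost": d, "PathInContainer": d, "CgroupPermissions": "rwm"}
            for d in devices
        ],
    }


def create_container_body(image, command):
    """Assemble a full request body for POST /containers/create (hypothetical helper)."""
    return {"Image": image, "Cmd": list(command), "HostConfig": gpu_host_config()}


body = create_container_body("nvidia/cuda", ["nvidia-smi"])
print(json.dumps(body, indent=2))
```

An alias on the host only changes what a shell user types, so it has no influence on a payload like this one.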
@samuelkarp You are right.
However, would just mounting the volume with the nvidia drivers (on a custom AMI that has nvidia-docker and the ECS agent installed), as described here, work?
http://docs.aws.amazon.com/batch/latest/userguide/example-job-definitions.html#example-test-gpu
{
    "containerProperties": {
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "readOnly": false,
            "containerPath": "/usr/local/nvidia"
        }],
        "image": "nvidia/cuda",
        "vcpus": 2,
        "command": ["nvidia-smi"],
        "volumes": [{
            "host": {"sourcePath": "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"},
            "name": "nvidia"
        }],
        "memory": 2000,
        "privileged": true,
        "ulimits": []
    },
    "type": "container",
    "jobDefinitionName": "nvidia-smi"
}
My assumption was that all nvidia-docker did was add a volume entry for the drivers. Since ECS and Batch share the same architecture, the solution for the latter should work for the former as well.
I was able to successfully update the AWS Deep Learning AMI using the instructions in the AWS Batch documentation, with a couple of changes, and got it to work with the ECS agent.
http://docs.aws.amazon.com/batch/latest/userguide/batch-gpu-ami.html
The resulting script is below.
Ignore or remove the EFS mount lines. You need to replace the {} with the name of your ECS cluster.
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
Content-Type: text/cloud-boothook; charset="us-ascii"

# Install nfs-utils
cloud-init-per once yum_update yum update -y
cloud-init-per once install_nfs_utils yum install -y nfs-utils
# Create /efs folder
cloud-init-per once mkdir_efs mkdir /efs
# Mount /efs
cloud-init-per once mount_efs echo -e '{}:/ /efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 0 0' >> /etc/fstab
mount -a

--==BOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Set any ECS agent configuration options
echo "ECS_CLUSTER={}" >> /etc/ecs/ecs.config
sudo nvidia-smi -pm 1
sudo nvidia-smi --auto-boost-default=0
sudo nvidia-smi -ac 2505,875
--==BOUNDARY==--
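For anyone who would rather generate this user data than edit the {} placeholder by hand, the following is a rough sketch using Python's standard `email` package. It covers only the shell-script part (the EFS boothook is left out, as suggested above), drops `sudo` on the assumption that user data scripts already run as root, and the function name `build_user_data` and the cluster name `my-gpu-cluster` are made up for illustration.

```python
import email
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(cluster_name):
    """Render multipart user data that registers the instance with an ECS cluster."""
    script = "\n".join([
        "#!/bin/bash",
        "# Set any ECS agent configuration options",
        f'echo "ECS_CLUSTER={cluster_name}" >> /etc/ecs/ecs.config',
        # nvidia-smi settings copied unchanged from the script in this thread.
        "nvidia-smi -pm 1",
        "nvidia-smi --auto-boost-default=0",
        "nvidia-smi -ac 2505,875",
        "",
    ])
    message = MIMEMultipart("mixed", boundary="==BOUNDARY==")
    # Produces a part with Content-Type: text/x-shellscript; charset="us-ascii".
    message.attach(MIMEText(script, "x-shellscript", "us-ascii"))
    return message.as_string()


user_data = build_user_data("my-gpu-cluster")  # hypothetical cluster name

# Sanity check: parse the result back and inspect the single part.
parsed = email.message_from_string(user_data)
parts = parsed.get_payload()
print(parsed.get_content_type(), len(parts), parts[0].get_content_type())
```

Building the message with a MIME library rather than string concatenation keeps the blank lines between headers and bodies correct, which is easy to get wrong when pasting.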
After this I ensured that the nvidia driver volume and privileged mode were correctly added to the task, e.g.
{
    "containerProperties": {
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "readOnly": false,
            "containerPath": "/usr/local/nvidia"
        }],
        ...
        "volumes": [{
            "host": {"sourcePath": "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"},
            "name": "nvidia"
        }],
        ...
        "privileged": true,
    },
    "type": "container",
}
After running a job, I manually verified with nvidia-smi that the GPU was being used.
I edited the instructions above to use the new AMI, which has the ECS agent pre-updated.
@AKSHAYUBHAT that's great. Btw, I'm using these flags to avoid privileged mode (device support was recently added to ECS; see the device flag issues above):
-v /var/lib/nvidia-docker/volumes/nvidia_driver/latest:/usr/local/nvidia \
--device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia-uvm:/dev/nvidia-uvm
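As a rough sketch of how those flags might translate into an ECS task definition without `privileged`, the snippet below builds one as a Python dict using the `mountPoints`, `volumes`, and `linuxParameters.devices` fields. It assumes the agent and Docker versions on the instance support device mappings; the function names, the family name `nvidia-smi`, and the memory value are illustrative and not taken from any official sample.

```python
import json

# Host path of the driver volume, copied from the flags above.
DRIVER_VOLUME = "/var/lib/nvidia-docker/volumes/nvidia_driver/latest"


def gpu_container_definition(name, image, command, gpu_indices=(0,)):
    """Container definition exposing the driver volume and GPU device nodes."""
    device_paths = ["/dev/nvidiactl", "/dev/nvidia-uvm"]
    device_paths += [f"/dev/nvidia{i}" for i in gpu_indices]
    return {
        "name": name,
        "image": image,
        "command": list(command),
        "memory": 2000,  # illustrative value, same as the Batch example
        "mountPoints": [{
            "sourceVolume": "nvidia",
            "containerPath": "/usr/local/nvidia",
            "readOnly": False,
        }],
        "linuxParameters": {
            "devices": [
                {"hostPath": p, "containerPath": p,
                 "permissions": ["read", "write", "mknod"]}
                for p in device_paths
            ],
        },
    }


def gpu_task_definition(family, container):
    """Wrap a container definition with the host volume it mounts."""
    return {
        "family": family,
        "containerDefinitions": [container],
        "volumes": [{"name": "nvidia", "host": {"sourcePath": DRIVER_VOLUME}}],
    }


container = gpu_container_definition("nvidia-smi", "nvidia/cuda", ["nvidia-smi"])
task = gpu_task_definition("nvidia-smi", container)
print(json.dumps(task, indent=2))
```

The printed JSON could then be registered with whatever tooling you already use; instances with more than one GPU would need each `/dev/nvidiaN` node listed via `gpu_indices`.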