Containers-roadmap: [ECS] How to share a single GPU with multiple containers

Created on 7 Mar 2019 · 6 comments · Source: aws/containers-roadmap

Summary

I'd like to share the single GPU of a p3.2xlarge instance with multiple containers in the same task.

Description

In the ECS task definition it's not possible to indicate that a single GPU can be shared between containers (or to distribute the GPU resource across multiple containers, as is possible with CPU units).

I have multiple containers that require a GPU, but not at the same time. Is there a way to run them in a single task on the same instance?
I've tried leaving the GPU resource blank, but then the GPU device is not visible to the container.

Labels: ECS, Proposed

All 6 comments

Hey, we don't have support for sharing a single GPU with multiple containers right now. We have marked it as a feature request.

For future reference, my current workaround to have multiple containers share a single GPU:

  1. On a running ECS GPU-optimized instance, make the nvidia runtime the default for dockerd by adding --default-runtime nvidia to the OPTIONS variable in /etc/sysconfig/docker (a sketch follows this list)
  2. Save the instance to a new AMI
  3. In CloudFormation, go to the stack created by the ECS cluster wizard and update the EcsAmiId field in the initial template
  4. Restart your services
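
A minimal sketch of step 1, assuming the stock OPTIONS line in /etc/sysconfig/docker on the ECS GPU-optimized AMI (inspect the file on your AMI release before editing):

```
# Step 1: prepend --default-runtime nvidia to the Docker daemon options.
# The exact OPTIONS contents vary between AMI releases, so check the file first.
sudo sed -i 's/^OPTIONS="/OPTIONS="--default-runtime nvidia /' /etc/sysconfig/docker
sudo systemctl restart docker

# Verify the change before saving the instance as a new AMI (step 2).
docker info | grep -i 'default runtime'
```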

Since the default runtime is now nvidia, all containers can access the GPU. You can leave the GPU field empty in the task definition wizard (or set it to 1 for just one container to make sure the task is placed on a GPU instance).
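
To illustrate that task definition shape, here is a hypothetical container-definitions fragment (names, images, and memory values are placeholders): the first container reserves the GPU so the task is placed on a GPU instance, while the second declares no GPU but can still see the device because nvidia is the default runtime on the custom AMI.

```
# Hypothetical task definition with two containers sharing the instance's GPU.
cat > container-definitions.json <<'EOF'
[
  {
    "name": "inference-a",
    "image": "my-registry/inference-a:latest",
    "memory": 2048,
    "essential": true,
    "resourceRequirements": [
      { "type": "GPU", "value": "1" }
    ]
  },
  {
    "name": "inference-b",
    "image": "my-registry/inference-b:latest",
    "memory": 2048,
    "essential": true
  }
]
EOF

# Register it; GPU reservations require the EC2 launch type.
aws ecs register-task-definition \
  --family gpu-shared-task \
  --requires-compatibilities EC2 \
  --container-definitions file://container-definitions.json
```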

The major drawback of this workaround is, of course, forking the standard AMI.

@robvanderleek: thanks for outlining this workaround for now =]

@robvanderleek We have a solution for EKS now. Please let us know if you are interested in it.

Hi @Jeffwan

Thanks for the notification, but we are happy with what ECS offers in general. Our inference cluster is running fine on ECS, although we have a custom AMI with the nvidia-docker hack.

Do you expect this solution to also become available for ECS?

@robvanderleek This is implemented as a device plugin in Kubernetes, so I doubt it can be used in ECS directly. But the overall GPU-sharing approach is similar, and I think ECS could adopt a similar solution.
