Amazon-ecs-agent: ECS is broken for larger sized containers

Created on 30 Jun 2018 · 13 Comments · Source: aws/amazon-ecs-agent

Related to #716.

I am getting this error message when trying to run my task:

DockerGoClient: failed to pull image 508071789853.dkr.ecr.ap-southeast-2.amazonaws.com/docker-buildkite-agent: write /var/lib/docker/tmp/GetImageBlob721002426: no space left on device

The issue is that the root partition is being used for temp files instead of the EBS.

Note: the agent could symlink between /var/lib/docker and xvdc hence why it is raised here.

kind/question

nadenf

Most helpful comment

Hi @baank,

Thanks for reaching out! You are correct, layers are pulled first to a temporary location on the root volume before being applied into the layer store located on the secondary volume (/dev/xvdcz). Docker does this as the pull process consists of two parts: downloading the compressed tars that represent each layer and then uncompressing/applying those tars into the layer store.
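
Because the compressed layers are staged on the root volume before they reach the layer store, a pull can fail there even when the secondary volume has plenty of room. A minimal Python sketch of a pre-pull check, assuming the staging directory is /var/lib/docker/tmp (the path shown in the error message above); the helper name `has_room_for_pull` and the 6 GiB figure are hypothetical:

```python
import os
import shutil


def has_room_for_pull(staging_dir: str, compressed_bytes: int) -> bool:
    """Return True if the filesystem holding staging_dir has at least
    compressed_bytes free."""
    # Walk up to the nearest existing ancestor so the check also works
    # when the staging directory has not been created yet.
    path = os.path.abspath(staging_dir)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return shutil.disk_usage(path).free >= compressed_bytes


if __name__ == "__main__":
    # Hypothetical 6 GiB of compressed layers; adjust to your own image.
    needed = 6 * 1024**3
    print(has_room_for_pull("/var/lib/docker/tmp", needed))
```

This only covers the download stage; the uncompressed layers still need space in the layer store afterwards.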

The ECS-optimized AMI comes with an 8 GiB root volume and 22 GiB secondary volume for layer storage. The root volume is used both for temporary storage during the pull process as well as the backing storage for any volumes you create within your container or task. We stay under a 30 GiB total size so that customers who are just trying AWS for the first time can stay within the EBS free tier. These sizes are just defaults though; you can adjust them as you launch your instance (either directly through a run-instances call or through your Launch Configuration).

> Note: the agent could symlink between /var/lib/docker and xvdc hence why it is raised here.

We configure layer storage in the ECS-optimized AMI to use the devicemapper "direct-lvm" mode. The secondary volume (/dev/xvdcz) is passed to LVM as a raw device and is not mounted into the host's mount namespace. While we could have symlinked /var/lib/docker onto /dev/xvdcz, that would require operating in the "loop-lvm" mode which has significant performance drawbacks under certain workloads.

ECS provides a managed container orchestration service where the instances are fully under your control; ECS provides a friendly API, manages the container lifecycle, and integrates deeply with other AWS services. We provide defaults that we believe are reasonable and serve the majority of use-cases, while allowing you enough control to change these defaults when they're insufficient for your use-case. Customizing the size of the EBS volumes allocated to your instances is possible through the AWS console when you launch your instance or by changing the BlockDeviceMapping (when using the API or a CloudFormation template).

AWS Fargate provides a fully-managed container runtime environment, which might be closer to what you're looking for if you want a hands-off approach. However, Fargate also enforces a limit on layer size; see the documentation for full details.

Sam

samuelkarp on 5 Jul 2018

👍4 ❤1

All 13 comments

Hi @baank,
This is Docker behavior: it stages pulled layers in Docker's local storage area, usually /var/lib/docker on Linux (see https://docs.docker.com/v17.09/engine/userguide/storagedriver/imagesandcontainers/#sharing-promotes-smaller-images).
You can change this location with the --data-root flag, or with the -g flag on versions before v17.05.0 (see https://docs.docker.com/engine/reference/commandline/dockerd/#miscellaneous-options).
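
As a sketch of the --data-root option, the same setting can be kept in Docker's daemon configuration file instead of on the command line. The Python below only builds the JSON text. The mount point /mnt/docker-data and the helper name `daemon_config_with_data_root` are hypothetical, and writing the result to /etc/docker/daemon.json and restarting the daemon is assumed rather than shown:

```python
import json
from typing import Optional


def daemon_config_with_data_root(data_root: str, existing: Optional[dict] = None) -> str:
    """Return daemon.json text with "data-root" set, preserving other keys."""
    config = dict(existing or {})
    # "data-root" is the daemon.json equivalent of `dockerd --data-root`.
    config["data-root"] = data_root
    return json.dumps(config, indent=2, sort_keys=True)


if __name__ == "__main__":
    # /mnt/docker-data is a hypothetical mount point on a larger volume.
    print(daemon_config_with_data_root("/mnt/docker-data"))
```

Passing the current configuration as `existing` keeps any settings already in the file.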
Please let us know if you have any further questions on this issue.

yhlee-aws on 3 Jul 2018

Sorry but this position is just not acceptable. ECS does not work with larger sized containers. End of story.

And any workaround defeats the purpose of using a managed service like ECS.

nadenf on 3 Jul 2018

👍2 👎1

@baank I'd argue the description change is incorrect. I think the correct issue is still the "default" Amazon Linux ECS Optimized AMI comes with a small (I assume 8GB?) root volume.

We use ECS in production now with a 50GB dedicated EBS volume for /var/lib/docker and have no issues, with some large images in the multiple GB range. We use a custom AMI to fulfil our goals, but you can probably get away with the default AMI + some creative userdata script to handle the secondary partition, or use a larger root volume potentially.

CpuID on 3 Jul 2018

@CpuID .. I don't believe so at all.

The better solution IMHO would be to have the agent symlink /var/lib/docker back to the xvdc partition where it should've been from the beginning. Or failing that, tell Docker to write its data some place else.

Also your workaround isn't all that useful since it prevents auto-scaling. Not to mention what is the point of ECS if we are fiddling around with AMIs, Docker daemons etc. It's supposed to be simple.

nadenf on 3 Jul 2018

> The better solution IMHO would be to have the agent symlink /var/lib/docker back to the xvdc partition where it should've been from the beginning. Or failing that, tell Docker to write its data some place else.

Fair call, I still think your statement reflects the issue more accurately than "ECS is broken for larger sized containers" though :)

> Also your workaround isn't all that useful since it prevents auto-scaling. Not to mention what is the point of ECS if we are fiddling around with AMIs, Docker daemons etc. It's supposed to be simple.

I think there are different use cases out there, from super simple to very complex implementations. Having a default AMI to cover 80-90% of use cases that "just works", with the ability to customise when necessary, seems perfectly normal to me.

CpuID on 3 Jul 2018

But the point is still that ECS is broken for larger sized containers. It simply doesn't work.

There is no user facing error message and nothing mentioned in the documentation. You have to enable ECS_DEBUG and dig through agent logs to figure out what is happening in the first place.

All for a service that is supposed to be managed.

nadenf on 3 Jul 2018

👎1

Thanks for the reply. It's great to hear the design decisions that underpin the product.

There are a number of problems with the current situation though:

1) If there isn't enough disk space to do the pull it fails silently. There is no user facing error.
2) Maximum docker size restrictions aren't documented.
3) This restriction appears to be the same for Fargate.
4) Managing our own AMI is not preferred for large, non-IT enterprises as we own significant security and operational risks by doing so. This isn't just one team we are dealing with. Ideally, AWS would offer an ECS Optimised AMI for larger containers.

We've raised this with our AWS Enterprise support team but never sure if those go back to the Dev team so if you could pass on this feedback that would be great.

nadenf on 5 Jul 2018

> We've raised this with our AWS Enterprise support team but never sure if those go back to the Dev team so if you could pass on this feedback that would be great.

You've reached the dev team here! :smile:

> If there isn't enough disk space to do the pull it fails silently. There is no user facing error.

I've opened https://github.com/aws/amazon-ecs-agent/issues/1434 to track this.

> Maximum docker size restrictions aren't documented.

The maximums are really just defaults that can be changed. I'll reach out to our documentation team and see if they can add some detail, or you're welcome to contribute to the documentation here

> Managing our own AMI is not preferred for large, non-IT enterprises as we own significant security and operational risks by doing so. This isn't just one team we are dealing with. Ideally, AWS would offer an ECS Optimised AMI for larger containers.

I'm not sure if this was clear from my comment before, but you can change the sizes of the volumes without having to build your own AMI; the ECS-optimized AMI can be launched with much larger volumes for both /dev/xvda (root) and /dev/xvdcz (layer storage) by modifying your RunInstances API call, your Launch Configuration, or the CloudFormation template you're using.
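
A minimal boto3 sketch of the RunInstances route, assuming the two device names given above. The helper names `block_device_mappings` and `launch`, the gp2 volume type, and the example sizes are assumptions for illustration. A real ECS container instance also needs an instance profile and cluster configuration, which are omitted here:

```python
from typing import Dict, List


def block_device_mappings(root_gib: int, layer_gib: int) -> List[Dict]:
    """Build the BlockDeviceMappings parameter for the root volume and the
    layer-storage volume, with sizes in GiB."""
    def ebs(size_gib: int) -> Dict:
        # gp2 and DeleteOnTermination=True are assumed defaults for this sketch.
        return {"VolumeSize": size_gib, "VolumeType": "gp2", "DeleteOnTermination": True}

    return [
        {"DeviceName": "/dev/xvda", "Ebs": ebs(root_gib)},    # root volume
        {"DeviceName": "/dev/xvdcz", "Ebs": ebs(layer_gib)},  # layer storage
    ]


def launch(ami_id: str, instance_type: str, root_gib: int, layer_gib: int) -> Dict:
    """Launch one instance from the given AMI with enlarged volumes."""
    import boto3  # imported lazily so the helper above has no dependencies

    ec2 = boto3.client("ec2")
    return ec2.run_instances(
        ImageId=ami_id,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
        BlockDeviceMappings=block_device_mappings(root_gib, layer_gib),
    )


if __name__ == "__main__":
    # Prints the mapping only; calling launch() requires AWS credentials.
    print(block_device_mappings(root_gib=30, layer_gib=100))
```

The same list of mappings can be reused in a Launch Configuration, so the change does not have to prevent auto-scaling.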

samuelkarp on 6 Jul 2018

This issue comes back up with Amazon Linux 2 optimized for ECS. It's different since the layer storage doesn't exist, but putting /var/lib/docker/overlay2 and /var/lib/docker/volumes on different drives causes pain.

I can work around it. Just stating it here.

tedder on 23 Oct 2018

@tedder what is the workaround for the problem that you stated? We have also hit this, and there is no other option than to keep increasing the EBS volume size.

sbirla09 on 13 Mar 2019

I worked around it by mounting additional drives and more aggressively pruning. It's certainly easier with modern ECS, which doesn't spawn infinite failing children like it used to, but yeah.
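
The comment above does not say which commands were used for pruning. One possible approach, sketched in Python, is to run `docker image prune` on a schedule. The helper names and the 24h age are hypothetical, and the `until` filter compares against image creation time:

```python
import subprocess
from typing import List


def image_prune_command(older_than: str = "24h") -> List[str]:
    """Build a `docker image prune` call that removes all unused images
    created more than `older_than` ago, without prompting."""
    return [
        "docker", "image", "prune",
        "--all",    # include tagged but unused images, not only dangling ones
        "--force",  # skip the confirmation prompt
        "--filter", f"until={older_than}",
    ]


def prune_images(older_than: str = "24h") -> int:
    """Run the prune and return the docker CLI's exit code."""
    return subprocess.run(image_prune_command(older_than), check=False).returncode


if __name__ == "__main__":
    # Prints the command only; call prune_images() on a host with Docker.
    print(" ".join(image_prune_command()))
```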

tedder on 13 Mar 2019

I have yet to have this issue happen for us since we moved to AL2.
The root volume is 200-500 GB and we load multiple 10-20GB images.

With Amazon Linux 1 it was a bit hit and miss, and required modifying the dm pool