Amazon-ecs-agent: Amazon ECS vs big images

Created on 24 Feb 2017 · 10 Comments · Source: aws/amazon-ecs-agent

I am trying to run a task with big Docker images (20-500 GB). I've set the Docker option dm.basesize=9999G. I started the task, logged in to the EC2 instance, and am watching the amazon/amazon-ecs-agent logs:

2017-02-24T09:29:45Z [INFO] Error while pulling container; will try to run anyways module="TaskEngine" ... err="write /var/lib/docker/tmp/GetImageBlob***: no space left on device"
2017-02-24T09:29:49Z [ERROR] Error inspecting image ***.dkr.ecr.us-east-1.amazonaws.com/***:latest: no such image

I think that Amazon ECS is not ready for big Docker images. The documentation says nothing about this problem. Amazon ECS is not a serious production system. This is very bad.

I can run my Docker images using custom shell scripts, but then I don't know how to support awslogs, Auto Scaling groups, launch configurations, etc.

andrew-aladev

Most helpful comment

I am having the same issue that @mrnugget described. The VFree shown by sudo vgs is quite small compared to the VSize. I am sure my pool is big enough and has free space. However, pulling an 8.2 GB image from my ECS repo fails on the "optimized" AMI.

If you want your EC2 ECS instance to run big images, make sure you have the following:

A big enough thin pool (you can extend the default 22 GB by adding a new EBS volume and running vgextend on your docker volume group -> http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-ami-storage-config.html)
A big enough main partition to pull your ECS repo image (you can enlarge the 8 GB volume in your EBS volume list, then grow the partition with sudo growpart /dev/xvda 1 and reboot).
A big enough container size (you can raise it with a cloud-init boothook that increases dm.basesize https://aws.amazon.com/pt/premiumsupport/knowledge-center/increase-default-ecs-docker-limit/)

Amazon, please make this configurable programmatically!

Edit: please note that the dm.basesize change will only apply to images pulled after the change.

pdefreitas on 14 Aug 2017

👍3

All 10 comments

I've found that Amazon pulls the Docker image, whose size is not known in advance, onto the 8 GB root volume. When the root volume fills up, the pull fails.

There is no way to inspect this error except by watching the "amazon/amazon-ecs-agent" logs in the console. The task in the admin panel just disappeared. This is ridiculous; I won't ever use or trust Amazon.

andrew-aladev on 24 Feb 2017

Hi @andrew-aladev. I'm sorry you're struggling with ECS. In general, we do expect large images to work in ECS, and I'd like to gather some more info about your setup so we can help identify where things are going wrong. In particular, we do not pull images to the root filesystem, so the 8GB root should not be the issue in this case.

Are you using the ECS optimized AMI? What is the AMI ID you're using?

What version of the ecs agent and docker are you using? Can you share the output of docker info with me? If you don't want to post it here, feel free to send it to me privately via email to [email protected]. How is your image constructed? How many layers are there? Is it as large as it is because of one really big layer, or several moderately big layers?

nmeyerhans on 25 Feb 2017

I am sure that amazon/amazon-ecs-agent pulls the image, or layers of the image, of unknown size onto the root filesystem. I saw this behaviour in a tmux session. We are using the default *ecs-optimized AMI.

You can append OPTIONS="${OPTIONS} --storage-opt dm.basesize=9999GB" to /etc/sysconfig/docker, restart Docker, and see the same docker info that we have.
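The /etc/sysconfig/docker change described above could be scripted roughly as follows. This is a minimal Python sketch, not an official tool: ensure_docker_option is a hypothetical helper name, and it only transforms the file's text. Reading the file, writing it back, and restarting Docker are deliberately left out.

```python
def ensure_docker_option(config_text: str, option: str) -> str:
    """Return config_text with a line that appends `option` to OPTIONS.

    The line is added only once, so running the helper repeatedly
    does not keep growing the file.
    """
    # Produces e.g.: OPTIONS="${OPTIONS} --storage-opt dm.basesize=9999GB"
    line = f'OPTIONS="${{OPTIONS}} {option}"'
    if line in config_text.splitlines():
        return config_text  # already present, nothing to do
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + line + "\n"


if __name__ == "__main__":
    # Hypothetical starting content; a real file will differ.
    sample = 'OPTIONS="--default-ulimit nofile=1024:4096"\n'
    print(ensure_docker_option(sample, "--storage-opt dm.basesize=9999GB"), end="")
```

Keeping the helper free of file I/O makes it easy to check before pointing it at a real instance.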

I logged in to the EC2 instance and installed tmux.
docker logs -f <ecs-agent-id>
I am running a task with several images from "Amazon ECS" -> "Task Definitions". These images are big and have big layers. I've pushed them to Amazon ECR, and I know that one of our images has a 10 GB layer.
I can see free space on the root filesystem run low, and then ecs-agent fails with no space left on device and no such image.
The task just disappeared from the ECS cluster without logs. The user has no way to know why Amazon ECS failed. So I am sure that Amazon ECS is not a solution for production.

I tried pulling this image again with docker pull <image_id>, and it failed with no space left on device while downloading a layer of about 10 GB.

Do you really want to download images of unknown size? Your *ecs-optimized AMI should have a separate, scalable partition with unlimited size, and ecs-agent should use that partition to download and extract Docker images.
You could check the size of the image before downloading and report an error if the image (or its layers) is too big.

You should provide at least basic logs for the user who is sitting in your beautiful admin panel and waiting for results. This is ridiculous.

Amazon, please add a big "pre alpha" icon to your "Amazon ECS" "service" page. You haven't implemented and tested the most basic Docker features.

andrew-aladev on 25 Feb 2017

@nmeyerhans, you are storing /var/lib/docker/image/devicemapper/layerdb on the root filesystem in your "optimized" image.

We've created a custom AMI and pointed Docker at a big disk as its root (-g /mnt/disk). It works. But we couldn't use our custom AMI in an ECS cluster. Why?

andrew-aladev on 27 Feb 2017

Apologies; when I tested previously, I used an image with large layers, but they compressed well enough that they fit onto the 8 GB root filesystem in our default image so I didn't experience the same issue that you describe. That was an oversight on my part. Running the same tests with a less compressible image does result in docker filling the root filesystem as it stages the layer data in /var/lib/docker.
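To see which filesystem the staging directory lives on, and how much room it has, something like the following Python sketch may help. The function names are hypothetical, the paths are the ones mentioned in this thread, and the script assumes it runs on the instance itself with enough permissions to stat those paths.

```python
import os
import shutil


def nearest_existing(path):
    """Walk up from `path` until a directory that exists is found."""
    path = os.path.abspath(path)
    while not os.path.exists(path):
        path = os.path.dirname(path)
    return path


def same_filesystem(path_a, path_b):
    """True if both paths are on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # /var/lib/docker/tmp is where the failing writes in the logs went.
    staging = nearest_existing("/var/lib/docker/tmp")
    print("checked path:", staging)
    print("on root filesystem:", same_filesystem(staging, "/"))
    print("free GiB: %.1f" % free_gib(staging))
```

If the checked path reports the same device as / and only a few GiB free, a layer larger than that cannot be staged regardless of how large the thin pool is.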

You should be able to run custom images in your cluster. Your custom image is derived from our optimized AMI, correct? There are a couple of notable things to check:

If you built from a snapshot of an existing instance, ensure that you delete /var/lib/ecs/data/ecs_agent_data.json when snapshotting.
Double check that your instance is running with the ecsInstanceRole IAM role.

If you've confirmed all of those things, you may be able to gather additional information by inspecting the ecs-init and ecs-agent logs in /var/log/ecs. You're welcome to send me the relevant logs if you need help understanding their content.

Regarding your previous observation that the failed task simply "disappeared" from the web UI, stopped tasks should still be visible, but you'll likely need to click on the "Stopped" button when viewing tasks. Our Checking Stopped Tasks for Errors documentation may be helpful there. Stopped tasks do eventually disappear, similar to, for example, terminated EC2 instances. This typically happens within 1 hour.

nmeyerhans on 28 Feb 2017

❤1

Hi @andrew-aladev I'm closing this issue as we haven't heard back from you in a while. Please feel free to reopen this issue/add additional comments if you get to the items mentioned by @nmeyerhans. Thanks!

aaithal on 15 Apr 2017

I'm running into the same issue: while there's space left on the attached EBS volumes, docker can't pull the complete image down. Here is one of the relevant lines in the ecs-agent.log:

2017-07-31T08:42:37Z [INFO] Error transitioning container module="TaskEngine" task="<OUR_TASK_DEFINITION> arn:aws:ecs:<AWS_REGION>:<ACCOUNT_ID>):task/815da137-4291-4ebe-a9af-47ba2846f629, Status: (NONE->RUNNING) Containers: [<CONTAINER_NAME> (NONE->RUNNING),]" container="<CONTAINER_NAME>(<ACCOUNT_ID>).dkr.ecr.<AWS_REGION>.amazonaws.com/<IMAGE_NAME>) (NONE->RUNNING)" state="PULLED" error="write /var/lib/docker/tmp/GetImageBlob102484030: no space left on device"

The error contained in this line: write /var/lib/docker/tmp/GetImageBlob102484030: no space left on device
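Since the agent log is currently the only place this failure shows up, a small filter can pull the error text out of such lines. This is a rough Python sketch based only on the two log lines quoted in this thread: it assumes the message sits in an err="..." or error="..." field, and the function names are hypothetical.

```python
import re

# Matches err="..." or error="..." fields as seen in the quoted agent logs.
FIELD_RE = re.compile(r'\b(?:err|error)="([^"]*)"')


def pull_error(log_line):
    """Return the text of the err/error field, or None if there is none."""
    match = FIELD_RE.search(log_line)
    return match.group(1) if match else None


def is_disk_full(log_line):
    """True if the line's error field reports a full filesystem."""
    message = pull_error(log_line)
    return message is not None and "no space left on device" in message


if __name__ == "__main__":
    line = ('2017-07-31T08:42:37Z [INFO] Error transitioning container '
            'module="TaskEngine" state="PULLED" '
            'error="write /var/lib/docker/tmp/GetImageBlob102484030: '
            'no space left on device"')
    print(pull_error(line))
    print(is_disk_full(line))
```

Lines without such a field, like the "Error inspecting image" line quoted earlier, simply return None.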

Now, as @andrew-aladev already reported, it seems like /var/lib/docker/tmp does not sit on the mounted EBS volumes. So no matter how much space I give them, Docker will still run out of storage.

And yes, there is space available:

[ec2-user ~]$ sudo vgs
  VG     #PV #LV #SN Attr   VSize  VFree
  docker   1   1   0 wz--n- 80.00g 736.00m
[ec2-user ~]$ sudo lvs
  LV          VG     Attr       LSize  Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  docker-pool docker twi-aot--- 79.11g             0.53   0.27
[ec2-user ~]$ docker info | grep Space
 Data Space Used: 450.4 MB
 Data Space Total: 84.95 GB
 Data Space Available: 84.5 GB
 Metadata Space Used: 233.5 kB
 Metadata Space Total: 88.08 MB
 Metadata Space Available: 87.85 MB
 Thin Pool Minimum Free Space: 8.495 GB

Agent Version:

Amazon ECS Agent - v1.14.3 (15de319)
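The docker info figures above can be turned into numbers to confirm that the thin pool itself is nearly empty. The Python sketch below is an illustration only: it assumes the sizes are printed in decimal units (which is consistent with 79.11g in lvs showing up as 84.95 GB here), and the function names are hypothetical.

```python
import re

# Assumption: sizes are decimal (1 GB = 10**9 bytes).
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
SPACE_RE = re.compile(
    r"^[ \t]*(.+?):[ \t]*(\d+(?:\.\d+)?)[ \t]*(kB|MB|GB|TB|B)[ \t]*$",
    re.MULTILINE,
)

# Output of `docker info | grep Space` as posted in this thread.
SAMPLE = """\
 Data Space Used: 450.4 MB
 Data Space Total: 84.95 GB
 Data Space Available: 84.5 GB
 Metadata Space Used: 233.5 kB
 Metadata Space Total: 88.08 MB
 Metadata Space Available: 87.85 MB
 Thin Pool Minimum Free Space: 8.495 GB
"""


def parse_space_lines(docker_info_text):
    """Map each 'Name: <number> <unit>' line to a size in bytes."""
    return {
        name: float(value) * UNITS[unit]
        for name, value, unit in SPACE_RE.findall(docker_info_text)
    }


def data_used_percent(sizes):
    """Share of the thin pool's data space that is in use, in percent."""
    return 100.0 * sizes["Data Space Used"] / sizes["Data Space Total"]


if __name__ == "__main__":
    sizes = parse_space_lines(SAMPLE)
    print("data space used: %.2f%%" % data_used_percent(sizes))
```

A result well under one percent agrees with the Data% column from lvs, which points at some other filesystem as the one that is running out of space.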

mrnugget on 31 Jul 2017

How can this be done with Elastic Beanstalk? Are there any ready-to-use scripts?

Defozo on 5 Jun 2018

@Defozo You can use the BlockDeviceMapping, RootVolumeSize and RootVolumeIOPS configuration options for EB. See the first example here: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_docker.container.console.html#docker-volumes

soleares on 25 May 2019
