Amazon-ecs-agent: ECS agent stops tasks with "failed to pull container" when it's already successfully pulled the image.

Created on 11 Mar 2016 · 16 Comments · Source: aws/amazon-ecs-agent

We've been seeing an issue recently where the ECS agent will stop tasks with the error "CannotPullContainerError":

Status reason   CannotPullContainerError: Error: image deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd not found

We've dug into this a bit, and it looks like the agent is in fact downloading the image properly, but throws the error anyway:

2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd" status="Downloading [==================================================>] 157.3 MB/157.3 MB
"
2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd" status="Verifying Checksum
"
2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd" status="Pulling repository 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar
"
2016-03-11T17:07:54Z [DEBUG] Pulling image complete module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [DEBUG] Pull completed for image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [INFO] Error transitioning container module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]" container="deathstar_staging(604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd) (NONE->RUNNING)" state="PULLED"
2016-03-11T17:07:54Z [DEBUG] Got container event for task module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]"
2016-03-11T17:07:54Z [DEBUG] Handling container change module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]" change="{container:0xc20835cc60 event:{Status:1 DockerContainerMetadata:{DockerId: ExitCode:<nil> PortBindings:[] Error:{transition:Pull msg:Error: image deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd not found} Volumes:map[]}}}"

The agent for this host is still connected, and can reach ECR. In the past, we've been able to solve this by terminating the affected host. Not a great long term solution for us, though.

more info needed

abby-fuller

Most helpful comment

Those who stumble here wondering what's up: don't forget to include the entire repository hostname in your image name when defining a task :) e.g. 123456.dkr.ecr.eu-west-1.amazonaws.com/your-repo/your-image-name

atcol on 23 Jul 2016

👍15
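The advice above can be turned into a quick pre-flight check on a task definition's image name. This is only a minimal sketch: the hostname pattern is inferred from the examples in this thread rather than from any official grammar, and `check_image_reference` is a hypothetical helper, not part of the ECS agent or the AWS SDK. A missing tag is flagged because a commenter in this thread reported it as their cause, even though Docker normally falls back to `latest`.

```python
import re

# Assumed shape of an ECR registry hostname, based on the examples in this thread
# (e.g. 604238712147.dkr.ecr.us-east-1.amazonaws.com). Not an official grammar.
ECR_HOST = re.compile(r"^\d+\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com$")


def check_image_reference(image):
    """Return a list of likely problems with an image name (hypothetical helper)."""
    problems = []

    # Everything before the first "/" should be the registry hostname.
    host, slash, _ = image.partition("/")
    if not slash or not ECR_HOST.match(host):
        problems.append("missing ECR registry hostname")

    # A tag (":tag") or digest ("@sha256:...") lives in the last path segment.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment and "@" not in last_segment:
        problems.append("no explicit tag")

    return problems


# The image name from the original error message lacks the registry hostname.
print(check_image_reference("deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"))
```

An empty list means neither of these two mistakes was found; it does not mean the pull will succeed.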

All 16 comments

What version of the ECS Agent and Docker are you using?

2016-03-11T17:07:54Z [DEBUG] Pulling image complete module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [DEBUG] Pull completed for image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"

These two lines are confusing and get emitted even when there is an error (see here and here). Can you check the Docker daemon log and see what information it has? We've found that the Docker remote API (which is how the Agent communicates with Docker) tends to return less information than the daemon shows in the logs. You may want to turn on debug mode for the Docker daemon with --debug.

samuelkarp on 11 Mar 2016
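Since the "Pull completed" lines appear even on failure, the more reliable signal in the agent log is the error embedded in the "Handling container change" line. A rough sketch of extracting it follows. The regular expression is an assumption based only on the agent 1.8.1 output quoted in this issue, and other agent versions may format the line differently. The sample line is abbreviated from the report (the `task=` field is dropped).

```python
import re

# Matches the Go-style struct dump seen in the agent's debug log in this issue.
# Assumption: the error always appears as "Error:{transition:X msg:Y} Volumes:".
CHANGE_ERROR = re.compile(
    r"Error:\{transition:(?P<transition>\w+) msg:(?P<msg>.*?)\} Volumes:"
)

# Abbreviated copy of the log line from the original report.
SAMPLE_LINE = (
    '2016-03-11T17:07:54Z [DEBUG] Handling container change module="TaskEngine" '
    'change="{container:0xc20835cc60 event:{Status:1 DockerContainerMetadata:'
    '{DockerId: ExitCode:<nil> PortBindings:[] Error:{transition:Pull msg:Error: '
    'image deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd not found} '
    'Volumes:map[]}}}"'
)


def find_transition_errors(lines):
    """Return (transition, message) pairs for every line carrying an error."""
    errors = []
    for line in lines:
        match = CHANGE_ERROR.search(line)
        if match:
            errors.append((match.group("transition"), match.group("msg")))
    return errors


for transition, msg in find_transition_errors([SAMPLE_LINE]):
    print(transition, "->", msg)
```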

😕1

Docker version 1.7, agent version 1.8.1. We were previously using Docker 1.9, but rolled back because of the xfs errors. We see pretty much the same info in the Docker logs; trying to create the container returns a 404:

time="2016-03-11T18:02:46.673928916Z" level=info msg="POST /v1.17/containers/create?name=ecs-deathstar_staging-426-deathstarstaging-c6a9949bc1e6d2be8d01" 
time="2016-03-11T18:02:46.674782075Z" level=error msg="Handler for POST /containers/create returned error: No such image: 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd (tag: 9ce7d67bf402b337a75c8acf49c11881be2f16bd)" 
time="2016-03-11T18:02:46.674835401Z" level=error msg="HTTP Error" err="No such image: 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd (tag: 9ce7d67bf402b337a75c8acf49c11881be2f16bd)" statusCode=404

This is also possibly related:

time="2016-03-11T18:04:06.354712941Z" level=debug msg="provided manifest reference \"9ce7d67bf402b337a75c8acf49c11881be2f16bd\" is not a digest: invalid checksum digest format"
time="2016-03-11T18:04:06.355721646Z" level=debug msg="Key check result: no graph"

abby-fuller on 11 Mar 2016
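Daemon log lines like the ones above are `key=value` pairs with optional double quotes, so they can be split with the standard library when searching a long log for failures. The following is a sketch under that assumption; `parse_daemon_line` and `image_not_found` are hypothetical helper names, and the sample is one of the lines quoted in this comment.

```python
import shlex

# One of the daemon log lines quoted in this thread.
SAMPLE_LINE = (
    'time="2016-03-11T18:02:46.674835401Z" level=error msg="HTTP Error" '
    'err="No such image: 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:'
    '9ce7d67bf402b337a75c8acf49c11881be2f16bd (tag: '
    '9ce7d67bf402b337a75c8acf49c11881be2f16bd)" statusCode=404'
)


def parse_daemon_line(line):
    """Split a key=value log line into a dict; tokens without '=' are ignored."""
    fields = {}
    # shlex handles the double-quoted values, including escaped quotes inside them.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def image_not_found(lines):
    """Return parsed error records that mention a missing image."""
    hits = []
    for line in lines:
        fields = parse_daemon_line(line)
        text = fields.get("err", "") + fields.get("msg", "")
        if fields.get("level") == "error" and "No such image" in text:
            hits.append(fields)
    return hits


record = image_not_found([SAMPLE_LINE])[0]
print(record["time"], record.get("statusCode"), record["err"])
```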

Thanks for providing those details. I'll ask my colleagues on the ECR team if they've seen this before with Docker 1.7.

samuelkarp on 11 Mar 2016

FWIW, I believe it was also happening when we were on 1.9.

abby-fuller on 11 Mar 2016

We're experiencing this issue, too, on some container hosts but not all. There is also a forum thread about it (https://forums.aws.amazon.com/thread.jspa?threadID=227929&tstart=0), but it was not especially helpful.

jbergknoff on 5 Apr 2016

I apologize for the delay in the response here.

Looking through our logs from this time period, we do see Docker falling back to the "/v1" registry calls for this repository, which returns a 404 because ECR is not a v1 registry. This results in a "No such image" error coming back from Docker. This can indicate that the Docker daemon is not able to download the original manifest, and is then attempting to download it as a v1 registry. Some of these fallbacks have been removed in newer versions of Docker (1.10 and later), so it will be slightly easier to debug these issues after upgrading.

We've seen cases reported where "No such image" can happen on a host that is running out of disk space or that does not allow access to S3 in a VPC. The forum thread linked above calls out both of these. The invalid checksum digest is indicating that after the download was complete, Docker re-calculated what was on disk and it didn't match the expected digest. In cases where a disk is full or a download was terminated in an unexpected way, this checksum calculation can be incorrect because the manifest or layer itself is incomplete or empty.

Can you provide your docker info & docker daemon debug output for this time frame? This should help provide more information for us to help in debugging this issue.

sentientmonkey on 26 May 2016
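For the disk-space cause mentioned above, a first check on the host can be scripted. This is a minimal sketch, not a full diagnosis: it assumes Docker's data lives under `/var/lib/docker` (the usual default on Linux) and uses an arbitrary 90% threshold. Depending on the storage driver, image layers may sit on a separate volume that a filesystem check on this path does not reflect, so a low number here does not rule the cause out.

```python
import shutil


def disk_usage_fraction(path):
    """Return used/total for the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def is_nearly_full(path, threshold=0.9):
    """True when the filesystem holding `path` is at or above `threshold`."""
    return disk_usage_fraction(path) >= threshold


# Example for a container host (path and threshold are assumptions):
#   if is_nearly_full("/var/lib/docker"):
#       print("Docker data directory is nearly full")
print(f"root filesystem usage: {disk_usage_fraction('/'):.0%}")
```

The fraction is computed as used over total, so it can differ slightly from the percentage that `df` reports.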

Data point: these errors went away for us when we upgraded to the amzn-ami-2016.03.a-amazon-ecs-optimized AMI.

jbergknoff on 26 May 2016

Hello @abby-fuller,
I'm going to go ahead and close this issue as we haven't heard from you in a while. Please feel free to reopen the issue if you are still experiencing problems.

juanrhenals on 2 Jun 2016

Hi @juanrhenals. I'm also hitting this problem regularly. I'm currently using the newest ECS optimized AMI.

Re-open maybe?

schickling on 29 Jun 2016

👍8

I'm experiencing this issue as well. Also, I'm currently using the newest ECS optimized AMI.

cjpetrus on 12 Jul 2016

Yeah, I am having this problem too, using ECR.

Pcummings on 15 Jul 2016

Sorry I'm an idiot - I forgot the image tag, it works flawlessly!

Pcummings on 15 Jul 2016

👍1

Ran into the exact same problem, and I've always been specifying the full repo hostname + image tag all along.

I realized the error was happening only on a single EC2 machine in the cluster, so I terminated it and let my ASG bring another machine back up, and the next task run on that node worked beautifully.

Not sure what could have been the root problem. Before terminating, I SSHed in and checked for disk space (17% usage); nothing seemed out of the ordinary. Perhaps I could have tried running some docker pull commands to debug further? Maybe next time.

smoll on 27 Jul 2016

Another node ran into the exact same problem, so I SSHed into the machine. It looks like all docker pull commands are failing:

$ docker pull hello-world:latest
latest: Pulling from library/hello-world
6432e0ccba2d: Download complete
95f1eedc264a: Download complete
Pulling repository docker.io/library/hello-world
Tag latest not found in repository docker.io/library/hello-world

so it's not limited to ECR at all. Anything else I can try to debug this?

smoll on 27 Jul 2016

I needed to SSH into the machine and clean up old Docker images to fix this. Docker keeps every tagged image on the machine, and the accumulated images eventually prevent it from downloading new ones.

A solution I've found is adding a cron job that cleans up the old/unused docker images.

0 * * * * docker rmi $(docker images -q)

nickromano on 2 Sep 2016

👍8 🎉3 ❤1
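The cron entry above asks Docker to remove every image, including those that containers still reference, and relies on Docker refusing the ones in use. A slightly more selective variant is sketched below. It only uses `docker images -q --no-trunc`, `docker ps -a -q`, `docker inspect --format` and `docker rmi`; the function names are hypothetical, and it assumes the image IDs printed by `docker images` and `docker inspect` share the same format on the installed Docker version. It still deletes cached images that no container references, so the next pull of those images will download them again.

```python
import subprocess


def docker_lines(*args):
    """Run a docker CLI command and return its non-empty output lines."""
    out = subprocess.run(
        ["docker", *args], check=True, capture_output=True, text=True
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def images_in_use():
    """IDs of images referenced by any container, running or stopped."""
    containers = docker_lines("ps", "-a", "-q")
    if not containers:
        return set()
    return set(docker_lines("inspect", "--format", "{{.Image}}", *containers))


def removable_images(all_ids, in_use_ids):
    """Keep order, drop duplicates and anything a container still references."""
    seen = set()
    result = []
    for image_id in all_ids:
        if image_id in in_use_ids or image_id in seen:
            continue
        seen.add(image_id)
        result.append(image_id)
    return result


def cleanup():
    """Remove images that no container references; skip rmi if none are left."""
    candidates = removable_images(
        docker_lines("images", "-q", "--no-trunc"), images_in_use()
    )
    if candidates:
        # check=False: an image may still be a parent of one that is in use.
        subprocess.run(["docker", "rmi", *candidates], check=False)
```

Calling `cleanup()` from a scheduled script would take the place of the one-line cron command.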
