We've been seeing an issue recently where the ECS agent will stop tasks with the error "CannotPullContainerError":
Status reason CannotPullContainerError: Error: image deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd not found
We've dug into this a bit, and it looks like the agent is in fact downloading the image properly, but throws the error anyway:
2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b3Downloading9c11881be2f16bd" status="Downloading [==================================================>] 157.3 MB/157.3 MB
"
2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd" status="Verifying Checksum
"
2016-03-11T17:07:54Z [DEBUG] Pulling image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd" status="Pulling repository 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar
"
2016-03-11T17:07:54Z [DEBUG] Pulling image complete module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [DEBUG] Pull completed for image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [INFO] Error transitioning container module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]" container="deathstar_staging(604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd) (NONE->RUNNING)" state="PULLED"
2016-03-11T17:07:54Z [DEBUG] Got container event for task module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]"
2016-03-11T17:07:54Z [DEBUG] Handling container change module="TaskEngine" task="deathstar_staging:426 arn:aws:ecs:us-east-1:604238712147:task/18596178-a7fa-407c-b95a-2270885d4524, Status: (NONE->RUNNING) Containers: [deathstar_staging (NONE->RUNNING),]" change="{container:0xc20835cc60 event:{Status:1 DockerContainerMetadata:{DockerId: ExitCode:<nil> PortBindings:[] Error:{transition:Pull msg:Error: image deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd not found} Volumes:map[]}}}"
The agent for this host is still connected, and can reach ECR. In the past, we've been able to solve this by terminating the affected host. Not a great long term solution for us, though.
What version of the ECS Agent and Docker are you using?
2016-03-11T17:07:54Z [DEBUG] Pulling image complete module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
2016-03-11T17:07:54Z [DEBUG] Pull completed for image module="TaskEngine" image="604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd"
These two lines are confusing and get emitted even when there is an error (see here and here). Can you check the Docker daemon log and see what information it has? We've found that the Docker remote API (which is how the Agent communicates with Docker) tends to return less information than the daemon shows in the logs. You may want to turn on debug mode for the Docker daemon with --debug.
Docker version 1.7, agent version 1.8.1. We were previously using Docker 1.9, but rolled back because of the xfs errors. We see pretty much the same info in the Docker logs: trying to create the container returns a 404:
time="2016-03-11T18:02:46.673928916Z" level=info msg="POST /v1.17/containers/create?name=ecs-deathstar_staging-426-deathstarstaging-c6a9949bc1e6d2be8d01"
time="2016-03-11T18:02:46.674782075Z" level=error msg="Handler for POST /containers/create returned error: No such image: 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd (tag: 9ce7d67bf402b337a75c8acf49c11881be2f16bd)"
time="2016-03-11T18:02:46.674835401Z" level=error msg="HTTP Error" err="No such image: 604238712147.dkr.ecr.us-east-1.amazonaws.com/deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd (tag: 9ce7d67bf402b337a75c8acf49c11881be2f16bd)" statusCode=404
this is also possibly related:
time="2016-03-11T18:04:06.354712941Z" level=debug msg="provided manifest reference \"9ce7d67bf402b337a75c8acf49c11881be2f16bd\" is not a digest: invalid checksum digest format"
time="2016-03-11T18:04:06.355721646Z" level=debug msg="Key check result: no graph"
Thanks for providing those details. I'll ask my colleagues on the ECR team if they've seen this before with Docker 1.7.
fwiw, i believe it was also happening when we were on 1.9
We're experiencing this issue, too, on some container hosts but not all. There is also a forum thread about it https://forums.aws.amazon.com/thread.jspa?threadID=227929&tstart=0 but it was not especially helpful.
I apologize for the delay in the response here.
Looking through our logs from this time period, we do see Docker falling back to the "/v1" registry calls for this repository, which returns a 404 because ECR is not a v1 registry. This results in a "No such image" error coming back from Docker. This can indicate that the Docker daemon is not able to download the original manifest, and is then attempting to download it as a v1 registry. Some of these fallbacks have been removed in newer versions of Docker (1.10 and later), so it will be slightly easier to debug these issues after upgrading.
We've seen cases reported where "No such image" can happen on a host running out of disk space or not allowing access to S3 in a VPC. The forum thread linked above calls out both of these. The invalid checksum digest indicates that after the download was complete, Docker recalculated the checksum of what was on disk and it didn't match the expected digest. In cases where a disk is full or a download was terminated in an unexpected way, this checksum calculation can be incorrect because the manifest or layer itself is incomplete or empty.
Can you provide your docker info & docker daemon debug output for this time frame? This should help provide more information for us to help in debugging this issue.
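For anyone gathering the information requested above, here is a minimal sketch of one way to collect it on the affected host. The function names `usage_pct` and `collect_diagnostics` are hypothetical helpers, not part of Docker or the ECS agent, and the sketch assumes Docker's default data directory `/var/lib/docker`. Note that `df` alone may not tell the whole story: if the daemon uses the devicemapper storage driver, `docker info` reports the thin pool's "Data Space Used" and "Data Space Available" separately from the root filesystem.

```shell
# Percentage of blocks used on the filesystem that holds the given path.
usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: print disk usage for Docker's data directory, then
# the daemon's own view of its storage via `docker info`.
collect_diagnostics() {
  # Assumption: /var/lib/docker is Docker's default data directory; pass a
  # different path if the daemon was started with another one.
  dir="${1:-/var/lib/docker}"
  # Fall back to the root filesystem if the directory does not exist.
  [ -d "$dir" ] || dir=/
  echo "disk used on $dir: $(usage_pct "$dir")%"
  if command -v docker >/dev/null 2>&1; then
    docker info
  else
    echo "docker CLI not found on PATH"
  fi
}
```

Running `collect_diagnostics` with no arguments right after a failed pull, and attaching its output along with the daemon's debug log, should cover most of what is being asked for here.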
Data point: these errors went away for us when we upgraded to the amzn-ami-2016.03.a-amazon-ecs-optimized AMI.
Hello @abby-fuller,
I'm going to go ahead and close this issue as we haven't heard from you in a while. Please feel free to reopen the issue if you are still experiencing problems.
Hi @juanrhenals. I'm also hitting this problem regularly. I'm currently using the newest ECS optimized AMI.
Re-open maybe?
I'm experiencing this issue as well. Also, I'm currently using the newest ECS optimized AMI.
Yeah I am having this problem too using ECR
Sorry I'm an idiot - I forgot the image tag, it works flawlessly!
Those who stumble here wondering what's up: don't forget to include the entire repository hostname in your image name when defining a task :) e.g. 123456.dkr.ecr.eu-west-1.amazonaws.com/your-repo/your-image-name
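A quick way to catch both mistakes mentioned above (a missing registry hostname and a missing tag) before registering a task definition is sketched below. `check_image_ref` is a hypothetical helper written for illustration; it only pattern-matches the string loosely and never contacts ECR, so it cannot tell whether the image actually exists.

```shell
# Hypothetical helper: loosely check that an image reference names an ECR
# registry host and carries an explicit tag or digest.
check_image_ref() {
  ref="$1"
  # The reference should start with <account>.dkr.ecr.<region>.amazonaws.com/
  case "$ref" in
    *.dkr.ecr.*.amazonaws.com/*) ;;
    *) echo "missing ECR registry hostname"; return 1 ;;
  esac
  # Drop everything up to the first slash, then look for ":tag" or "@digest".
  rest="${ref#*/}"
  case "$rest" in
    *:*|*@*) echo "ok" ;;
    *) echo "no explicit tag or digest"; return 1 ;;
  esac
}
```

For example, `check_image_ref deathstar:9ce7d67bf402b337a75c8acf49c11881be2f16bd` reports the missing hostname, while the same reference prefixed with the registry host passes.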
Ran into the exact same problem, and I've always been specifying the full repo hostname + image tag all along.
I realized the error was happening only on a single EC2 machine in the cluster, so I terminated it and let my ASG bring another machine back up, and the next task run on that node worked beautifully.
Not sure what could have been the root problem. Before terminating, I SSHed in and checked for disk space (17% usage); nothing seemed out of the ordinary. Perhaps I could have tried running some docker pull commands to debug further? Maybe next time.
Another node ran into the exact same problem, and I sshed into this machine. It looks like all docker pull commands are failing:
$ docker pull hello-world:latest
latest: Pulling from library/hello-world
6432e0ccba2d: Download complete
95f1eedc264a: Download complete
Pulling repository docker.io/library/hello-world
Tag latest not found in repository docker.io/library/hello-world
so it's not limited to ECR at all. Anything else I can try to debug this?
I needed to ssh into the machine and clean up old docker images to fix this. Docker keeps every tagged image on the machine and it'll prevent it from downloading new images.
A solution I've found is adding a cron job that cleans up the old/unused docker images.
0 * * * * docker rmi $(docker images -q)
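One caveat with the one-liner above: when there are no images, `docker rmi` is called with no arguments and exits with an error, and duplicate IDs (one image with several tags) are passed more than once. A slightly more defensive variant is sketched below; `unique_ids` and `cleanup_images` are hypothetical names chosen for illustration. Without `-f`, `docker rmi` refuses to delete images that containers still reference, so errors for in-use images are expected and harmless.

```shell
# Collapse newline-separated image IDs from stdin into a unique,
# space-separated list on one line.
unique_ids() {
  sort -u | tr '\n' ' ' | sed 's/ *$//'
}

# Hypothetical cleanup wrapper: attempt to remove every local image that is
# not referenced by a container.
cleanup_images() {
  ids=$(docker images -q | unique_ids)
  # Skip the call entirely when there is nothing to remove.
  [ -n "$ids" ] || return 0
  # Unquoted on purpose so that each ID becomes its own argument.
  docker rmi $ids
}
```

The function could be saved in a script and called from the same hourly cron entry in place of the raw command.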