So here's a funny one: we're getting repeated CannotPullContainerError: API error (404) for one of our services. We have multiple containers running in production and attempted to deploy another today using our CI stack. We keep getting the above error even though we're using the exact same process as for our existing services.
We have a framework to handle all of this, allowing us to drop in new service app code and just push onto new ECS services. This system has been working fine so far.
Things I have tried:
I can also successfully pull the suspect image locally. I have also tried using a completely new ECR for the suspect image.
The suspect image is only 1.27GB, being pulled onto an m3.medium with nothing else running on it.
Is this an issue with the Docker image (runs fine locally and on CI), with the task definition (runs fine for other images), or with AWS?
Our ECS clusters and services are managed with Terraform, and we have other systems running with the exact same configuration. IAM access is set up correctly.
Versions:
The app is Python 2.7.
Any suggestions?
task def:
{
  "requiresAttributes": [
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.syslog",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19",
      "targetId": null,
      "targetType": null
    }
  ],
  "taskDefinitionArn": "arn:aws:ecs:ap-southeast-2:123456789:task-definition",
  "networkMode": "bridge",
  "status": "ACTIVE",
  "revision": 11,
  "taskRoleArn": "arn:aws:iam::123456789:role/ECSRole",
  "containerDefinitions": [
    {
      "volumesFrom": [],
      "memory": 2048,
      "extraHosts": null,
      "dnsServers": null,
      "disableNetworking": null,
      "dnsSearchDomains": null,
      "portMappings": [],
      "hostname": null,
      "essential": true,
      "entryPoint": null,
      "mountPoints": [],
      "name": "appname",
      "ulimits": null,
      "dockerSecurityOptions": null,
      "environment": [
        {
          "name": "AWS_REGION",
          "value": "ap-southeast-2"
        }
      ],
      "links": null,
      "workingDirectory": null,
      "readonlyRootFilesystem": null,
      "image": "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version",
      "command": null,
      "user": null,
      "dockerLabels": null,
      "logConfiguration": {
        "logDriver": "syslog",
        "options": {
          "syslog-address": "udp://logs5.papertrailapp.com:xxxxx"
        }
      },
      "cpu": 1024,
      "privileged": null,
      "memoryReservation": null
    }
  ],
  "placementConstraints": [],
  "volumes": [],
  "family": "appname"
}
ecs-agent log:
2017-08-24T08:06:50Z [INFO] Error transitioning container module="TaskEngine" task="appname:11 arn:aws:ecs:ap-southeast-2:123456789:task/717fe788-eadb-4cfa-a33c-d9528b60834b, Status: (NONE->RUNNING) Containers: [appname (NONE->RUNNING),]" container="appname(xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name:deploy_version) (NONE->RUNNING)" state="PULLED" error="API error (404): repository xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name not found: does not exist or no pull access
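Because the 404 message covers both "does not exist" and "no pull access", one quick sanity check is to split the image reference into its parts and compare each one against what actually exists in ECR (account, region, repository name, tag). Below is a minimal Python sketch, assuming the usual account.dkr.ecr.region.amazonaws.com/repository:tag layout. The name parse_ecr_image and the 12-digit account ID in the usage line are hypothetical, and digest references (@sha256:...) are deliberately not handled.

```python
import re
from typing import NamedTuple, Optional


class EcrImage(NamedTuple):
    account_id: str
    region: str
    repository: str
    tag: Optional[str]


# Assumed layout: <account>.dkr.ecr.<region>.amazonaws.com/<repository>[:<tag>]
_ECR_IMAGE = re.compile(
    r"^(?P<account_id>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_image(image: str) -> EcrImage:
    """Split an ECR image reference into account, region, repository and tag."""
    match = _ECR_IMAGE.match(image)
    if match is None:
        raise ValueError("not a tag-style ECR image reference: %r" % image)
    # tag is None when the reference has no ":<tag>" suffix
    return EcrImage(**match.groupdict())


# Hypothetical account ID, same shape as the image in the task definition.
ref = parse_ecr_image(
    "123456789012.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version"
)
print(ref.region, ref.repository, ref.tag)
```

Running the same function over the image string in the task definition and the one in the agent log makes it easy to spot a repository name, region, or account that differs between the two.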
Update - updating the AMI to ami-c1a6bda2 with docker version 17.03.2-ce and agent version 1.14.4 did not solve this issue.
@L226 We are sorry that you're still experiencing this issue with the new Docker version. Also, thanks for your step-by-step debugging, which will help our service team root-cause the problem. Our service team is working on a similar issue that may be related. To root-cause the problem you are experiencing, we need some extra information. Can you share the following with me at penyin (at) amazon.com:
Beyond that, can you clarify the following questions?
"I can also successfully pull the suspect image locally."
Was the image pulled after clearing all local Docker information?
"I have also tried using a completely new ECR for the suspect image."
Did the image in the new repository fail in exactly the same way as in the old repository, or did it pull and launch the task successfully?
thanks
Hi, thanks for your response. I'll get those emailed to you presently.
Successful local pull means that running the following on my laptop terminal works as expected:
docker pull xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version
No local docker info was cleaned.
Yes, the new ECR failed in exactly the same way as the old one.
@L226 Thanks for clarifying the questions. Have you sent the email? It looks like I haven't received anything about this. Would you please check again? My email is penyin (at) amazon.com. I'll let you know when I have received all the information.
thanks,
Peng
This seems to have been an issue with our CI workflow. Closing for now. Thanks for your help!
@L226 What was the CI workflow issue? I'm running into a similar issue on my setup and wondering if you could elaborate.
I believe it was an IAM issue: the ECR repository was not created using our (now standard) Terraform workflow, and thus some other systems did not have the correct IAM privileges to interact with it.
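For anyone comparing their own setup, here is a rough Python sketch of the permissions a principal generally needs in order to pull from a single ECR repository, written as an identity-based IAM policy document. It is an illustration only, not the Terraform configuration used in this thread; the function name ecr_pull_policy and the account ID in the usage line are hypothetical.

```python
import json

# Actions needed to fetch image manifests and layers from a repository.
PULL_ACTIONS = [
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
]


def ecr_pull_policy(region, account_id, repository):
    """Build an IAM policy document allowing pulls from one ECR repository."""
    repo_arn = "arn:aws:ecr:%s:%s:repository/%s" % (region, account_id, repository)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # The auth token call cannot be scoped to a single repository.
                "Effect": "Allow",
                "Action": "ecr:GetAuthorizationToken",
                "Resource": "*",
            },
            {
                "Effect": "Allow",
                "Action": PULL_ACTIONS,
                "Resource": repo_arn,
            },
        ],
    }


# Hypothetical account ID; region and repository name follow the thread.
policy = ecr_pull_policy("ap-southeast-2", "123456789012", "repo_name")
print(json.dumps(policy, indent=2))
```

If the repository lives in a different account from the one doing the pull, the repository's own policy would also need to allow that principal, which is one way a repository created outside the standard workflow could end up returning the "no pull access" variant of the 404.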