Amazon-ecs-agent: CannotPullContainerError, ECS Service task fails to start

Created on 24 Aug 2017 · 7 Comments · Source: aws/amazon-ecs-agent

Here's a funny one: one of our services repeatedly fails with CannotPullContainerError: API error (404). We have multiple containers running in production and attempted to deploy another today using our CI stack. We keep getting the above error even though we're using the exact same process as for our existing services.

Application docker image built on CI stack
Image pushed to ECR and tagged with the deploy version
CI stack crafts task definition from provided JSON skeleton, secret envvars from S3, terraform outputs and deploy version
New task definition sent to ECS service
ECS attempts to run the task with latest definition
Repeated failures to pull
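
The "crafts task definition from provided JSON skeleton" step can be sketched roughly as below. This is only an illustration, not the poster's actual CI code: `render_task_definition` is a hypothetical helper, and it assumes a single-container skeleton where every container gets the same image and environment.

```python
import copy
import json


def render_task_definition(skeleton, image, environment):
    """Return a copy of `skeleton` with the image and env vars filled in.

    Hypothetical helper: `skeleton` is a dict with `family` and
    `containerDefinitions`; `environment` is a plain {name: value} dict.
    """
    task_def = copy.deepcopy(skeleton)  # leave the skeleton untouched
    for container in task_def["containerDefinitions"]:
        container["image"] = image
        container["environment"] = [
            {"name": name, "value": value}
            for name, value in sorted(environment.items())
        ]
    return task_def


skeleton = {
    "family": "appname",
    "containerDefinitions": [
        {"name": "appname", "memory": 2048, "cpu": 1024, "essential": True}
    ],
}
rendered = render_task_definition(
    skeleton,
    image="xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version",
    environment={"AWS_REGION": "ap-southeast-2"},
)
print(json.dumps(rendered, indent=2))
```

Printing the rendered JSON before it is sent to ECS makes it easy to confirm that the image reference is exactly the repository and tag that were pushed.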

We have a framework to handle all of this, allowing us to drop in new service app code and just push onto new ECS services. This system has been working fine so far.

Things I have tried:

running the 'bad' task definition on a different ECS cluster - fails to deploy
running a known good task definition on the 'bad' ECS cluster - task runs successfully
manually crafting task definition JSON identical to the 'bad' one but pointing to a known good image - task runs successfully
manually pushing a known good image tagged with the deploy version of the 'bad' task definition to that task's ECR - task fails to deploy
terminating the EC2 instance and redeploying the ECS service task - still fails to deploy
using a bastion to ssh into the instance and trying a manual pull of suspect image - fails
using a bastion to ssh into the instance and trying a manual pull of some other image (alpine) - success

I can also successfully pull the suspect image locally. I have also tried using a completely new ECR for the suspect image.

The suspect image is only 1.27GB, being pulled onto an m3.medium with nothing else running on it.

Is this an issue with the docker image (runs fine locally and on CI), with the task definition (runs fine for other images), or with AWS?

Our ECS clusters and services are managed with terraform - we have other systems running with the exact same configuration. IAM access is set up correctly.

Versions:

ECS AMI - ami-c3233ba0
docker - 17.03.1-ce
ecs-agent - 1.14.3
terraform - v0.9.5
instance - m3.medium

The app is python 2.7.

Any suggestions?

task def:

{
  "requiresAttributes": [
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.syslog",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19",
      "targetId": null,
      "targetType": null
    }
  ],
  "taskDefinitionArn": "arn:aws:ecs:ap-southeast-2:123456789:task-definition",
  "networkMode": "bridge",
  "status": "ACTIVE",
  "revision": 11,
  "taskRoleArn": "arn:aws:iam::123456789:role/ECSRole",
  "containerDefinitions": [
    {
      "volumesFrom": [],
      "memory": 2048,
      "extraHosts": null,
      "dnsServers": null,
      "disableNetworking": null,
      "dnsSearchDomains": null,
      "portMappings": [],
      "hostname": null,
      "essential": true,
      "entryPoint": null,
      "mountPoints": [],
      "name": "appname",
      "ulimits": null,
      "dockerSecurityOptions": null,
      "environment": [
        {
          "name": "AWS_REGION",
          "value": "ap-southeast-2"
        }
      ],
      "links": null,
      "workingDirectory": null,
      "readonlyRootFilesystem": null,
      "image": "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version",
      "command": null,
      "user": null,
      "dockerLabels": null,
      "logConfiguration": {
        "logDriver": "syslog",
        "options": {
          "syslog-address": "udp://logs5.papertrailapp.com:xxxxx"
        }
      },
      "cpu": 1024,
      "privileged": null,
      "memoryReservation": null
    }
  ],
  "placementConstraints": [],
  "volumes": [],
  "family": "appname"
}

ecs-agent log:

2017-08-24T08:06:50Z [INFO] Error transitioning container module="TaskEngine" task="appname:11 arn:aws:ecs:ap-southeast-2:123456789:task/717fe788-eadb-4cfa-a33c-d9528b60834b, Status: (NONE->RUNNING) Containers: [appname (NONE->RUNNING),]" container="appname(xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name:deploy_version) (NONE->RUNNING)" state="PULLED" error="API error (404): repository xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name not found: does not exist or no pull access
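
When reading a line like this, one quick check is to compare the image reference the agent tried to pull with the one in the task definition. The sketch below assumes the `container="name(image) ..."` layout seen in the log line above and tag-based references only (digest references are not handled); both regular expressions and function names are illustrative. Note that the names in this post are redacted placeholders, so a difference between them here may just be a redaction artifact.

```python
import re

# Matches the `container="name(image) ..."` field of the agent log line.
CONTAINER_RE = re.compile(r'container="(?P<name>[^()"]+)\((?P<image>[^)]+)\)')
# Matches account.dkr.ecr.region.amazonaws.com/repository[:tag]
ECR_RE = re.compile(
    r"^(?P<account>[^.]+)\.dkr\.ecr\.(?P<region>[^.]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)
FIELDS = ("account", "region", "repository", "tag")


def image_from_log(line):
    """Return the image reference from an agent log line, or None."""
    match = CONTAINER_RE.search(line)
    return match.group("image") if match else None


def split_ecr_reference(image):
    """Split an ECR image reference into its parts, or return None."""
    match = ECR_RE.match(image)
    return match.groupdict() if match else None


# Excerpt of the log line above and the image from the task definition.
LOG_EXCERPT = (
    'container="appname(xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/'
    'image_name:deploy_version) (NONE->RUNNING)" state="PULLED"'
)
TASK_DEF_IMAGE = (
    "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version"
)

log_ref = split_ecr_reference(image_from_log(LOG_EXCERPT))
task_ref = split_ecr_reference(TASK_DEF_IMAGE)
mismatched = [field for field in FIELDS if log_ref[field] != task_ref[field]]
print(mismatched)
```

Any field listed in `mismatched` is worth checking against what was actually pushed to ECR.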

L226

All 7 comments

Update - updating the AMI to ami-c1a6bda2 with docker version 17.03.2-ce and agent version 1.14.4 did not solve this issue.

L226 on 25 Aug 2017

@L226 We are sorry that you're still experiencing this issue with the new docker version. Thanks also for your step-by-step debugging, which will help our service team root-cause the problem. Our service team is working on a similar issue that may be related. To root-cause the problem you are experiencing, we need some extra information. Can you share the following with me at penyin (at) amazon.com:

docker logs (preferably debug mode enabled) for the failure scenarios
account id
repository arn
image digest

Beyond that, can you clarify the following questions?

"I can also successfully pull the suspect image locally."

Was the image pulled after cleaning all local docker information?

"I have also tried using a completely new ECR for the suspect image."

Did the image in the new repository fail in exactly the same way as the old repository or was the finding here a successful pull and task launch?

thanks

richardpen on 5 Sep 2017

Hi, thanks for your response. I'll get those emailed to you presently.

Successful local pull means that running the following on my laptop terminal works as expected:

docker pull xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version

No local docker info was cleaned.

Yes, the new ECR failed in exactly the same way as the old one.

L226 on 6 Sep 2017

@L226 Thanks for clarifying. Have you sent the email? It looks like I haven't received anything about this. Would you please check again? My email is penyin (at) amazon.com. I'll let you know when I have received all the information.

thanks,
Peng

richardpen on 6 Sep 2017

This seems to have been an issue with our CI workflow. Closing for now. Thanks for your help!

L226 on 12 Sep 2017

@L226 What was the CI workflow issue? I'm running into a similar issue on my setup and wondering if you could elaborate.

dviramontes on 28 Feb 2018

👍2

I believe it was an IAM issue: the ECR was not created using our (now standard) workflow (terraform), and thus some other systems did not have the correct IAM privileges to interact with it.

L226 on 1 Mar 2018
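
For anyone hitting the same "does not exist or no pull access" error, a rough way to sanity-check a policy document is to look for the ECR actions that pulling an image normally requires. The sketch below is an assumption-laden illustration, not an AWS tool: `missing_pull_actions` is a hypothetical helper that only inspects `Allow` statements and ignores `Resource`, `Condition`, `NotAction` and explicit `Deny`, so a clean result does not prove that access works.

```python
import fnmatch

# ECR actions normally needed to pull an image.
PULL_ACTIONS = (
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
)


def _as_list(value):
    """IAM allows a single item or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def missing_pull_actions(policy):
    """Return the pull actions that no Allow statement in `policy` covers."""
    allowed = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") == "Allow":
            allowed.extend(_as_list(statement.get("Action", [])))
    return [
        action
        for action in PULL_ACTIONS
        # Action names are case-insensitive and may contain wildcards.
        if not any(
            fnmatch.fnmatchcase(action.lower(), pattern.lower())
            for pattern in allowed
        )
    ]


# Hypothetical policy that is missing two of the layer-related actions.
partial_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken", "ecr:BatchGetImage"],
            "Resource": "*",
        }
    ],
}
print(missing_pull_actions(partial_policy))
```

The same check can be run against both the instance role policy and the repository policy of a repository that was created outside the usual workflow.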

👍3
