Amazon-ecs-agent: CannotPullContainerError, ECS Service task fails to start

Created on 24 Aug 2017 · 7 Comments · Source: aws/amazon-ecs-agent

So here's a funny one: repeated CannotPullContainerError: API error (404) for one of our services. We have multiple containers running in production and attempted to deploy another today using our CI stack. We keep getting the above error even though we're using the exact same process as for our existing services.

  1. Application docker image built on CI stack
  2. Image pushed to ECR and tagged with the deploy version
  3. CI stack crafts task definition from provided JSON skeleton, secret envvars from S3, terraform outputs and deploy version
  4. New task definition sent to ECS service
  5. ECS attempts to run the task with latest definition
  6. Repeated failures to pull
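
To make step 3 concrete, here is a minimal sketch of how a CI job might fill a task definition skeleton with the image reference and secret environment variables. It is not the actual framework used above: the function name `render_task_definition`, the `SECRET_KEY` variable and the reduced skeleton are all hypothetical, and the secrets are passed in as a plain dict rather than fetched from S3.

```python
import copy
import json


def render_task_definition(skeleton, image_repo, deploy_version, secrets):
    """Return a copy of the skeleton with image and env vars filled in.

    Hypothetical helper: assumes every container in the skeleton should
    run the same image and receive the same secret environment variables.
    """
    task_def = copy.deepcopy(skeleton)
    for container in task_def["containerDefinitions"]:
        # The image reference is "<repository URI>:<tag>".
        container["image"] = "{}:{}".format(image_repo, deploy_version)
        # Merge existing env vars with the secrets (secrets win on conflict).
        env = {e["name"]: e["value"] for e in container.get("environment") or []}
        env.update(secrets)
        container["environment"] = [
            {"name": name, "value": value} for name, value in sorted(env.items())
        ]
    return task_def


# Reduced skeleton, modelled on the task definition shown in this issue.
skeleton = {
    "family": "appname",
    "containerDefinitions": [
        {
            "name": "appname",
            "image": None,
            "memory": 2048,
            "cpu": 1024,
            "environment": [{"name": "AWS_REGION", "value": "ap-southeast-2"}],
        }
    ],
}

rendered = render_task_definition(
    skeleton,
    "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name",
    "deploy_version",
    {"SECRET_KEY": "example"},  # placeholder, not a real secret
)
print(json.dumps(rendered, indent=2))
```

Printing the rendered JSON before it is registered makes it easy to check that the image field points at the repository and tag that were actually pushed in step 2.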

We have a framework to handle all of this, allowing us to drop in new service app code and just push onto new ECS services. This system has been working fine so far....

Things I have tried:

  • running the 'bad' task definition on a different ECS cluster - fails to deploy
  • running a known good task definition on the 'bad' ECS cluster - task runs successfully
  • manually crafting task definition JSON identical to the 'bad' one but pointing to a known good image - task runs successfully
  • manually pushing a known good image tagged with the deploy version of the 'bad' task definition to that task's ECR - task fails to deploy
  • terminating the EC2 instance and redeploying the ECS service task - still fails to deploy
  • using a bastion to ssh into the instance and trying a manual pull of suspect image - fails
  • using a bastion to ssh into the instance and trying a manual pull of some other image (alpine) - success

I can also successfully pull the suspect image locally. I have also tried using a completely new ECR for the suspect image.

The suspect image is only 1.27GB, being pulled onto an m3.medium with nothing else running on it.

Is this an issue with the docker image (runs fine locally and on CI), with the task definition (runs fine for other images) or with AWS?

Our ECS clusters and services are managed with terraform - we have other systems running with the exact same configuration. IAM access is set up correctly.

Versions:

  • ECS AMI - ami-c3233ba0
  • docker - 17.03.1-ce
  • ecs-agent - 1.14.3
  • terraform - v0.9.5
  • instance - m3.medium

The app is python 2.7.

Any suggestions?

task def:

{
  "requiresAttributes": [
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.logging-driver.syslog",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.ecr-auth",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.task-iam-role",
      "targetId": null,
      "targetType": null
    },
    {
      "value": null,
      "name": "com.amazonaws.ecs.capability.docker-remote-api.1.19",
      "targetId": null,
      "targetType": null
    }
  ],
  "taskDefinitionArn": "arn:aws:ecs:ap-southeast-2:123456789:task-definition",
  "networkMode": "bridge",
  "status": "ACTIVE",
  "revision": 11,
  "taskRoleArn": "arn:aws:iam::123456789:role/ECSRole",
  "containerDefinitions": [
    {
      "volumesFrom": [],
      "memory": 2048,
      "extraHosts": null,
      "dnsServers": null,
      "disableNetworking": null,
      "dnsSearchDomains": null,
      "portMappings": [],
      "hostname": null,
      "essential": true,
      "entryPoint": null,
      "mountPoints": [],
      "name": "appname",
      "ulimits": null,
      "dockerSecurityOptions": null,
      "environment": [
        {
          "name": "AWS_REGION",
          "value": "ap-southeast-2"
        }
      ],
      "links": null,
      "workingDirectory": null,
      "readonlyRootFilesystem": null,
      "image": "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version",
      "command": null,
      "user": null,
      "dockerLabels": null,
      "logConfiguration": {
        "logDriver": "syslog",
        "options": {
          "syslog-address": "udp://logs5.papertrailapp.com:xxxxx"
        }
      },
      "cpu": 1024,
      "privileged": null,
      "memoryReservation": null
    }
  ],
  "placementConstraints": [],
  "volumes": [],
  "family": "appname"
}

ecs-agent log:

2017-08-24T08:06:50Z [INFO] Error transitioning container module="TaskEngine" task="appname:11 arn:aws:ecs:ap-southeast-2:123456789:task/717fe788-eadb-4cfa-a33c-d9528b60834b, Status: (NONE->RUNNING) Containers: [appname (NONE->RUNNING),]" container="appname(xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name:deploy_version) (NONE->RUNNING)" state="PULLED" error="API error (404): repository xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name not found: does not exist or no pull access
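
When reading an error like the one above, it can help to split the image reference into its parts and compare them with the task definition. In the redacted snippets in this issue the task definition names `repo_name` while the agent log names `image_name`; that may only be a redaction artifact, but it is the kind of mismatch worth ruling out. The sketch below is a hypothetical helper (`parse_ecr_image` is not part of any AWS tooling) that assumes the common `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` form and does not handle digests.

```python
import re

# Assumed shape of an ECR image reference with an optional tag.
ECR_IMAGE_RE = re.compile(
    r"^(?P<account>[^.]+)\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com/"
    r"(?P<repository>[^:@]+)(?::(?P<tag>[^@]+))?$"
)


def parse_ecr_image(image):
    """Split an ECR image reference into account, region, repository, tag."""
    match = ECR_IMAGE_RE.match(image)
    if match is None:
        raise ValueError("not an ECR image reference: {!r}".format(image))
    return match.groupdict()


# Image strings as they appear (redacted) in this issue.
task_def_image = parse_ecr_image(
    "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version"
)
agent_log_image = parse_ecr_image(
    "xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/image_name:deploy_version"
)

# Compare the fields that must agree for the pull to target the same repository.
for field in ("account", "region", "repository", "tag"):
    same = task_def_image[field] == agent_log_image[field]
    print("{}: {}".format(field, "match" if same else "MISMATCH"))
```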

All 7 comments

Update - updating the AMI to ami-c1a6bda2 with docker version 17.03.2-ce and agent version 1.14.4 did not solve this issue.

@L226 We are sorry that you're still experiencing this issue with the new docker version. Thanks also for your step-by-step debugging, which will help our service team root-cause the problem. Our service team is working on a similar issue that may be related. To root-cause the problem you are experiencing, we need some extra information. Can you share the following with me at penyin (at) amazon.com:

  • docker logs (preferably debug mode enabled) for the failure scenarios
  • account id
  • repository arn
  • image digest

Beyond that, can you answer the following questions?

"I can also successfully pull the suspect image locally."

Was the image pulled after cleaning all local docker information?

"I have also tried using a completely new ECR for the suspect image."

Did the image in the new repository fail in exactly the same way as in the old repository, or did the pull and task launch succeed?

thanks

Hi, thanks for your response. I'll get those emailed to you presently.

Successful local pull means that running the following on my laptop terminal works as expected:

docker pull xxxxxxxxxxxxx.dkr.ecr.ap-southeast-2.amazonaws.com/repo_name:deploy_version

No local docker info was cleaned.

Yes, the new ECR failed in exactly the same way as the old one.

@L226 Thanks for answering the questions. Have you sent the email? It looks like I haven't received anything about this yet. Would you please check again? My email is penyin (at) amazon.com. I'll let you know when I have received all the information.

thanks,
Peng

This seems to have been an issue with our CI workflow. Closing for now. Thanks for your help!

@L226 What was the CI workflow issue? I'm running into a similar issue on my setup and was wondering if you could elaborate.

I believe it was an IAM issue: the ECR repository was not created using our (now standard) terraform workflow, and thus some other systems did not have the correct IAM privileges to interact with it.
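
For anyone hitting the same 404, one quick check is whether the policy attached to the pulling principal actually allows the ECR actions a pull needs. The sketch below is a simplified, offline check over a policy document that has already been loaded as a dict. The helper name `missing_pull_actions` is hypothetical, and it deliberately ignores `Deny` statements, `Resource` scoping, conditions and repository policies, so treat an empty result as a hint rather than proof of access.

```python
from fnmatch import fnmatchcase

# ECR actions needed to authenticate and pull an image.
REQUIRED_PULL_ACTIONS = (
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
)


def _as_list(value):
    """IAM allows a single string or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def missing_pull_actions(policy_document):
    """Return the required pull actions not covered by any Allow statement.

    Simplified: only looks at Effect and Action, with '*' wildcards.
    """
    allowed_patterns = []
    for statement in _as_list(policy_document.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        allowed_patterns.extend(_as_list(statement.get("Action", [])))
    return [
        action
        for action in REQUIRED_PULL_ACTIONS
        # IAM action names are case-insensitive, so compare lowercased.
        if not any(fnmatchcase(action.lower(), p.lower()) for p in allowed_patterns)
    ]


# Example policy that can authenticate but cannot download layers.
partial_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ecr:GetAuthorizationToken", "ecr:BatchGetImage"],
            "Resource": "*",
        }
    ],
}
missing = missing_pull_actions(partial_policy)
print(missing)
```

With the partial policy above, the two layer-related actions are reported as missing; a policy granting `ecr:*` would return an empty list.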
