Amazon-ecs-agent: Agent fails to start ("unable to get vpc id from instance metadata")

Created on 15 Feb 2019 · 15 Comments · Source: aws/amazon-ecs-agent

Summary

I am attempting to add container instances to an existing cluster. The instances never join the cluster. The ECS agent logs indicate a 404 when trying to fetch the VPC ID from the metadata service. However, these instances were not launched in a VPC and reside in EC2-Classic.

Description

I've tried the following AMIs:

  • amzn-ami-2018.03.m-amazon-ecs-optimized (ami-0796380bc6e51157f)
  • amzn2-ami-ecs-hvm-2.0.20190204-x86_64-ebs (ami-032564940f9afd5c0)

My /etc/ecs/ecs.config contains this:

ECS_CLUSTER=thunder
ECS_LOGLEVEL=debug

Expected Behavior

Agent starts, and instances join cluster.

Observed Behavior

Agent never starts; instances do not join cluster. Agent logs contain the following:

2019-02-14T23:29:46Z [INFO] Amazon ECS agent Version: 1.25.2, Commit: 0821fbc7
2019-02-14T23:29:46Z [DEBUG] Loaded config: Cluster: thunder,  Region: us-east-1,  DataDir: /data, Checkpoint: true, AuthType: , UpdatesEnabled: true, DisableMetrics: false, PollMetrics: false, PollingMetricsWaitDuration: 15s, ReservedMem: 0, TaskCleanupWaitDuration: 3h0m0s, DockerStopTimeout: 30s, ContainerStartTimeout: 3m0s, TaskCPUMemLimit: 3, , PauseContainerImageName: amazon/amazon-ecs-pause, PauseContainerTag: 0.1.0
2019-02-14T23:29:46Z [INFO] Creating root ecs cgroup: /ecs
2019-02-14T23:29:46Z [INFO] Creating cgroup /ecs
2019-02-14T23:29:46Z [INFO] Loading state! module="statemanager"
2019-02-14T23:29:46Z [INFO] Event stream ContainerChange start listening...
2019-02-14T23:29:46Z [CRITICAL] Unable to initialize Task ENI dependencies: unable to get vpc id from instance metadata: EC2MetadataError: failed to make EC2Metadata request
caused by: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

Environment Details

  • Instance is in EC2-Classic, with a public IP, so it is unclear why the agent is fetching a VPC ID.
  • IAM role is ecsInstanceRole with a single AWS-managed policy: AmazonEC2ContainerServiceforEC2Role
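To confirm from the instance itself that the metadata service exposes no VPC ID, a lookup along the following lines can help. This is a minimal sketch, not the agent's actual code: it assumes the unauthenticated (IMDSv1) metadata endpoint is reachable, and the helper names `imds_get` and `lookup_vpc_id` are hypothetical. A 404 is treated as "no value" rather than as a failure.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

# Standard EC2 instance metadata base URL (IMDSv1 style, no session token).
IMDS_BASE = "http://169.254.169.254/latest/meta-data"


def imds_get(path: str, timeout: float = 2.0) -> Optional[str]:
    """Return the metadata value at `path`, or None if the service answers 404."""
    try:
        with urllib.request.urlopen(f"{IMDS_BASE}/{path}", timeout=timeout) as resp:
            return resp.read().decode().strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # any other HTTP error is unexpected and should surface


def lookup_vpc_id(get: Callable[[str], Optional[str]] = imds_get) -> Optional[str]:
    """Return the primary interface's VPC ID, or None when there is none."""
    mac = get("mac")
    if mac is None:
        return None
    # On an EC2-Classic instance this key is expected to be absent (404).
    return get(f"network/interfaces/macs/{mac}/vpc-id")
```

Calling `lookup_vpc_id()` on the affected instance should return `None` if the 404 in the agent log comes from the `vpc-id` key. The `get` parameter exists only so the lookup can be exercised without network access.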

Supporting Log Snippets

See below (zip format to make GitHub happy)

Labels: kind/bug, pending release, scope/ECS Agent, workaround available

All 15 comments

ecs-logs-collector bundle: collect.zip

Can you also check that you have the trust relationship set up as described in this doc?
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html

@suneyz Verified:

{
  "Version": "2008-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Not sure if it's relevant here, but the 4 existing instances that are currently part of the cluster are running agent 1.14.1 and docker 1.12.6.

I took a look at the iptables log you provided. It looks like the iptables setup may be incomplete; specifically, you are missing the route to port 51679. Can you try setting up the iptables rules again by following these instructions?
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html

Note that you might want to follow the manual setup steps:
"To install the Amazon ECS container agent on a non-Amazon Linux EC2 instance" -> steps 5-7

I'm confused - these are the Amazon Linux ECS-optimized AMIs, so why do I need custom configuration? Have I selected the wrong AMI?

@kian You should not need any special setup.

@suneyz This sounds like a bug in the ECS agent. @kian is running in EC2-Classic, which means the instance is not running inside a VPC. The agent should tolerate the lack of a VPC ID and disable features that depend on it (like awsvpc network mode).

(note: nothing here depends on trust relationships or iptables rules)
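The graceful degradation described above can be sketched as a small decision function. This is an illustration only, under the assumption that the VPC ID lookup yields `None` on EC2-Classic; the names `TaskENIDecision` and `resolve_task_eni` are hypothetical and do not come from the agent's code base.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskENIDecision:
    enabled: bool
    reason: str


def resolve_task_eni(requested: bool, vpc_id: Optional[str]) -> TaskENIDecision:
    """Decide whether task ENI (awsvpc) support should be turned on.

    A missing VPC ID disables the feature instead of aborting startup.
    """
    if not requested:
        return TaskENIDecision(False, "disabled by configuration")
    if vpc_id is None:
        return TaskENIDecision(False, "no VPC ID in instance metadata")
    return TaskENIDecision(True, f"instance is in {vpc_id}")
```

With this shape, a missing VPC ID becomes a logged reason for skipping the feature rather than a critical error that stops the agent.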

Thanks guys - we're in the process of addressing the recent runc security hole (hence the need to roll out updates), so I'm happy to test any patches or provide more information.

@kian If this is blocking you, please note that you can upgrade Docker (and runc) from the Amazon Linux yum repositories without updating the ECS agent. You can do so by running sudo yum update docker. See ALAS-2019-1156 and the AWS security bulletin for more information.

Perfect, I'll give it a shot. Thanks for the tip.

Looks like docker has a dependency on ecs-init, so yum update docker upgraded both. The same error now occurs on the old 2016.09.g instance I upgraded in place.

2019-02-15T01:25:23Z [INFO] pre-start
2019-02-15T01:25:23Z [INFO] start
2019-02-15T01:25:23Z [INFO] Container name: /ecs-agent
2019-02-15T01:25:23Z [INFO] Removing existing agent container ID: b088183bd5bf538fbd545119416871da120a80d26c0d1d39c1f7dd180bc5e6c4
2019-02-15T01:25:23Z [INFO] Starting Amazon Elastic Container Service Agent
2019-02-15T01:25:24Z [INFO] Agent exited with code 5
2019-02-15T01:25:24Z [ERROR] agent exited with terminal exit code
2019-02-15T01:25:24Z [INFO] post-stop
2019-02-15T01:25:24Z [INFO] Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent

Sounds like I will need to hold off until the agent is fixed?

Hey @kian

I did a quick dive into the problem, and it looks like there's a bug in how the agent detects EC2-Classic instances. This really should degrade gracefully instead of failing like this -- and that's something we will need to fix on our end.

That said, you may be able to avoid this code path by adding the following to your ecs.config file:

ECS_ENABLE_TASK_ENI=false

@petderek nice, looks like that workaround did the trick. The agent starts up successfully, and the instance joins the cluster as expected.
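For anyone rolling the workaround out to several instances, the setting can be written idempotently with a small script. This is a rough sketch under the assumption that /etc/ecs/ecs.config holds plain KEY=value lines; the helper name `set_config_option` is hypothetical.

```python
from pathlib import Path


def set_config_option(path: str, key: str, value: str) -> None:
    """Set KEY=value in an ecs.config-style file, replacing any existing entry."""
    config = Path(path)
    lines = config.read_text().splitlines() if config.exists() else []
    # Drop earlier assignments of the same key so repeated runs do not duplicate it.
    kept = [ln for ln in lines if ln.split("=", 1)[0].strip() != key]
    kept.append(f"{key}={value}")
    config.write_text("\n".join(kept) + "\n")


if __name__ == "__main__":
    # Writing under /etc normally requires root.
    set_config_option("/etc/ecs/ecs.config", "ECS_ENABLE_TASK_ENI", "false")
```

The agent reads its configuration at startup, so it has to be restarted for the change to take effect.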

Closing; the fix was released in version v1.32.1.
