I am attempting to add container instances to an existing cluster. The instances never join the cluster. The ECS agent logs indicate a 404 when trying to fetch the VPC ID from the metadata service. However, these instances were not launched in a VPC and reside in EC2-Classic.
I've tried the following AMIs:
My /etc/ecs/ecs.config contains this:
ECS_CLUSTER=thunder
ECS_LOGLEVEL=debug
Agent starts, and instances join cluster.
Agent never starts; instances do not join cluster. Agent logs contain the following:
2019-02-14T23:29:46Z [INFO] Amazon ECS agent Version: 1.25.2, Commit: 0821fbc7
2019-02-14T23:29:46Z [DEBUG] Loaded config: Cluster: thunder, Region: us-east-1, DataDir: /data, Checkpoint: true, AuthType: , UpdatesEnabled: true, DisableMetrics: false, PollMetrics: false, PollingMetricsWaitDuration: 15s, ReservedMem: 0, TaskCleanupWaitDuration: 3h0m0s, DockerStopTimeout: 30s, ContainerStartTimeout: 3m0s, TaskCPUMemLimit: 3, , PauseContainerImageName: amazon/amazon-ecs-pause, PauseContainerTag: 0.1.0
2019-02-14T23:29:46Z [INFO] Creating root ecs cgroup: /ecs
2019-02-14T23:29:46Z [INFO] Creating cgroup /ecs
2019-02-14T23:29:46Z [INFO] Loading state! module="statemanager"
2019-02-14T23:29:46Z [INFO] Event stream ContainerChange start listening...
2019-02-14T23:29:46Z [CRITICAL] Unable to initialize Task ENI dependencies: unable to get vpc id from instance metadata: EC2MetadataError: failed to make EC2Metadata request
caused by: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>404 - Not Found</title>
</head>
<body>
<h1>404 - Not Found</h1>
</body>
</html>
ecsInstanceRole with a single AWS-managed policy: AmazonEC2ContainerServiceforEC2Role
See below (zip format to make GitHub happy)
ecs-logs-collector bundle: collect.zip
Can you also check that the trust relationship is set up as described in this doc?
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
@suneyz Verified:
{
"Version": "2008-10-17",
"Statement": [
{
"Sid": "",
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
Not sure if it's relevant here, but the 4 existing instances that are currently part of the cluster are running agent 1.14.1 and docker 1.12.6.
I took a look at the iptables log you provided. It looks like the iptables setup may be missing; specifically, the route to port 51679 is not there. Can you try setting up the iptables rules again by following these instructions?
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-install.html
Note that you might want to follow the manual setup steps:
To install the Amazon ECS container agent on a non-Amazon Linux EC2 instance -> steps 5-7
I'm confused - these are the Amazon Linux ECS-optimized AMIs, so why do I need custom configuration? Have I selected the wrong AMI?
@kian You should not need any special setup.
@suneyz This sounds like a bug in the ECS agent. @kian is running in EC2-Classic, which means the instance is not running inside a VPC. The agent should tolerate the lack of a VPC ID and disable features that depend on it (like awsvpc network mode).
(note: nothing here depends on trust relationships or iptables rules)
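A rough way to check this diagnosis from the instance itself is to ask the metadata service for the VPC ID of the primary network interface and look at the HTTP status. This is only a sketch: it assumes the metadata service is reachable over IMDSv1 at the standard link-local address, and the function names `classify_vpc_lookup` and `check_vpc_id` are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes IMDSv1 at the standard link-local address.
IMDS="http://169.254.169.254/latest/meta-data"

# Map the HTTP status of the vpc-id lookup to a rough diagnosis.
# (Hypothetical helper; the labels are illustrative.)
classify_vpc_lookup() {
    case "$1" in
        200) echo "vpc" ;;        # metadata exposes a VPC ID
        404) echo "no-vpc-id" ;;  # consistent with EC2-Classic
        *)   echo "unknown" ;;    # anything else needs a closer look
    esac
}

# Look up the primary MAC, request its vpc-id, and print the diagnosis.
check_vpc_id() {
    mac=$(curl -s --max-time 2 "$IMDS/mac") || return 1
    code=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
        "$IMDS/network/interfaces/macs/$mac/vpc-id") || return 1
    classify_vpc_lookup "$code"
}
```

Running `check_vpc_id` on an affected instance should print `no-vpc-id` if the lookup returns the same 404 that appears in the agent log.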
Thanks guys - we're in the process of addressing the recent runc security hole (hence the need to roll out updates), so I'm happy to test any patches or provide more information.
@kian If this is blocking you, please note that you can upgrade Docker (and runc) from the Amazon Linux yum repositories without updating the ECS agent. You can do so by running sudo yum update docker. See ALAS-2019-1156 and the AWS security bulletin for more information.
perfect, I'll give it a shot. thanks for the tip.
Looks like docker has a dependency on ecs-init, so yum update docker upgraded both. The same error is now happening on the old 2016.09.g instance I upgraded in place.
2019-02-15T01:25:23Z [INFO] pre-start
2019-02-15T01:25:23Z [INFO] start
2019-02-15T01:25:23Z [INFO] Container name: /ecs-agent
2019-02-15T01:25:23Z [INFO] Removing existing agent container ID: b088183bd5bf538fbd545119416871da120a80d26c0d1d39c1f7dd180bc5e6c4
2019-02-15T01:25:23Z [INFO] Starting Amazon Elastic Container Service Agent
2019-02-15T01:25:24Z [INFO] Agent exited with code 5
2019-02-15T01:25:24Z [ERROR] agent exited with terminal exit code
2019-02-15T01:25:24Z [INFO] post-stop
2019-02-15T01:25:24Z [INFO] Cleaning up the credentials endpoint setup for Amazon Elastic Container Service Agent
sounds like I will need to hold off until the agent is fixed?
Hey @kian
I did a quick dive into the problem and it looks like there's a bug in how the agent detects EC2-Classic instances. This really should degrade gracefully instead of failing like this -- and that's something we will need to fix on our end.
That said, you may be able to avoid this code path by adding the following to your ecs.config file:
ECS_ENABLE_TASK_ENI=false
@petderek nice, looks like that workaround did the trick. The agent starts up successfully, and the instance joins the cluster as expected.
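For anyone applying the ECS_ENABLE_TASK_ENI=false workaround to several instances, here is a minimal sketch of a helper that sets the option idempotently. The function name `set_ecs_option` is hypothetical, and the sketch assumes a plain KEY=VALUE file such as /etc/ecs/ecs.config with values that do not contain the `|` character.

```shell
#!/bin/sh
# Sketch only: set KEY=VALUE in a KEY=VALUE-style config file,
# replacing an existing KEY line instead of adding a duplicate.
# (Hypothetical helper; assumes VALUE contains no '|'.)
set_ecs_option() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    if grep -q "^${key}=" "$file"; then
        # Rewrite the existing line through a temp file (avoids sed -i).
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" \
            && mv "$file.tmp" "$file"
    else
        # Add a trailing newline first if the file lacks one.
        if [ -n "$(tail -c 1 "$file")" ]; then
            echo >> "$file"
        fi
        echo "${key}=${value}" >> "$file"
    fi
}

# Example (run as root on the instance):
#   set_ecs_option /etc/ecs/ecs.config ECS_ENABLE_TASK_ENI false
```

The agent reads ecs.config at startup, so it has to be restarted after the change. On Amazon Linux 1 that is typically `sudo stop ecs` followed by `sudo start ecs`.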
Closing; the fix was released in version v1.32.1.