Amazon-ecs-agent: Containers stall at PENDING Stage

Created on 18 Aug 2016 · 10Comments · Source: aws/amazon-ecs-agent

Hi, I'm running into errors similar to issues #250 and #323 that were closed with fixes in > v1.8.1. I'm using an ECS AMI with v1.12.0 and getting the following from logs:

2016-08-18T15:55:48Z [INFO] No handler for message type: CloseMessage
2016-08-18T15:55:48Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-08-18T15:55:48Z [INFO] Creating poll dialer, host: ecs-a-1.us-west-2.amazonaws.com
2016-08-18T15:57:49Z [INFO] No handler for message type: CloseMessage
2016-08-18T15:57:49Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-08-18T15:57:50Z [INFO] Creating poll dialer, host: ecs-a-1.us-west-2.amazonaws.com
2016-08-18T15:59:51Z [INFO] No handler for message type: CloseMessage
2016-08-18T15:59:51Z [INFO] Error from acs; backing off, err: unexpected EOF
2016-08-18T15:59:52Z [INFO] Creating poll dialer, host: ecs-a-1.us-west-2.amazonaws.com
2016-08-18T15:59:56Z [INFO] Creating poll dialer, host: ecs-t-1.us-west-2.amazonaws.com
2016-08-18T15:59:57Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:395a8365d28c041def89fda0745940300d9b985b0b3a2a3bc6444b5bfcfb1859}"
2016-08-18T15:59:57Z [WARN] Error getting cpu stats module="stats" err="No data in the queue" container="&{DockerID:df8a1e7c486e0d408e10847b6db92a7bbe97359b6345bb23ff4608ef9dd94f74}"
2016-08-18T15:59:57Z [WARN] Error getting instance metrics: No task metrics to report

more info needed

Source

stepharr

All 10 comments

@theonestep4 Can you provide the ARNs of tasks that are stalled in PENDING as well as full logs from the instance? If you're uncomfortable posting this publicly, you can send these details to me by email at skarp (at) amazon.com.

samuelkarp on 19 Aug 2016

@theonestep4 Thank you for providing an ARN and logs by email. I took a look at the logs you sent and at our server logs and it appears that the server is attempting to send the task down to your agent, but the agent is failing to acknowledge the message within the timeout. Can you restart the agent with the log level set to debug (set ECS_LOGLEVEL=debug and sudo stop ecs; sudo start ecs) and the logs that generates?

samuelkarp on 22 Aug 2016

I got this issue too (PENDING Containers), not sure if related. One of my ecs agents died because out of disk, for some reason the logs are not getting rotated like other logs are. Took up ~8GB so far mostly with lines like

/var/log/ecs/ecs-agent.log.:

2016-08-21T15:00:00Z [WARN] Error retrieving stats for container 4f310a0289cd36343549b7aae6fca6682e68e06e4ccd2b49213536e0c7666062: dial unix /var/run/docker.sock: socket: too many open files

/var/log/docker:

time="2016-08-21T20:16:09.865420994Z" level=error msg="Failed to log msg \"2016-08-21T19:16:21Z [WARN] Error retrieving stats for container 93929447ea23a64b245bfe85da44801edf6b14ff2ec55f7bda8b683162a86769: dial unix /var/run/docker.sock: socket: too many open files\" for logger json-file: write /var/lib/docker/containers/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a-json.log: bad file descriptor"
time="2016-08-21T20:16:09.865442929Z" level=error msg="Failed to log msg \"2016-08-21T19:16:21Z [WARN] Error retrieving stats for container 53d9fea9be91570bd9156eab42bd63493cab4a76dbbcba3735282679ad9f500d: dial unix /var/run/docker.sock: socket: too many open files\" for logger json-file: write /var/lib/docker/containers/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a-json.log: bad file descriptor"
time="2016-08-21T20:16:09.865463086Z" level=error msg="Failed to log msg \"2016-08-21T19:16:21Z [WARN] Error retrieving stats for container 4f310a0289cd36343549b7aae6fca6682e68e06e4ccd2b49213536e0c7666062: dial unix /var/run/docker.sock: socket: too many open files\" for logger json-file: write /var/lib/docker/containers/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a/43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a-json.log: bad file descriptor"

MaerF0x0 on 22 Aug 2016

@MaerF0x0 That looks like https://github.com/aws/amazon-ecs-agent/issues/488. We'll be releasing a 1.12.1 agent that should address that problem.

@theonestep4 I took a look at the new logs you sent (thank you!) and it looks like the task in question successfully progressed out of PENDING and other tasks are now moving normally through the lifecycle. Although I didn't see any lines like @MaerF0x0's in the log files you provided, the initial timestamps that you reported were already rotated out and those WARN lines may have been occurring at that point in time. Since it looks like things are working right now, it's probably a good idea to upgrade to 1.12.1 when we release it and see if the problem recurs.

samuelkarp on 22 Aug 2016

👍1

@MaerF0x0 @theonestep4 Can you try 1.12.1?

samuelkarp on 25 Aug 2016

I tried updating the agent via the web ui and it didnt finish after about an hour. The agents were dead before it started and afaik new agent containers never started. I had to deploy stuff so I ended up just adding new instances to the cluster and deregistering the borked ones to move the loads over. Sadly the new AMI wasnt showing in the marketplace (apparently you have to go straight from the docs here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html

MaerF0x0 on 25 Aug 2016

@MaerF0x0 Thank you for updating. I apologize, I should have mentioned that if you were experiencing that problem it's likely that updating through the console would have problems. I'm following up with our Marketplace team to get the new AMI listed there.

samuelkarp on 25 Aug 2016

We are experiences severe problems starting tasks and updating services.

The behaviour is really flaky, and I could not find a reliable reproduction. But it definitely happens after a certain lifetime of the ec2 container instance.

After we replace the whole ec2 container cluster, everything behaves quite well. But after some time, the agent does not properly start new tasks. Tasks stall in PENDING state.

What can I do to help debug this problem?

For example Task "f997a595-ed15-4e5b-8bf5-da5adf01cb97" just waits in PENDING.
When I go checkout the ec2 instance, it has not even started a docker process.

Tell me what I can do to help debug this issue. Currently it is halting all our deployments and thus endangers our live system.

I am running on us-east-1 with the ECS optimized AMI ami-3d55272a with agent version 1.12.2 and docker version 1.11.2.

Overbryd on 21 Sep 2016

@Overbryd I checked the volume statistics for the instance on which the task f997a595-ed15-4e5b-8bf5-da5adf01cb97 was scheduled to run. I can see that the IO throughput has dropped on this volume and the latencies have gone up. It seems like this gp2 volume has run out of IO credits and is unable to perform any IO. This, and your comment

But it definitely happens after a certain lifetime of the ec2 container instance.

leads me to believe that you're exhausting IO credits on the second 22 GiB volume attached to the instance. You could fix this issue by either choosing a bigger gp2 volume so that you speed up the rate of accumulating credits or moving to a provisioned iops volume. More information the same can be found here.

Since we have resolved the issue reported originally in this thread in the latest release of the ECS Agent, i'm closing this thread so that we do not conflate different issues. I have also created a new issue for behavior that you reported: #537. We can continue the discussion there.

aaithal on 21 Sep 2016

👍1

@Overbryd I have often found it useful to both pull and docker run the image on my development machine as well to do the same on an ECS host to debug issues with startup. Sometimes that helps. Seems @aaithal may have found your issue tho.

MaerF0x0 on 21 Sep 2016

👍1

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Can not acquire network metric in EC 2/Bridge mode

hayajo · 3Comments

Determine a proper timeout for LoadImage

aaithal · 3Comments

devicemapper leaking

cjbottaro · 4Comments

Service:AmazonECS, Code:ClientException, Message:Actual length: '34432'. Max allowed length is '32768' bytes., Class:com.amazonaws.services.ecs.model.ClientException

devotox · 3Comments

Option to select multiple log-configuration

soumyasmruti · 5Comments