Amazon-ecs-agent: Errors retrieving container stats

Created on 30 Aug 2016  路  12Comments  路  Source: aws/amazon-ecs-agent

(2016.03.h AMI)

Last night we got many ECS agent logs like:

2016-08-30T08:41:56Z [WARN] Error retrieving stats for container fade86e2623dc76d877578e73667d75c65b78cf5a137022ba0ceaa835e1bfc0b: io: read/write on closed pipe

and, a little while later:

2016-08-30T08:51:31Z [WARN] Error retrieving stats for container 7da93fc2affede5f2f2dedd591001802cd35253bacb56d56d463a7a8c93e7756: inactivity time exceeded timeout

When I logged on this morning and tried to inspect the instance, the Docker daemon was unresponsive (e.g. to docker ps). The Docker logs were flooding:

time="2016-08-30T14:42:47.628002008Z" level=error msg="collecting stats for 348c1509a294cf3b140467c06af0880b126941b78c34161cc61ad5e3da7e844a: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.628199094Z" level=error msg="collecting stats for 5f1f4d523eddec6a892f69ae98dd5f35a67accbcb80a381b38b8904e73feb217: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.628390068Z" level=error msg="collecting stats for aeb2638531a9a939ee87eb5489f0b4524ca95a73efe5ac7d0f7e96d350b35270: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.628597549Z" level=error msg="collecting stats for 379322c6cca9bf644a201147ed366a04f859db07754e304c3620114073942d64: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.628798425Z" level=error msg="collecting stats for 08ef510913ceee4d14778141af1e6a1d6f84eb22cbec324f321414b4e0059075: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.628990198Z" level=error msg="collecting stats for dfdf5b419a7c8a2e503f13ac43131223cf84dc4dd87aa518aea79fd0b25bb33d: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.629199128Z" level=error msg="collecting stats for f35c92140b27cc74e609e5e73e9f564d1347e6432feabb4c979281b611f0ca54: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.629391177Z" level=error msg="collecting stats for ed6914c31d1bcb2cf852b224531abf11cfc9c89d80dfb52948918d76fcbdfb1c: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.629627844Z" level=error msg="collecting stats for 4afceb4e32926b4d95487ccb0555d73296b5f4682b4ceb5536183a0ae31f8622: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.629820202Z" level=error msg="collecting stats for 5250680e6a2db828a27acd33377dc366e9156833098e4057c326300d25d06e43: rpc error: code = 2 desc = \"containerd: container not found\"" 
time="2016-08-30T14:42:47.630019263Z" level=error msg="collecting stats for 7da93fc2affede5f2f2dedd591001802cd35253bacb56d56d463a7a8c93e7756: rpc error: code = 2 desc = \"containerd: container not found\""

(note the timestamps: this is a tiny sample of even that one second) and the ECS agent logs were again showing the "inactivity time exceeded timeout" errors.

kinbug

Most helpful comment

We have released the v1.12.2 version of the ECS Agent today, which should fix this issue. Please let us know if you continue running into this. Closing this issue for now. Feel free to reach out to us if you have any questions/comments/feedback.

Thanks,
Anirudh

All 12 comments

@jbergknoff Do you have the log lines (especially form the Docker daemon) from immediately before those errors started flooding? I'm suspecting something went a bit sideways with containerd and hoping that the immediately prior lines would give some insight into why.

Sorry, I don't. I threw out the instances shortly after I filed this issue. If this happens again, I'll try to grab that information.

Hi guys, I faced the this problem of latest ECS optimized AMI 2016.03.h. I noticed that it starts when there's a lot of unix socket connections opened to /var/run/docker.sock (I check it by command netstat -x and export metrics to datadog). Usually there's about 8 opened connections, when ECS agent starts to fail there's 90-100 connections. After restarting docker service and ecs agent, it starts working again.

@samuelkarp There's nothing suspicious in /var/log/docker that could show the reason of the problem. Please see https://gist.github.com/anosulchik/49b4d5203a86641bebd10e609488aae3

Guys, let me know if you need some further input for troubleshooting.

@jbergknoff: I was able to repro 'io closed pipe' error when a container is forcefully removed (docker rm -f). Is this something that could have happened on your instance? It looks like one can ignore ErrClosedPipe err from docker-stats stream, because agent tries to reconnect to docker-stats. Regarding the 'inactivity exceeded timeout' one scenario I am able to repro is when docker-stats is not working or too slow, for e.g.

time="2016-09-01T16:18:07.461053881Z" level=error msg="collecting stats for 29836ba21d8b6e0a97b368e48a0f161e9f8c5d220840bf724e09202d162ba27a: failed to retrieve the statistics for eth0 in netns /var/run/docker/netns/2a7166109f7d: failure opening /proc/net/dev: fork/exec /bin/cat: cannot allocate memory"

docker stats returns

CONTAINER           CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O             PIDS
29836ba21d8b        0.00%               0 B / 0 B             0.00%               0 B / 0 B           0 B / 0 B             0

It seems like some issue with the instance (cpu/memory) or docker-daemon can have adverse effect on metrics reporting. My repro included a stress container to hog memory on the host. I am sure there are other ways to get into this situation. So in essence a single bad container can DOS other containers from getting metrics information. Is this something that may be happening in your case?

@anosulchik: Can you describe, what you meant by when ECS agent starts to fail? and any repro steps. Seems like a connection leak to me. Would like to understand more about your setup before diving into it. Thanks

-kiran

@kiranmeduri thank you for investigating. In the situation I reported above, there hadn't been any containers forcefully removed. All container orchestration was via the ECS agent.

It is likely that one or more containers in our environment is hogging CPU (memory is less likely), and effectively DoSing other containers or the Docker daemon. Do you have any recommendations for avoiding that situation? For instance, if ECS exposed Docker's CPU quota option, to put an upper bound on a container's CPU usage, that may work (#369). Maybe there are other approaches that are feasible right now.

@kiranmeduri Unfortunately there's no clear reproduction steps which could be taken to hit the problem. It happens after 10-24 hours after new EC2 instance runs containers orchestrated by ECS agent. The only change that we did was upgrading AMI from ami-cb2305a1 to latest ami-6bb2d67c. Please see our task definition here https://gist.github.com/anosulchik/29312f8b30fa6f88b9c997281d32d822

Coming to _when ECS agent starts to fail_. It means that ECS agent starts producing connections to unix socket /var/run/docker.sock -- we noticed correlation between number of opened connections and ECS agent hang. As soon as number of connections starts to grow ECS agent becomes unresponsive to ECS service -- it can't start tasks anymore so they hangs in PENDING state (if there's no need to start new tasks, existing ones remains running without issues). We also see logs errors mentioned @jbergknoff in ECS agent's log. One more observation docker ps hangs forever until I restart it (it hangs again if I start ECS agent).

Here's stack dump of ECS agent that hanged: https://gist.github.com/anosulchik/1b62b3cc125d77bcf2289270fdbd1bed

And its logs: https://gist.github.com/anosulchik/36f0baca5b6301bd0c3863505dda3c11

+1, had same problem.

Both dockerd's log in /var/log/docker and the ecs-agent container log in /var/lib/docker/containers grew to 3.5gb. Didnt notice until the disk was full and ecs actions were unresponsive.

A quick grep test shows that 99.4% of the container log contain lines like:

{"log":"2016-09-09T22:11:52Z [WARN] Error retrieving stats for container 5c3a7671a85c777cc2ed94c3bd3d6b540579ec208807bf437ed143acd05ae41d: No such container: 5c3a7671a85c777cc2ed94c3bd3d6b540579ec208807bf437ed143acd05ae41d\n","stream":"stdout","time":"2016-09-09T22:11:52.007353052Z"}

and 99.x% of the dockerd log with the corresponding

time="2016-09-09T22:18:44.085156456Z" level=error msg="Handler for GET /v1.17/containers/d28cf8b88010bd03f604633fa42d476368a769729ff06711970ea8b4105aa869/stats returned error: No such container: d28cf8b88010bd03f604633fa42d476368a769729ff06711970ea8b4105aa869"

I've been running 2 services across 2 t2.smalls with the latest ecs-optimized-linux ami-6bb2d67c. One service stayed up and the other died.

In my case I definitely go onto the node instances and run client commands from there. I've run my own containers there manually and stopped/rm'ed a container or two. Usually ones that seem stuck, either old ECS service tasks permanently in "STOPPING" or containers that are running but that dont appear as tasks in the api/console.

@clumsysnake @anosulchik @jbergknoff Thanks for providing these details. We found an issue with our stats collector, where if it misses the 'container stopped' event, it continues to spin trying to collect metrics for the stopped container, which would result the log line filling up the disk and the increased CPU usage.

We just merged the fix for this into the dev branch and are working on getting this released as well (#525). We'll update the thread when we have more updates for this.

Meanwhile, if you want to try out the fix from the dev branch, please share your AWS account IDs by sending an email to aithal at amazon dot com and I can share the preview build via an EC2 Container Registry repository.

Thanks,
Anirudh

We have released the v1.12.2 version of the ECS Agent today, which should fix this issue. Please let us know if you continue running into this. Closing this issue for now. Feel free to reach out to us if you have any questions/comments/feedback.

Thanks,
Anirudh

Thanks @aaithal I'm going to test new version and will ask to reopen this issue if necessary.

@aaithal thanks for the fast response. This problem just re-occurred. Pulled down the :latest image just now and will let you know if it happens again.

FWIW we upgraded our cluster to agent v1.12.2 yesterday afternoon and haven't seen io: read/write on closed pipe since then. It was happening (at least) a few times per hour prior to the upgrade. Thanks for the quick work!

Was this page helpful?
0 / 5 - 0 ratings