Amazon-ecs-agent: ECS Agent 1.36.0 becomes unhealthy, resulting in tasks stuck in pending state

Created on 14 Jan 2020 · 5 Comments · Source: aws/amazon-ecs-agent

Summary

Unhealthy ECS Agent 1.36.0 - seelog "too many open files" error results in instance no longer scheduling tasks

Description

This issue affects ecs-agent 1.36.0 and does not occur (at least in our environment) with 1.35.0.

After starting an instance and running for a while (varies based on container scheduling), the ecs-agent on the instance will become unhealthy. This happens consistently with approximately 100 stable containers running on the instance and approximately 5 unstable containers that die and are rescheduled regularly.

When this happens, the ecs-agent docker logs begin showing errors of this nature:

seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
level=error time=2020-01-14T19:15:45Z msg="unable to setup cgroup root: cgroup resource [arn:aws:ecs:us-east-1:767904627276:task/distro/5fe0f75c96db4e9693f5e05586658d27]: setup cgroup: unable to create cgroup at /ecs/distro/5fe0f75c96db4e9693f5e05586658d27: cgroup create: unable to create controller: open /proc/self/mountinfo: too many open files" cgroupMountPath=/sys/fs/cgroup cgroupRoot=/ecs/distro/5fe0f75c96db4e9693f5e05586658d27 module=cgroup.go resourceName=cgroup taskARN=arn:aws:ecs:us-east-1:767904627276:task/distro/5fe0f75c96db4e9693f5e05586658d27
level=error time=2020-01-14T19:15:45Z msg="unable to setup cgroup root: cgroup resource [arn:aws:ecs:us-east-1:767904627276:task/distro/82178e3170364d918149c0f84576604e]: setup cgroup: unable to create cgroup at /ecs/distro/82178e3170364d918149c0f84576604e: cgroup create: unable to create controller: open /proc/self/mountinfo: too many open files" cgroupMountPath=/sys/fs/cgroup cgroupRoot=/ecs/distro/82178e3170364d918149c0f84576604e module=cgroup.go resourceName=cgroup taskARN=arn:aws:ecs:us-east-1:767904627276:task/distro/82178e3170364d918149c0f84576604e

At this point, any new containers scheduled to this instance become stuck in the "pending" state and never transition to "running".

Expected Behavior

As is the case for us with 1.35.0, I expect the ecs-agent to remain healthy and for containers to not become stuck in the "pending" state.

Observed Behavior

The ecs-agent becomes unhealthy and includes seelog errors in the docker logs output. Containers on the affected instance become stuck in a "pending" state.

Environment Details

$ docker info
Containers: 286
 Running: 141
 Paused: 0
 Stopped: 145
Images: 9
Server Version: 18.09.9-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host ipvlan macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.158-129.185.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 124.8GiB
Name: ip-10-224-189-226.ec2.internal
ID: 5SOB:I63D:75AH:QTFV:7WQU:XISD:LV5Z:UHLG:A4NK:BCK3:JYNW:FLUK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
{"Cluster":"distro","ContainerInstanceArn":"arn:aws:ecs:us-east-1:767904627276:container-instance/distro/85ea661fe4394a65b244ac78c4c057dc","Version":"Amazon ECS Agent - v1.36.0 (6cacbceb)"}
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         63G     0   63G   0% /dev
tmpfs            63G  4.0K   63G   1% /dev/shm
tmpfs            63G  4.6M   63G   1% /run
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/nvme0n1p1   30G  2.7G   27G  10% /
/dev/nvme1n1    504G  3.4G  475G   1% /var/lib/docker
tmpfs            13G     0   13G   0% /run/user/13039
tmpfs            13G     0   13G   0% /run/user/0
$ cat /proc/sys/fs/file-max
13067984
$ ulimit -Hn
500000
$ ulimit -Sn
500000
kind/bug

All 5 comments

Hi,
Thanks for reporting the issue. Could you run lsof on the instance to let us take a look at the output, so we can verify whether the too many open files error is caused by the agent? You can send the output to ecs-agent-external AT amazon.com if the output is very large. Thanks.
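When the full lsof output is too large to share, a rough first check is to compare how many file descriptors a single process holds against the limit that applies to that process. The following is a minimal sketch, assuming a Linux host with /proc mounted. The function names are hypothetical helpers, not part of the ECS agent or of lsof. Note that `ulimit -n` reports the limit of the shell it runs in, which is not necessarily the limit applied to the agent process.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of the ECS agent or lsof.

# Print the number of file descriptors currently open by a process.
fd_count() {
    pid="$1"
    # Each entry in /proc/<pid>/fd corresponds to one open descriptor.
    ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Print the soft "Max open files" limit that applies to a process.
fd_soft_limit() {
    pid="$1"
    # Columns in /proc/<pid>/limits: limit name, soft limit, hard limit, units.
    awk '/^Max open files/ { print $4 }' "/proc/$pid/limits"
}

# Example usage (assumption: the agent runs in a container named "ecs-agent"):
#   pid=$(docker inspect --format '{{.State.Pid}}' ecs-agent)
#   echo "open: $(fd_count "$pid")  soft limit: $(fd_soft_limit "$pid")"
```

If the open count keeps climbing toward the soft limit while the set of running tasks stays roughly constant, that would be consistent with a descriptor leak in that process. Reading another user's /proc entries typically requires root.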

Hi,
We have reproduced this issue and we are actively working on fixing it.

One possible workaround is to use an ECS-optimized AMI running Agent v1.35.0 until this issue is fixed. See here.

We have released Agent v1.36.1 with a fix for this issue.

I can confirm, this appears to be fixed. Thanks for the quick turnaround!
