Unhealthy ECS Agent 1.36.0 - seelog "too many open files" error results in instance no longer scheduling tasks
This issue affects ecs-agent 1.36.0 and does not occur (at least in our environment) with 1.35.0.
After an instance has been running for a while (how long varies with container scheduling), the ecs-agent on the instance becomes unhealthy. This happens consistently with approximately 100 stable containers running on the instance and approximately 5 unstable containers that die and are rescheduled regularly.
When this happens, the ecs-agent docker logs begin showing errors of this nature:
seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
seelog internal error: open /log/ecs-agent.log: too many open files
level=error time=2020-01-14T19:15:45Z msg="unable to setup cgroup root: cgroup resource [arn:aws:ecs:us-east-1:767904627276:task/distro/5fe0f75c96db4e9693f5e05586658d27]: setup cgroup: unable to create cgroup at /ecs/distro/5fe0f75c96db4e9693f5e05586658d27: cgroup create: unable to create controller: open /proc/self/mountinfo: too many open files" cgroupMountPath=/sys/fs/cgroup cgroupRoot=/ecs/distro/5fe0f75c96db4e9693f5e05586658d27 module=cgroup.go resourceName=cgroup taskARN=arn:aws:ecs:us-east-1:767904627276:task/distro/5fe0f75c96db4e9693f5e05586658d27
level=error time=2020-01-14T19:15:45Z msg="unable to setup cgroup root: cgroup resource [arn:aws:ecs:us-east-1:767904627276:task/distro/82178e3170364d918149c0f84576604e]: setup cgroup: unable to create cgroup at /ecs/distro/82178e3170364d918149c0f84576604e: cgroup create: unable to create controller: open /proc/self/mountinfo: too many open files" cgroupMountPath=/sys/fs/cgroup cgroupRoot=/ecs/distro/82178e3170364d918149c0f84576604e module=cgroup.go resourceName=cgroup taskARN=arn:aws:ecs:us-east-1:767904627276:task/distro/82178e3170364d918149c0f84576604e
At this point, any new containers scheduled to this instance become stuck in the "pending" state and never transition to "running".
I expect the ecs-agent to remain healthy and containers not to become stuck in the "pending" state, as is the case for us with 1.35.0.
The ecs-agent becomes unhealthy and includes seelog errors in the docker logs output. Containers on the affected instance become stuck in a "pending" state.
$ docker info
Containers: 286
Running: 141
Paused: 0
Stopped: 145
Images: 9
Server Version: 18.09.9-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.158-129.185.amzn2.x86_64
Operating System: Amazon Linux 2
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 124.8GiB
Name: ip-10-224-189-226.ec2.internal
ID: 5SOB:I63D:75AH:QTFV:7WQU:XISD:LV5Z:UHLG:A4NK:BCK3:JYNW:FLUK
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
$ curl http://localhost:51678/v1/metadata
{"Cluster":"distro","ContainerInstanceArn":"arn:aws:ecs:us-east-1:767904627276:container-instance/distro/85ea661fe4394a65b244ac78c4c057dc","Version":"Amazon ECS Agent - v1.36.0 (6cacbceb)"}
$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 63G 0 63G 0% /dev
tmpfs 63G 4.0K 63G 1% /dev/shm
tmpfs 63G 4.6M 63G 1% /run
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/nvme0n1p1 30G 2.7G 27G 10% /
/dev/nvme1n1 504G 3.4G 475G 1% /var/lib/docker
tmpfs 13G 0 13G 0% /run/user/13039
tmpfs 13G 0 13G 0% /run/user/0
$ cat /proc/sys/fs/file-max
13067984
$ ulimit -Hn
500000
$ ulimit -Sn
500000
Hi,
Thanks for reporting the issue. Could you run lsof on the instance to let us take a look at the output, so we can verify whether the too many open files error is caused by the agent? You can send the output to ecs-agent-external AT amazon.com if the output is very large. Thanks.
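If the full lsof output is too large to share, a per-process descriptor count may be a useful first check. The following is only a sketch, not an official diagnostic: the helper names `count_fds` and `fd_limits` are made up for illustration, and the usage comments assume the agent container is named `ecs-agent`. Note that `ulimit -n` in a login shell reports that shell's limits, which are not necessarily the limits applied to the agent process.

```shell
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
# Prints 0 if the directory cannot be read (for example, without root).
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft and hard "Max open files" limits that apply to a process,
# as recorded by the kernel in /proc/<pid>/limits.
fd_limits() {
    awk '/^Max open files/ {print $4, $5}' "/proc/$1/limits"
}

# Example usage (run as root on the container instance).
# Assumption: the agent container is named "ecs-agent".
#   pid=$(docker inspect -f '{{.State.Pid}}' ecs-agent)
#   echo "open fds: $(count_fds "$pid")"
#   echo "soft/hard limit: $(fd_limits "$pid")"
```

Running the count a few times as tasks stop and restart should show whether the number keeps climbing toward the soft limit, which would point to a descriptor leak in the agent rather than a limit that is simply set too low.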
Hi,
We have reproduced this issue and we are actively working on fixing it.
One possible workaround is to use an ECS-optimized AMI running Agent v1.35.0 until this issue is fixed. See here.
We have released Agent v1.36.1 with a fix for this issue.
I can confirm, this appears to be fixed. Thanks for the quick turnaround!