Yesterday we upgraded our cluster from amzn-ami-2016.03.c-amazon-ecs-optimized to the latest, amzn-ami-2016.03.g-amazon-ecs-optimized. At some point overnight, two of the instances in our cluster (out of ~6 in the ASG) began flooding the logs with messages like these (hundreds per second):
Aug 17 07:21:17 Seelog error: open /log/ecs-agent.log.2016-08-17-14: too many open files
Aug 17 07:21:17 2016-08-17T14:21:17Z [WARN] Error retrieving stats for container bcbd3d6d2a51f656ec2066e62296010f5432262e1a564678325a60f3e642a575: dial unix /var/run/docker.sock: socket: too many open files
The two instances terminated without human interaction (not sure if that's a coincidence of auto-scaling). Near the end, these logs also appeared:
Aug 17 07:21:17 2016-08-17T14:21:17Z [CRITICAL] Error saving state before final shutdown module="TerminationHandler" err="Multiple error:
Aug 17 07:21:17 0: Timed out waiting for TaskEngine to settle
Aug 17 07:21:17 1: Timed out trying to save to disk"
We haven't experienced this on previous AMIs.
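For anyone trying to confirm whether an instance is heading toward the same "too many open files" state, one rough check is to count a process's open file descriptors and compare the count against its limit. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/fd` is readable. The function names, the `proc_root` parameter, and the 90% threshold are hypothetical choices for illustration, not part of the ECS agent.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of file descriptors currently open in process `pid`.

    On Linux, each entry under <proc_root>/<pid>/fd is one open descriptor.
    `proc_root` is a parameter only so the function can be tested against a
    fake directory tree.
    """
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def near_fd_limit(open_fds, soft_limit, threshold=0.9):
    """True when usage has reached `threshold` (a fraction) of the soft limit."""
    return open_fds >= soft_limit * threshold


if __name__ == "__main__":
    # This inspects the script's own process as a stand-in. Reading another
    # process's fd directory (for example the agent's) usually requires root.
    print(count_open_fds(os.getpid()))
```

Sampling the count every few minutes and watching for steady growth would likely reveal a descriptor leak well before the limit is reached.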
I think this is a duplicate of #478.
@jbergknoff Thanks for opening this. I've taken a look and I think this is actually a separate problem from #478.
I'm experiencing the same problem in a production environment and it caused an outage today. Any workaround or ETA for the fix? Thanks!
@jbergknoff @marprado I've just merged a pull request that I think should address this problem. If you're interested in testing the fix prior to our general release, please send me your AWS account ID and region by email at skarp (at) amazon.com.
Thanks, @jbergknoff! Do you know when 1.12.1 will be published?
Thanks for getting to the bottom of this @samuelkarp. We rolled back to an older AMI a couple of days ago, so it would take some work for us to try out the new version, sorry. I'm also not sure we could reproduce the behavior reliably (it only occurred in 2 of 6 instances in our cluster).
We just experienced this exact same problem as well after upgrading to the amzn-ami-2016.03.g-amazon-ecs-optimized AMI. The logs flooded so fast on one of our instances that they filled up the disk space to 100% within minutes.
Ditto, let's hope this release gets published soon.
Just adding to the "we've just experienced the same problem" voices. This took out all of our clusters :( I'm not even sure that downgrading the AMI would resolve the issue, because I believe it always pulls the latest version of the agent...
yep, this was an expensive one
Any chance of an ETA on 1.12.1?
I am fairly certain this has also been affecting amzn-ami-2016.03.f (in addition to amzn-ami-2016.03.g mentioned above). Can Amazon advise whether there is a stable version of the ECS AMI that does not contain this issue?
I tried restarting my agent (sudo start ecs)
and got an error about a missing "entity":
/var/log/ecs-init.log.2016-08-22-18
2016-08-22T18:48:03Z [INFO] pre-start
2016-08-22T18:48:03Z [INFO] start
2016-08-22T18:48:03Z [INFO] Container name: /ecs-database-retriever-hipri-3-worker-1-b4aaf3ecb6ccb9a0dc01
2016-08-22T18:48:03Z [INFO] Container name: /ecs-database-retriever-hipri-3-worker-1-d69bb8c69bd0c9ffeb01
2016-08-22T18:48:03Z [INFO] Container name: /ecs-database-retriever-18-worker-1-82cbfbf7eefacbc74f00
2016-08-22T18:48:03Z [INFO] Container name: /ecs-database-retriever-18-worker-1-80acc690ec90f3f8bf01
2016-08-22T18:48:03Z [INFO] Container name: /ecs-pushapi-1-pushapi-f0dadcb5c0b1b189a301
2016-08-22T18:48:03Z [INFO] Container name: /ecs-streampoints-api-9-streampoints-api-a2bafd8cc9d7e8a2f801
2016-08-22T18:48:03Z [INFO] No existing agent container to remove.
2016-08-22T18:48:03Z [INFO] Starting Amazon EC2 Container Service Agent
2016-08-22T18:48:03Z [ERROR] could not start Agent: API error (500): Could not find container for entity id 43f4e5778211381f46107aa998148a693b19b4377234e53997751d80fe22053a
We've just released 1.12.1, which should fix this issue. Please let us know if you continue to run into problems.
How long until a new ECS-Optimized AMI is out, for those of us who use that AMI as a base image?
@ziggythehamster The new ECS AMI is amzn-ami-2016.03.h-amazon-ecs-optimized. We'll be updating our documentation shortly.
@samuelkarp amzn-ami-2016.03.h-amazon-ecs-optimized doesn't show up in the Marketplace for me. I tried searching for "ecs" and "2016.03.h".
@MaerF0x0 We're working on getting the Marketplace listing updated, but in the meantime the latest AMI IDs are available in our documentation.
Hello,
AMI: amzn-ami-2016.03.h-amazon-ecs-optimized (ami-078df974)
/var/log/docker has grown to 3.2G in a few days ;-( The log is full of entries like:
time="2016-09-01T21:14:07.047924749Z" level=error msg="Handler for GET /v1.17/containers/5ec384558f98a760f3adb6c2a1490875667d1fdb7644701880a008bc2af5c64f/json returned error: No such container: 5ec384558f98a760f3adb6c2a1490875667d1fdb7644701880a008bc2af5c64f"
Looks like there is no log rotation in the image?
Thank you,
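To see which messages account for that growth, it can help to tally the log by level and message. This is a rough sketch, assuming the daemon log keeps the `level=... msg="..."` layout shown in the entry above. The `summarize` function and both regular expressions are hypothetical helpers written for this illustration, not part of Docker or the ECS agent.

```python
import re
from collections import Counter

# Matches the tail of lines like: time="..." level=error msg="..."
LINE_RE = re.compile(r'level=(?P<level>\w+)\s+msg="(?P<msg>.*)"\s*$')
# 64-character hex container IDs, replaced so identical errors group together.
CONTAINER_ID_RE = re.compile(r"\b[0-9a-f]{64}\b")


def summarize(lines):
    """Count log lines per (level, message), ignoring container IDs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not follow the assumed layout
        msg = CONTAINER_ID_RE.sub("<id>", match.group("msg"))
        counts[(match.group("level"), msg)] += 1
    return counts


if __name__ == "__main__":
    # /var/log/docker is the path mentioned in the comment above.
    with open("/var/log/docker", errors="replace") as handle:
        for (level, msg), n in summarize(handle).most_common(5):
            print(n, level, msg)
```

If one message dominates the output, as the "No such container" error appears to here, the growth is more likely caused by a repeated failing request than by ordinary logging volume.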
@ebuildy We released v1.12.2 of the ECS Agent today, which addresses the issue that you're seeing. Feel free to open a new issue if the problem persists.
Thanks,
Anirudh