Is there a metric for when the agent kills a service because it exceeded its hard memory limit? I am trying to configure CloudWatch alarms for this scenario, or for the scenario where services can't be placed due to constraints, and I'm having a hard time finding a set of metrics (or aggregate metrics) that describes this.
If these don't already exist, is it possible to get them added?
@devshorts,
Thanks for suggesting this! We don't currently report a metric for that event.
It's not exactly what you asked for, but you could see if the CloudWatch Event Stream addresses your use case. You could use these events to publish a custom metric if you want to set up CloudWatch alarms.
One clarification: the agent doesn't kill tasks for memory limits being exceeded. The kernel is what will actually kill an application when a hard memory limit is exceeded. Docker (and ECS) depend on functionality that is already built into Linux.
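As a rough illustration of the event-stream approach, a Lambda function subscribed to ECS "Task State Change" events could publish a custom metric whenever a stopped container looks like it was OOM-killed. The field names below follow the ECS task state change event format; the metric namespace and the exit-code-137 / "memory in the reason string" heuristic are placeholders you'd want to verify against your own stopped tasks:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    """Lambda target for an ECS "Task State Change" event rule (sketch)."""
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return

    # Count containers in this task that appear to have been OOM-killed.
    oom_kills = 0
    for container in detail.get("containers", []):
        reason = (container.get("reason") or "").lower()
        if container.get("exitCode") == 137 or "memory" in reason:
            oom_kills += 1

    if oom_kills:
        cloudwatch.put_metric_data(
            Namespace="Custom/ECS",  # hypothetical namespace
            MetricData=[{
                "MetricName": "OomKilledContainers",
                "Dimensions": [
                    {"Name": "ClusterArn", "Value": detail.get("clusterArn", "unknown")},
                ],
                "Value": oom_kills,
                "Unit": "Count",
            }],
        )
```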
@petderek I see. The complexity here is that it's hard to tell the difference between a kill and a scale-down if we use the event stream just to find stops. We already have metrics that log container start/stop (since we hook into SIGTERM and log our own metric), but we can't deterministically tell whether we are thrashing (i.e. the container is being started and stopped a lot) or just dynamically scaling.
I don't have experience with it, but it looks like Docker does emit some sort of events: https://docs.docker.com/engine/reference/commandline/events
Being able to capture those and put them into CloudWatch would be really killer.
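Just to sketch what that could look like: the Docker SDK for Python can stream the daemon's event feed, and the `oom` container event type in particular marks containers killed for exceeding their memory limit. The metric names and namespace below are made up; treat this as a starting point rather than a finished agent:

```python
import boto3
import docker

cloudwatch = boto3.client("cloudwatch")
client = docker.from_env()

# Stream container "oom" and "die" events from the local Docker daemon
# and republish each one as a custom CloudWatch metric data point.
for event in client.events(decode=True,
                           filters={"type": "container", "event": ["oom", "die"]}):
    cloudwatch.put_metric_data(
        Namespace="Custom/DockerEvents",  # placeholder namespace
        MetricData=[{
            "MetricName": "ContainerOom" if event["Action"] == "oom" else "ContainerDie",
            "Value": 1,
            "Unit": "Count",
        }],
    )
```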
If it's possible to tag onto this request, having CW metrics for the number of running tasks and desired tasks would also be valuable, rather than correlating these through events.
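In the meantime, a scheduled job (e.g. a cron-triggered Lambda) can approximate this by polling DescribeServices and republishing the counts as custom metrics. A minimal sketch; the cluster/service names and the metric namespace are placeholders:

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-cluster"        # placeholder cluster name
SERVICES = ["my-service"]     # placeholder service names

def publish_task_counts():
    """Publish runningCount/desiredCount for each service as custom metrics."""
    response = ecs.describe_services(cluster=CLUSTER, services=SERVICES)
    for service in response["services"]:
        dimensions = [
            {"Name": "ClusterName", "Value": CLUSTER},
            {"Name": "ServiceName", "Value": service["serviceName"]},
        ]
        cloudwatch.put_metric_data(
            Namespace="Custom/ECS",
            MetricData=[
                {"MetricName": "RunningTaskCount", "Dimensions": dimensions,
                 "Value": service["runningCount"], "Unit": "Count"},
                {"MetricName": "DesiredTaskCount", "Dimensions": dimensions,
                 "Value": service["desiredCount"], "Unit": "Count"},
            ],
        )
```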
Moving this to the containers roadmap since this is a non-agent-related feature request.