Hello,
we're running three ECS clusters and all of them show this behavior occasionally. See the images below:


The workload on these clusters is fairly stable. We only occasionally deploy new versions of containers (i.e. updating task definitions and services), and we're not adding or removing any services. The erratic behavior only ever starts or stops after deployments, though it's not always the same service causing or fixing it. We also hardly ever adjust resource reservations for the services.
All services work normally, so the issue doesn't seem to affect scheduling. We do use the metrics to trigger auto scaling actions, though, so these get very flaky, too. Also, completely replacing the instances doesn't seem to get rid of the problem.
The only abnormal things in the logs seem to be lines like these:
[INFO] Redundant container state change for task, Status: (RUNNING->RUNNING) Containers: [(RUNNING->RUNNING),]: (RUNNING->RUNNING) to RUNNING, but already RUNNING
Any idea what might be causing this?
Hi @ligustah, thank you for reporting this issue. I have made the ECS service team aware of this issue. Can you please provide us the cluster ARN to help us debug this issue (cluster name + AWS account ID + region would also be sufficient)? You can email it to aithal at amazon dot com if you prefer. Alternatively, you can also engage AWS support as we like to use this repository to work on ECS Agent issues and this seems related to how ECS computes metrics.
Thanks,
Anirudh
Hi @aaithal ,
I just sent a mail to the address you mention with the information you requested.
Also, I'm sorry if this wasn't the right place to report it. I assumed it was related to how the agent reports metrics, but looking at it now that really seems rather unlikely!
Thanks for your help, feel free to close this.
Hi @ligustah, thank you for sending that information. We'll update this thread when we have more information to share with you.
We were able to reproduce this erratic behavior in the CPUReservation ECS metric on one of our clusters. We created a cluster with a single c3.8xlarge instance (which has 32 cores) and a single ECS service, whose task definition had a cpu reservation of 2600 units.
We noticed that when there were only a few tasks in the service, the CPUReservation metric told the truth and showed the value we would expect - at 6 tasks the reservation was:
6 x 2600
----------- = 47.6%
32 x 1024
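The same arithmetic can be written as a short script to get the expected value for any task count. This is only a sketch of the calculation above, not how ECS computes the metric internally; the function name `expected_cpu_reservation` and the constant name are made up for illustration, and the inputs (2600 CPU units per task, 32 cores, 1024 units per core) are the ones from our test cluster.

```python
# Hypothetical helper: reproduces the hand calculation above,
# not the ECS service's actual implementation.
CPU_UNITS_PER_CORE = 1024  # ECS counts 1024 CPU units per core


def expected_cpu_reservation(task_count, cpu_units_per_task, cores):
    """Return the expected cluster CPU reservation as a percentage."""
    reserved = task_count * cpu_units_per_task
    registered = cores * CPU_UNITS_PER_CORE
    return 100.0 * reserved / registered


if __name__ == "__main__":
    # Task counts we stepped through on the single c3.8xlarge instance.
    for tasks in range(6, 12):
        pct = expected_cpu_reservation(tasks, 2600, 32)
        print(f"{tasks:2d} tasks -> {pct:.1f}%")
```

For 6 tasks this gives the 47.6% shown above, and for 11 tasks it gives 87.3%.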
And as we increased the number of tasks from 6 to 7, 8, 9 and 10, the CPUReservation numbers continued to follow the values we calculated. At 11 though, where we'd expect the reservation to be 87.3%, the metric became erratic and jumped between many different values:

Does the service team have any updates on this issue? Is it something the community could help fix within the agent, or is it a problem on the control/data plane side?
Thank you! And as a token of our gratitude, we wrote up this haiku:
springtime often lies: beware ECS metrics and close your windows
We're seeing the exact same issue at both the cluster level and the service level. It seems the agent is not sending those metrics to CloudWatch, which is affecting the cluster's average.
Below is a graph of the services running on the cluster. Notice the gaps in the graph for the missing data points.

Also, based on the comment from @jakepruitt: one instance in our cluster is currently running 12 services and is having the issue, while the other instance, which has only 2-3 services, seems to be reporting metrics correctly.
More than happy to provide any more details on this.
Hey @jakepruitt, this issue has been fixed on the ECS service team side.
@jveldboom, we're closing this issue for now. Please let us know if you encounter this again, and feel free to open a separate issue for your specific case. Thank you.