Amazon-ecs-agent: Reservation metrics very unstable

Created on 21 Apr 2017 · 9 Comments · Source: aws/amazon-ecs-agent

Hello,

we're running three ECS clusters and all of them show this behavior occasionally. See the images below:
[image: cloudwatch]
[image: ecs_console]

The workload on these clusters is fairly stable. We only occasionally deploy new versions of containers (i.e. updating task definitions and services), and we're not adding or removing any services. The erratic behavior only ever starts or stops after deployments, though it's not always the same service causing or fixing it. We also hardly ever adjust resource reservations for the services.

All services work normally, so the issue seems to not affect scheduling. We do use the metrics to trigger auto scaling actions, though, so these get very flaky, too. Also completely replacing the instances doesn't seem to get rid of the problem.

The only abnormal things in the logs seem to be lines like these:

[INFO] Redundant container state change for task, Status: (RUNNING->RUNNING) Containers: [(RUNNING->RUNNING),]: (RUNNING->RUNNING) to RUNNING, but already RUNNING

Any idea what might be causing this?

Labels: kind/bug, scope/ECS Service

ligustah

Most helpful comment

We were able to reproduce this erratic behavior in the CPUReservation ECS metric on one of our clusters. We created a cluster with a single c3.8xlarge instance (which has 32 cores) and a single ECS service, whose task definition had a cpu reservation of 2600 units.

We noticed that when there were only a few tasks in the service, the CPUReservation metric told the truth and showed the value we would expect - at 6 tasks the reservation was:

 6  x 2600
-----------  =  47.6%
 32 x 1024

And as we increased the number of tasks from 6 to 7, 8, 9 and 10, the CPUReservation numbers continued to follow the values we calculated. At 11 though, where we'd expect the reservation to be 87.3%, the metric became erratic and jumped between many different values:

[image: cloudwatch_management_console]
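For reference, the expected values above can be reproduced with a short sketch. It assumes CPUReservation is the total CPU units reserved by running tasks divided by the total CPU units registered by the cluster's instances, with 1024 CPU units per core; the function name `expected_cpu_reservation` is made up for illustration and is not part of any ECS API.

```python
# Sketch only: assumes 1024 CPU units per core and a single 32-core
# c3.8xlarge instance, as in the reproduction described above.
CPU_UNITS_PER_CORE = 1024


def expected_cpu_reservation(task_count, cpu_per_task, cores):
    """Return the expected CPUReservation as a percentage.

    reserved units   = task_count * cpu_per_task
    registered units = cores * CPU_UNITS_PER_CORE
    """
    reserved = task_count * cpu_per_task
    registered = cores * CPU_UNITS_PER_CORE
    return 100.0 * reserved / registered


if __name__ == "__main__":
    # Task counts 6 through 11, each task reserving 2600 CPU units.
    for n in range(6, 12):
        pct = expected_cpu_reservation(n, 2600, 32)
        print(f"{n:2d} tasks: {pct:.1f}%")
```

With these inputs the sketch gives 47.6% for 6 tasks and 87.3% for 11 tasks, matching the numbers quoted above. A 12th task would need 31200 units, which still fits in the 32768 registered, so the erratic values at 11 tasks are not explained by the instance running out of CPU units.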

Does the service team have any updates on this issue? Is it something the community could help fix within the agent, or is it a problem on the control/data plane side?

Thank you! And as a token of our gratitude, we wrote up this haiku:

springtime often lies: beware ECS metrics and close your windows

jakepruitt on 23 May 2017

😄5

All 9 comments

Hi @ligustah, thank you for reporting this issue. I have made the ECS service team aware of this issue. Can you please provide us the cluster ARN to help us debug this issue (cluster name + AWS account ID + region would also be sufficient)? You can email it to aithal at amazon dot com if you prefer. Alternatively, you can also engage AWS support as we like to use this repository to work on ECS Agent issues and this seems related to how ECS computes metrics.

Thanks,
Anirudh

aaithal on 25 Apr 2017

Hi @aaithal ,
I just sent a mail to the address you mention with the information you requested.

Also, I'm sorry if this wasn't the right place to report it. I assumed it was related to how the agent reports metrics, but looking at it now that really seems rather unlikely!

Thanks for your help, feel free to close this.

ligustah on 27 Apr 2017

Hi @ligustah, thank you for sending that information. We'll update this thread when we have more information to share with you.

aaithal on 27 Apr 2017

👍1

:fearful:

yhahn on 23 May 2017

😱

xrwang on 23 May 2017

💀

jakepruitt on 23 May 2017

We're seeing the exact same issue, both at the cluster level and at the service level. It seems the agent is not sending those metrics to CloudWatch, which is affecting the cluster's average.

Below is a graph of the services running on the cluster. Notice the gaps in the graph for the missing data points.

[image: screen shot 2017-05-27 at 11 00 30 pm]

Also, based on the comment from @jakepruitt: one instance in our cluster is currently running 12 services and is having the issue, while the other instance, which has only 2-3 services, seems to be reporting metrics correctly.

More than happy to provide any more details on this.

jveldboom on 28 May 2017

Hey @jakepruitt, this issue was fixed on the ECS service team side.

@jveldboom, we're closing this issue for now. Please let us know if you encounter this again, and feel free to open a separate issue for your specific case. Thank you.

adnxn on 18 Jul 2017

❤2
