Containers-roadmap: [EKS] [request]: EKS Control Plane Metrics Available In CloudWatch

Created on 17 Mar 2020 · 4 comments · Source: aws/containers-roadmap

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

In some scenarios it is useful for Kubernetes operators to know the health of the EKS control plane; some applications or pods may overload it, and it is helpful to know when that happens. Having control plane metrics in CloudWatch, such as:

  • apiserverRequestCount
  • apiserverRequestErrCount
  • apiserverLatencyBucket
  • kubeNodes
  • kubePods

can help customers diagnose slowness or unresponsiveness in the control plane.

Which service(s) is this request for?
EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
When the control plane is slow, we would like to know whether there has been a spike in API requests, a spike in the number of errors, or a spike in newly created pods.

Are you currently working around this issue?
Scraping the /metrics endpoint exposed by the Kubernetes API server.
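
For reference, a minimal sketch of this workaround using the official kubernetes Python client (the kubeconfig setup and the metric-name filter below are illustrative assumptions, not part of the original issue):

```python
# Minimal sketch: scrape the API server's /metrics endpoint with the
# official `kubernetes` Python client and print apiserver request counters.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod
api = client.ApiClient()

# Raw GET against /metrics; the response is Prometheus text exposition format.
data, status, headers = api.call_api(
    "/metrics", "GET",
    auth_settings=["BearerToken"],
    response_type="str",
    _preload_content=True,
)

for line in data.splitlines():
    # `apiserver_request_total` is the counter behind "apiserverRequestCount";
    # older clusters expose it as `apiserver_request_count`.
    if line.startswith(("apiserver_request_total", "apiserver_request_count")):
        print(line)
```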

Labels: EKS, Proposed

Most helpful comment

Hey everyone, I'm a Product Manager for CloudWatch. We are looking for people to join our beta program to provide feedback and test Prometheus metric monitoring in CloudWatch. The beta program will allow you to test the collection of the EKS Control Plane Metrics exposed as Prometheus metrics. Email us if interested, [email protected].

All 4 comments

Hey everyone, I'm a Product Manager for CloudWatch. We are looking for people to join our beta program to provide feedback and test Prometheus metric monitoring in CloudWatch. The beta program will allow you to test the collection of the EKS Control Plane Metrics exposed as Prometheus metrics. Email us if interested, [email protected].

Can we include the cluster component status in CloudWatch as well? For example:

  • kube-controller-manager
  • scheduler (http://localhost:8001/api/v1/componentstatuses/scheduler)

These could be used to set up a CloudWatch alarm for when a custom webhook breaks a component, for example a newly installed ValidatingWebhook that breaks the scheduler's lease-renewal calls.
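
A hedged sketch of that alarm pipeline: mirror componentstatuses health into a custom CloudWatch metric with boto3, then alarm on it. The EKS/ComponentStatus namespace and Healthy metric name are invented for illustration, and note that the componentstatuses API is deprecated in newer Kubernetes versions:

```python
# Illustrative sketch: publish componentstatuses health as a custom
# CloudWatch metric so an alarm can fire when e.g. the scheduler breaks.
# Namespace and metric name below are assumptions, not an AWS convention.
import boto3
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
cloudwatch = boto3.client("cloudwatch")

for name in ("scheduler", "controller-manager"):
    status = v1.read_component_status(name)
    healthy = any(
        c.type == "Healthy" and c.status == "True"
        for c in (status.conditions or [])
    )
    cloudwatch.put_metric_data(
        Namespace="EKS/ComponentStatus",  # assumed custom namespace
        MetricData=[{
            "MetricName": "Healthy",
            "Dimensions": [{"Name": "Component", "Value": name}],
            "Value": 1.0 if healthy else 0.0,
        }],
    )
```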

@starchx - https://aws.amazon.com/about-aws/whats-new/2020/09/amazon-cloudwatch-monitors-prometheus-metrics-container-environments/.
You can use the CloudWatch Prometheus agent to support the above use case. In the first phase (already available), we encourage you to configure the agent to consume control plane metrics for EKS and leverage CloudWatch alarms. In the second phase, we will also build an automated, out-of-the-box dashboard for the EKS control plane.
Check out this workshop to learn more: https://observability.workshop.aws/en/containerinsights/eks/_prometheusmonitoring.html
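
As an illustration of the first phase, a CloudWatch alarm on an agent-published metric could be created with boto3 roughly like this. The namespace, metric name, dimensions, and threshold are assumptions; verify against what your agent configuration actually emits:

```python
# Illustrative sketch: alarm on an apiserver error counter published by the
# CloudWatch Prometheus agent. All names/values here are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
cloudwatch.put_metric_alarm(
    AlarmName="eks-apiserver-5xx-spike",
    Namespace="ContainerInsights/Prometheus",   # assumed namespace
    MetricName="apiserver_request_total",       # assumed metric name
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-cluster"},  # hypothetical cluster
        {"Name": "code", "Value": "500"},
    ],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100.0,  # tune to your baseline error rate
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```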

I don't mind scraping the endpoints myself, since I use Datadog for monitoring, but not having access to the scheduler or controller manager metrics endpoints is tough. For example, without access to the kube-scheduler, my team and I are unable to track "time to schedule a pod", which is a key service level indicator for us.

https://github.com/DataDog/integrations-core/blob/master/kube_scheduler/datadog_checks/kube_scheduler/kube_scheduler.py#L41
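
For what it's worth, one coarse approximation of that SLI without scheduler metrics is to diff each pod's PodScheduled condition transition time against its creation timestamp. A rough, illustrative sketch with the kubernetes Python client (it misses rescheduling and is subject to clock skew, so it is no substitute for the scheduler's own histograms):

```python
# Coarse approximation of "time to schedule a pod" from pod objects alone,
# when the kube-scheduler /metrics endpoint is unreachable.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces().items:
    scheduled = next(
        (c for c in (pod.status.conditions or [])
         if c.type == "PodScheduled" and c.status == "True"),
        None,
    )
    if scheduled and scheduled.last_transition_time:
        delta = scheduled.last_transition_time - pod.metadata.creation_timestamp
        print(f"{pod.metadata.namespace}/{pod.metadata.name}: "
              f"{delta.total_seconds():.1f}s to schedule")
```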
