Containers-roadmap: [EKS] [request]: More CloudWatch Metrics and Configuration

Created on 27 Oct 2019 · 2 Comments · Source: aws/containers-roadmap

Which service(s) is this request for?
EKS

Tell us about your request
After setting up and looking into CloudWatch Container Insights for Amazon EKS, I think it is a good way to collect logs and metrics for Kubernetes. However, I think the CloudWatch agent currently lacks the configuration options and metrics that would make it more useful. These are the configuration options and metrics I suggest:

Configurations

Ability to select which namespaces to collect metrics from
Ability to select which apps to collect metrics from. Labels could be used for this, for example labels.app or labels.k8s-app
Ability to select which metrics to collect
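
To make the three options above concrete, here is a minimal sketch of how such a filter might behave. Everything in it is hypothetical: `CollectionFilter`, its fields, and the metric name `pod_memory_bytes` are illustration only and do not correspond to any existing CloudWatch agent setting. An empty set is assumed to mean "collect everything".

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CollectionFilter:
    """Hypothetical collection filter; NOT an actual CloudWatch agent option."""

    namespaces: Set[str] = field(default_factory=set)  # empty = all namespaces
    app_labels: Set[str] = field(default_factory=set)  # empty = all apps
    metrics: Set[str] = field(default_factory=set)     # empty = all metrics

    def wants_pod(self, namespace: str, labels: Dict[str, str]) -> bool:
        """Return True if metrics should be collected for this pod."""
        if self.namespaces and namespace not in self.namespaces:
            return False
        if self.app_labels:
            # Check the two label keys mentioned in the request.
            app = labels.get("app") or labels.get("k8s-app")
            return app in self.app_labels
        return True

    def wants_metric(self, name: str) -> bool:
        """Return True if this metric name should be collected."""
        return not self.metrics or name in self.metrics


if __name__ == "__main__":
    flt = CollectionFilter(
        namespaces={"prod"},
        app_labels={"checkout"},
        metrics={"pod_memory_bytes"},  # hypothetical metric name
    )
    print(flt.wants_pod("prod", {"app": "checkout"}))
    print(flt.wants_pod("dev", {"app": "checkout"}))
    print(flt.wants_metric("pod_memory_bytes"))
```

With a filter like this, only pods in the selected namespaces that carry a matching app label would produce metrics, and only the listed metrics would be sent.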

Metrics

Node and pod memory utilization in bytes
Node and pod CPU utilization in millicores
Container memory page faults
Metrics should be collected for individual pods. Currently they are collected per deployment, daemon set, etc. under the name PodName. I think this is not really helpful, because a problem can occur on a single pod of a deployment while the metric is the sum over all the pods.
Alarms should be able to trigger on any single pod of a deployment.
The health of a job or cron job: number of jobs created, number of succeeded and failed jobs, etc.
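
The point about per-pod metrics can be illustrated with a small sketch. The pod names, memory figures, and threshold below are made up for illustration, and the two functions are hypothetical helpers, not part of any CloudWatch API. The sketch only shows why a check on the summed value can stay quiet while one pod is clearly misbehaving.

```python
from typing import Dict, List

MIB = 1024 * 1024


def pods_over_threshold(memory_bytes_by_pod: Dict[str, int],
                        threshold_bytes: int) -> List[str]:
    """Per-pod check: return the pods whose own usage exceeds the threshold."""
    return sorted(
        pod for pod, used in memory_bytes_by_pod.items() if used > threshold_bytes
    )


def aggregate_over_threshold(memory_bytes_by_pod: Dict[str, int],
                             per_pod_threshold_bytes: int) -> bool:
    """Aggregate check: compare the summed usage with threshold * pod count."""
    total = sum(memory_bytes_by_pod.values())
    return total > per_pod_threshold_bytes * len(memory_bytes_by_pod)


if __name__ == "__main__":
    # Made-up usage for three pods of one deployment.
    usage = {"web-1": 100 * MIB, "web-2": 120 * MIB, "web-3": 900 * MIB}
    threshold = 512 * MIB

    print(aggregate_over_threshold(usage, threshold))  # summed view
    print(pods_over_threshold(usage, threshold))       # per-pod view
```

Here the summed usage (1120 MiB) stays below the aggregate limit (3 × 512 MiB = 1536 MiB), so the aggregate check does not fire, while the per-pod check flags `web-3`.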

EKS Proposed

khacminh

👍12

Most helpful comment

For most organizations, metrics are a significant part of the infrastructure, and collecting them requires multiple tools. Companies are migrating to EKS, or have been using it for a while, and they still lack this kind of metrics unless they add another layer to their toolset. With the release of Container Insights, everyone thought they would be able to finish moving everything to their CloudWatch dashboards and alarms, but when they realized how few metrics were available, they ended up deciding not to use it.

EKS has much more to monitor than node filesystem usage and very basic networking metrics. People want to know whether their deployments succeeded, whether a node is ready to use, how many pods are waiting, and whether nodes are schedulable. They also want to know whether deployments have been paused for too long, along with many other things (daemon sets, replica sets, services, deployments), and some other very useful ones, such as whether their services are generating HTTP errors (3xx, 4xx, 5xx).

I feel that if the CloudWatch team steps up and adds this to its scope, being able to get more metrics out of Container Insights will make most companies use its monitoring services a lot more.

So my question is: does the CloudWatch team, or whichever team owns this kind of project, have a roadmap for this?