Cadvisor: Average CPU % usage per container

Created on 21 Aug 2018 · 25 comments · Source: google/cadvisor

In my Kubernetes installation, I can see cAdvisor reports a measurement in the Prometheus output called "container_cpu_load_average_10s" for each pod/container. I get values such as 232, 6512 and so on.

So, what is the unit of measure for CPU load here? To me, "CPU load" and "CPU usage" are used interchangeably, so I can't understand why it's not a value between 0 and 100.

Here is the related line from the cAdvisor output:

...
container_cpu_load_average_10s{container_name="",id="/system.slice/kubelet.service",image="",name="",namespace="",pod_name=""} 1598
...

Most helpful comment

This issue seems to be where a lot of people land when trying to find out how to calculate the CPU usage metric correctly in Prometheus, myself included! So I'll post what I eventually ended up using, as I think it's still a little difficult to tie together all the snippets of info here and elsewhere.

This is specific to k8s and containers that have CPU limits set. Please correct me if any of this is wrong.

To show CPU usage as a percentage of the limit given to the container, this is the Prometheus query we used to create nice graphs in Grafana:

sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container_name!="POD"}[5m])) by (pod_name, container_name) /
sum(container_spec_cpu_quota{name!~".*prometheus.*", image!="", container_name!="POD"}/container_spec_cpu_period{name!~".*prometheus.*", image!="", container_name!="POD"}) by (pod_name, container_name)

It returns a number between 0 and 1, so format the left Y axis as percent (0.0-1.0) or multiply by 100 to get the CPU usage percentage.

Note that we added some filtering here to get rid of some noise: name!~".*prometheus.*", image!="", container_name!="POD". The name!~".*prometheus.*" is just because we aren't interested in the CPU usage of all the prometheus exporters running in our k8s cluster.


Hope this helps!

All 25 comments

The units are in number of tasks. This article does a decent job of explaining it: https://serverfault.com/questions/667078/high-cpu-utilization-but-low-load-average
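To make those units concrete, here is a toy sketch (hypothetical, not cAdvisor's actual implementation) of a decaying load average over runnable-task counts, which is how a busy container can report values like the 1598 in the log line above:

```python
# Illustrative sketch only (NOT cAdvisor's actual code): a load average is an
# exponentially decayed mean of the runnable-task count, so it tracks numbers
# of tasks, not a 0-100 utilisation percentage.
import math

def decayed_load(task_counts, interval_s=1.0, window_s=10.0):
    """Decayed average of runnable-task samples taken every interval_s."""
    alpha = math.exp(-interval_s / window_s)
    load = 0.0
    for runnable in task_counts:
        load = load * alpha + runnable * (1 - alpha)
    return load

# A container holding ~1600 runnable threads settles near a "load" of 1600.
print(round(decayed_load([1600] * 100)))  # -> 1600
```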

@dashpole That’s a perfect explanation. Thank you. Can you suggest a way to calculate average CPU usage (in percentage) per pod/container using cAdvisor’s prometheus export data? I didn’t find anything directly related.

It depends on what you are using to run containers. In Kubernetes, container_spec_cpu_quota maps to container limits, and container_spec_cpu_shares is based on container requests.

If you wanted to know the percentage of CPU a container was using, you should be able to do something like container_cpu_usage_seconds_total / container_spec_cpu_quota * some constant. I don't remember what the quota is measured in off the top of my head.

@dashpole Thank you for your detailed explanation and time. I appreciate it. To clarify: What do you mean by "some constant" and where did you get the formula above? Any links/references would be appreciated.

My previous formula was actually incomplete, as the "some constant" is a configurable parameter, cpu.cfs_period_us. Containers use the cgroup attributes cpu.cfs_period_us and cpu.cfs_quota_us to limit a container's CPU usage; see Red Hat's definitions. Both attributes are measured in microseconds, although technically cpu.cfs_quota_us is microseconds of CPU time (not just microseconds). container_cpu_usage_seconds_total is measured in seconds. You can get the container's limit in CPUs from cpu.cfs_quota_us / cpu.cfs_period_us. Then, to get the percentage of this limit used over a period of time, take the rate of CPU usage: rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period).
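A small sketch of that arithmetic with made-up numbers (the 50ms/100ms values are hypothetical, not from a real pod):

```python
# cpu.cfs_quota_us / cpu.cfs_period_us gives the limit in CPUs (cores).
cfs_quota_us = 50_000    # hypothetical: 50 ms of CPU time allowed...
cfs_period_us = 100_000  # ...per 100 ms period -> limit of 0.5 CPUs
cpu_limit_cores = cfs_quota_us / cfs_period_us

# rate(container_cpu_usage_seconds_total[10m]) yields CPU-seconds consumed per
# second, i.e. cores in use; dividing by the limit gives the fraction of the
# limit that is being used.
usage_cores = 0.2  # hypothetical rate() result
fraction_of_limit = usage_cores / cpu_limit_cores

print(cpu_limit_cores)    # -> 0.5
print(fraction_of_limit)  # -> 0.4
```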

@dashpole What is the difference between docker stats and
rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period)?

I use rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period) to show container CPU usage in Prometheus, but the result is different from what docker stats shows.
What is their relationship?
Thanks.

@szediktam
docker stats shows the % of the host's CPU and memory: https://docs.docker.com/engine/reference/commandline/stats/#examples.
Your query will give you usage in cores.

I also suspect that docker stats averages over a much smaller time window than 10 minutes.
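A toy counter (hypothetical numbers) shows why the two can disagree: a short burst dominates a seconds-scale window like the one docker stats uses, but is diluted in a 10-minute rate():

```python
def rate(samples, t0, t1):
    """Per-second increase of a cumulative counter between two sample times."""
    return (samples[t1] - samples[t0]) / (t1 - t0)

# Cumulative CPU-seconds, one sample per second: idle for ~10 minutes,
# then a 10-second burst at 1 full core just before we query.
cpu_seconds = [0.0]
for t in range(1, 601):
    using = 1.0 if t > 590 else 0.0
    cpu_seconds.append(cpu_seconds[-1] + using)

print(rate(cpu_seconds, 590, 600))  # short window: 1.0 core
print(rate(cpu_seconds, 0, 600))    # 10m window: ~0.017 cores
```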

thanks @dashpole ,

But when I tried the following query, "no data" was returned.

rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period)

I tried for a few hours, but it does not work. Is something wrong with the query?

@jehos do each of the individual metrics exist?

@dashpole yes, of course.

  • rate(container_cpu_usage_seconds_total[10m])
...

{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="kubernetes-dashboard",container_name="kubernetes-dashboard",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/8a665872c7e45d23511f8cc2e9fc585e10f4977861f9c40e4e6b5bc259cc4319",image="sha256:f9aed6605b814b69e92dece6a50ed1e4e730144eb1cc971389dde9cb3820d124",instance="k8s",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_kubernetes-dashboard_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_12",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.0007485564337162398
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="kubernetes-dashboard",container_name="kubernetes-dashboard",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/8a665872c7e45d23511f8cc2e9fc585e10f4977861f9c40e4e6b5bc259cc4319",image="sha256:f9aed6605b814b69e92dece6a50ed1e4e730144eb1cc971389dde9cb3820d124",instance="k8s",job="kubernetes-nodes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_kubernetes-dashboard_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_12",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.0007593627475261611
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b",instance="k8s",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.0007427663630978316
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b",instance="k8s",job="kubernetes-nodes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.0007536835107680461
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="POD",container_name="POD",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/04a6ada9b612aed85700945acb497d2d2b4b8f9fbfc07804a24f9e18a6675b91",image="k8s.gcr.io/pause:3.1",instance="k8s",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_POD_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_38",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="POD",container_name="POD",cpu="total",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/04a6ada9b612aed85700945acb497d2d2b4b8f9fbfc07804a24f9e18a6675b91",image="k8s.gcr.io/pause:3.1",instance="k8s",job="kubernetes-nodes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_POD_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_38",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"}
...
  • (container_spec_cpu_quota / container_spec_cpu_period)
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="kubernetes-dashboard",container_name="kubernetes-dashboard",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/8a665872c7e45d23511f8cc2e9fc585e10f4977861f9c40e4e6b5bc259cc4319",image="sha256:f9aed6605b814b69e92dece6a50ed1e4e730144eb1cc971389dde9cb3820d124",instance="k8s",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_kubernetes-dashboard_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_12",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.1
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",container="kubernetes-dashboard",container_name="kubernetes-dashboard",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b/8a665872c7e45d23511f8cc2e9fc585e10f4977861f9c40e4e6b5bc259cc4319",image="sha256:f9aed6605b814b69e92dece6a50ed1e4e730144eb1cc971389dde9cb3820d124",instance="k8s",job="kubernetes-nodes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",name="k8s_kubernetes-dashboard_kubernetes-dashboard-cb6749dc6-w7v24_kube-system_6231e452-3e32-11e9-87bb-000ec4d1614b_12",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.1
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b",instance="k8s",job="kubernetes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"} | 0.1
{beta_kubernetes_io_arch="amd64",beta_kubernetes_io_os="linux",id="/kubepods/pod6231e452-3e32-11e9-87bb-000ec4d1614b",instance="k8s",job="kubernetes-nodes-cadvisor",kubernetes_io_arch="amd64",kubernetes_io_hostname="k8s",kubernetes_io_os="linux",namespace="kube-system",pod="kubernetes-dashboard-cb6749dc6-w7v24",pod_name="kubernetes-dashboard-cb6749dc6-w7v24"}

I tried to get the average CPU usage per container (in %) using the following query:
rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period)
But it says no datapoints found. Can someone help me out in getting the average CPU usage of a container?


I can't find container_spec_cpu_quota; all I find is container_spec_cpu_shares and container_spec_cpu_period. I have Prometheus and cAdvisor up and running fine.

I also tried the query below, but "no data" is returned.

rate(container_cpu_usage_seconds_total[10m]) / (container_spec_cpu_quota / container_spec_cpu_period)

Is there anything missing here?

Individually, both rate(container_cpu_usage_seconds_total[10m]) and (container_spec_cpu_quota / container_spec_cpu_period) return data, but not together?

In general the query given above won't work for containers running in AWS ECS, since ECS sets CPU shares on containers rather than setting a CPU quota. I've been digging through AWS docs and tried a few things, but I haven't been able to get ECS to set a quota on the container. I suspect it would work for the Fargate launch type, but I haven't tried that since my CI/CD pipeline isn't set up to deploy that way automatically.

I'm still trying to figure out a way to get CPU usage as a percentage similar to what CloudWatch shows, but I haven't come across anything promising yet. I'll update this thread if I do find something.

To close the loop on this, I ended up running with the following, which lines up with the CPU utilization info I see in CloudWatch. I think the breakthrough was when I found in the Amazon docs that 1024 CPU shares is equivalent to 1 CPU, in the task size section: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#task_size

From there I did a little logic and algebra that essentially ended up being PERCENTAGE = (x cpu seconds / 1 minute) / (60 cpu seconds / 1 minute) * (1024 cpu shares / y cpu shares), where x and y are things we can get from cAdvisor. Note that CPU shares are inversely proportional to the percentage, because as I have more CPU shares I'll be using more CPUs, so my total CPU seconds can be higher without raising my total percentage used (and vice versa).
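That formula can be sketched in a few lines (the numbers are hypothetical; the 1024-shares-per-CPU constant is from the ECS task-size docs linked above):

```python
def ecs_cpu_percent(cpu_seconds_per_minute, cpu_shares):
    """% of the share allocation used, assuming 1024 shares == 1 CPU."""
    cpus_allocated = cpu_shares / 1024       # shares -> CPUs
    cpus_used = cpu_seconds_per_minute / 60  # CPU-seconds/minute -> CPUs
    return cpus_used / cpus_allocated * 100

# Hypothetical task: 512 shares (half a CPU), burning 15 CPU-seconds/minute.
print(ecs_cpu_percent(15, 512))  # -> 50.0
```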

For Prometheus we needed to sum all of the CPUs on the instance, so the expression worked out to sum by (the things you want to see) (rate(container_cpu_usage_seconds_total[60s]) * 60 * 1024) / on (the things you want to see) (container_spec_cpu_shares) / 60 * 100

Full working example from our use case, in case I messed up the syntax while simplifying it above:
sum by (container_label_serviceName, instance, name) (rate(container_cpu_usage_seconds_total{container_label_team="myTeamName"}[60s]) * 1024 * 60) / on (container_label_serviceName, instance, name) (container_spec_cpu_shares{container_label_team="myTeamName"}) / 60 * 100

sum(rate(container_cpu_usage_seconds_total{name!~".*prometheus.*", image!="", container_name!="POD"}[5m])) by (pod_name, container_name) /
sum(container_spec_cpu_quota{name!~".*prometheus.*", image!="", container_name!="POD"}/container_spec_cpu_period{name!~".*prometheus.*", image!="", container_name!="POD"}) by (pod_name, container_name)

I tried this query on AWS EKS and somehow it does not make sense. I have some containers that are throttling, and when I run the query you provided on those same containers I get a value < 1, i.e. even after multiplying the query by 100. Shouldn't this query mean that if the value is > 100 then the container is being throttled? Or am I missing something?

@imdhruva

I tried this query on AWS EKS and somehow it does not make sense

We are also running EKS. The k8s version shouldn't make any difference.

Shouldn't this query mean that if the value is > 100 then the container is being throttled? Or am I missing something?

We see this also. I think the answer is that Linux Kernel cgroups simply don't work exactly in the way you might expect. There's a bunch of interesting information here: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-cpu


@dashpole I have a 10-worker-node k8s cluster. I have node_exporter running on these, and cAdvisor (as a service) on all these hosts on port 8080. When I query from the Prometheus UI, the data is there; when I curl from the Prometheus server, I can see the data; but the Grafana dashboard I imported is not showing data. Even when I query in the Prometheus UI, it shows data for the same worker-node IP for many services. Please help me understand how to visualize these metrics that cAdvisor has scraped and stored in Prometheus.


@max-rocket-internet Hi, we removed the limits for the pods/deployment as per https://github.com/kubernetes/kubernetes/issues/51135, and this query
sum(rate(container_cpu_usage_seconds_total{pod=~"$pod", container!=""}[5m])) by (pod_name,container_name) / sum(container_spec_cpu_quota{pod=~"$pod", container!=""}/container_spec_cpu_period{pod=~"$pod", container!=""}) by (pod_name,container_name) is not working for us. What else can be queried instead of limits to get the exact pod CPU usage?

I have actually switched our Grafana dashboards since my last comment. Since some applications have a small request and a large limit (to save money), or have an HPA, just showing a percentage of the limit is sometimes not useful.

So what we do now is display the CPU usage in cores and then add a horizontal line for each of the request and limit. This shows more information, and also shows usage in the same unit that is used in k8s: CPU cores.

CPU usage

Legend: {{container_name}} in {{pod_name}}
Query: sum(rate(container_cpu_usage_seconds_total{pod_name=~"deployment-name-[^-]*-[^-]*$", image!="", container_name!="POD"}[5m])) by (pod_name, container_name)

CPU limit

Legend: limit
Query: sum(kube_pod_container_resource_limits_cpu_cores{pod=~"deployment-name-[^-]*-[^-]*$"}) by (pod)

CPU request

Legend: request
Query: sum(kube_pod_container_resource_requests_cpu_cores{pod=~"deployment-name-[^-]*-[^-]*$"}) by (pod)

You will need to edit these 3 queries for your environment so that only pods from a single deployment are returned, e.g. replace deployment-name.

The pod request/limit metrics come from kube-state-metrics.

We then add 2 series overrides to hide the request and limit in the tooltip and legend:

[screenshot: series override settings]

The result looks like this:

[screenshot: resulting Grafana graph]

You can use container_spec_cpu_shares in place of container_spec_cpu_quota in the original query listed at https://github.com/google/cadvisor/issues/2026#issuecomment-486134079 to pull what appear to be container CPU requests, but this means you can also potentially see CPU utilization over 100% if usage goes over requests.

Thus, if you don't have requests, or if your requests are low with high or nonexistent limits (have you done tuning and performance testing?!), you might get wicked utilization numbers.

Oh, I just read this comment: https://github.com/google/cadvisor/issues/2026#issuecomment-415571557, so cpu_shares does seem to map to container requests.


Hi @max-rocket-internet, I am kinda new to this game of running k8s workloads. Is there a way to add more colour to these CPU utilisation charts? Could I get something like this?

[screenshot: example chart]

Thanks in advance!

If you're running node_exporter in conjunction w/ cAdvisor, you can see the CPU usage from the host's perspective using a PromQL query like this one:

sum by (instance) (irate(container_cpu_usage_seconds_total{cluster=~"$cluster", instance=~"$host", name="$service"}[$__rate_interval])) / sum by (instance) (irate(node_cpu_seconds_total{job="node", instance=~"$host", cluster=~"$cluster"}[$__rate_interval])) * 100
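For a units sanity check: rate(node_cpu_seconds_total[...]) summed across every CPU and mode advances by roughly one second per core per second, so the denominator is effectively the node's core count. A sketch with hypothetical numbers:

```python
# Hypothetical: a container using 0.5 cores on a 4-core node sits at 12.5%
# of the host, which is what the host-perspective query above computes.
container_cores_used = 0.5  # sum of the container's usage rates
node_cores = 4              # sum of node_cpu rates across all CPUs/modes
host_percent = container_cores_used / node_cores * 100
print(host_percent)  # -> 12.5
```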

If you only care about the usage for a specific deployment, you can use:

sum(rate(container_cpu_usage_seconds_total{pod=~"^tiger.*", container!=""}[5m])) by (pod, container) /
sum(container_spec_cpu_quota{pod=~"^tiger.*", container!=""}/container_spec_cpu_period{pod=~"^tiger.*", container!=""}) by (pod, container)

Just replace "tiger" with your deployment name.

