I have enabled "container_cpu_load_average_10s" for my cluster and it's working well. Now when I view this metric in my Prometheus browser, it gives values like 1000 or 1100.
As I understand it, the load average is the number of processes waiting to be run, and if that definition is correct then I really doubt the calculated value reflects the actual situation.
I want to understand what these metric values indicate and whether there is anything I can do to make them useful.
I'm seeing values I don't understand, too. A quick look with ps inside my container shows only one or two processes in the R or D state:
root@nginx-7f679d96bc-9t7zr:/# ps -eLo state,tid,args | awk '$1 ~ /^(R|D)/'
R 591 stress-ng -c 1 -l 100
R 750 ps -eLo state,tid,args
However, container_cpu_load_average_10s is approximately 800.
I'm not sure whether CPU limits play a part in the cAdvisor calculation, but this container has a limit of 250m on a VM with 4 vCPUs, and I'm pushing the CPU with stress-ng to test my Prometheus setup (perhaps CPU throttling has an impact on this metric?).
@dashpole any ideas here? I am still getting these strange values.
How can I make these values useful for my monitoring?
Thanks
From https://github.com/google/cadvisor/blob/master/info/v1/container.go#L320:
"Smoothed average of number of runnable threads x 1000. We multiply by thousand to avoid using floats, but preserving precision. Load is smoothed over the last 10 seconds. Instantaneous value can be read from LoadStats.NrRunning."
Finally, a reason why this load metric becomes so high on containers.