Node_exporter: Kernel bug skewing node_cpu{mode="steal"} metrics

Created on 21 Nov 2017  路  22Comments  路  Source: prometheus/node_exporter

Host operating system: output of uname -a

Debian 9
Linux ip-10-11-1-110 4.9.0-4-amd64 #1 SMP Debian 4.9.51-1 (2017-09-28) x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v0.15.0

node_exporter command line flags

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

sum(rate(node_cpu{instance="$instance"}[1m])) by (mode) * 100 / count_scalar(node_cpu{mode="user", instance="$instance"})

What did you expect to see?

Around 100% usage if we aggregate all mode values

What did you see instead?

Some level of steal is expected as this is running on AWS, but I believe the value is out of proportion.

Element | Value
-- | --
{mode="guest"} | 0
{mode="user"} | 0
{mode="iowait"} | 0.033333333333366175
{mode="system"} | 0
{mode="irq"} | 0
{mode="idle"} | 94.95555555551417
{mode="steal"} | 67456900655.031654
{mode="nice"} | 0
{mode="softirq"} | 0

Stats

# cat /proc/stat
cpu  10054 454 10504 1435808 6220 0 64 1802 0 0
cpu0 4970 268 5291 717279 3584 0 8 752 0 0
cpu1 5084 186 5212 718529 2636 0 56 1049 0 0
pendinclose

Most helpful comment

According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10.

It also appears that similar bugs existed before in Kernel 3.x, but was fixed in 4.x < 4.8.

Note that not all guests handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is decreasing on every
clock tick.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61

All 22 comments

This sounds like you're getting funny data out of /proc/stat at some point. In order to debug this we need the raw samples from your system.

You can query this with the instant API or expression browser.

node_cpu{instance="$instance",mode="steal"}[5m]

Here:

Element | Value
-- | --
node_cpu{cpu="cpu1",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} | 13823647282.09 @1511271040.93112768804791.68 @1511271055.93111656879505.85 @1511271070.93110569135126.41 @1511271085.9319520238760.52 @1511271100.9318489180768.2 @1511271115.9317448212568.35 @1511271130.9316378702970.78 @1511271145.9315347644978.46 @1511271160.9314318965435.95 @1511271175.9313243113305.56 @1511271190.9312174396524.6 @1511271205.9311105283335.33 @1511271220.93128241980.03 @1511271235.93117391998573.92 @1511271250.93116297118845.05 @1511271265.93115237519455.02 @1511271280.93114185451822.72 @1511271295.93113121491941.37 @1511271310.93112064667409.46 @1511271325.931
node_cpu{cpu="cpu0",instance="$instance",instance_ip="10.10.1.232",job="node-exporter",mode="steal"} | 14983696025.52 @1511271040.93113903879812.11 @1511271055.93112808207266.65 @1511271070.93111754950409.44 @1511271085.93110629547241.36 @1511271100.9319505729706.48 @1511271115.9318397768503.66 @1511271130.9317265626394.45 @1511271145.9316177089198.41 @1511271160.9315109561642.35 @1511271175.9314041241269.69 @1511271190.9312933676475.17 @1511271205.9311838400338 @1511271220.931755016449.89 @1511271235.93118087060379.65 @1511271250.93116983063259.85 @1511271265.93115901661413.24 @1511271280.93114786564860.99 @1511271295.93113680189291.38 @1511271310.93112567867597.25 @1511271325.931

Although as you can see on my /proc/stat they look normal, or you mean at some point in the past?

It looks like the formatting got a little messed up, but after cleaning it up you can see that what is being recorded is completely different from what /proc/stat says. Something is getting corrupted.

13823647282.09 1511271040.931
12768804791.68 1511271055.931
11656879505.85 1511271070.931
10569135126.41 1511271085.931
9520238760.52 1511271100.931
8489180768.2 1511271115.93
17448212568.35 1511271130.931
6378702970.78 1511271145.931
5347644978.46 1511271160.931
4318965435.95 1511271175.931
3243113305.56 1511271190.931
2174396524.6 1511271205.931
1105283335.33 1511271220.931
28241980.03 1511271235.931
17391998573.92 1511271250.931
16297118845.05 1511271265.931
15237519455.02 1511271280.931
14185451822.72 1511271295.931
13121491941.37 1511271310.931
12064667409.46 1511271325.931

What do you get if you manually curl 'http://$instance:9100/metrics' | grep node_cpu?

This is in docker, so using the mapped port:

> curl 'http://localhost:19997/metrics' | grep node_cpu
# TYPE node_cpu counter
node_cpu{cpu="cpu0",mode="guest"} 0
node_cpu{cpu="cpu0",mode="idle"} 998557.46
node_cpu{cpu="cpu0",mode="iowait"} 492.78
node_cpu{cpu="cpu0",mode="irq"} 0
node_cpu{cpu="cpu0",mode="nice"} 6.02
node_cpu{cpu="cpu0",mode="softirq"} 1.91
node_cpu{cpu="cpu0",mode="steal"} 1.208129581039e+10
node_cpu{cpu="cpu0",mode="system"} 661.86
node_cpu{cpu="cpu0",mode="user"} 798.07
node_cpu{cpu="cpu1",mode="guest"} 0
node_cpu{cpu="cpu1",mode="idle"} 998280.19
node_cpu{cpu="cpu1",mode="iowait"} 312.78
node_cpu{cpu="cpu1",mode="irq"} 0
node_cpu{cpu="cpu1",mode="nice"} 1.84
node_cpu{cpu="cpu1",mode="softirq"} 7.64
node_cpu{cpu="cpu1",mode="steal"} 6.6865899412e+08
node_cpu{cpu="cpu1",mode="system"} 670.77
node_cpu{cpu="cpu1",mode="user"} 807.26

Ok I found a major problem with my report, I apologize for the confusion.
The config for the current process is actually:

/usr/bin/docker run \
    -v /proc:/host/proc:ro \
    -v /sys:/host/sys:ro \
    -v /:/rootfs:ro \
    -e HOST_HOSTNAME="/rootfs/etc/hostname" \
    -p 19997:9100 \
    --pid="host" \
    --name %i \
    basi/node-exporter:v1.14.0 \
    -collector.procfs "/host/proc" \
    -collector.sysfs /host/sys \
    -collector.textfile.directory /etc/node-exporter/ \
    -collectors.enabled 'conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs,systemd' \
    -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($$|/)"

I got the wrong information from my config VCS (using 0.15), than the current environment running version (0.14)

PS: Made an edit to my first report, to avoid confusion.

Ahh, great, that makes a lot more sense. 馃槂

Sorry SuperQ the issue is still relevant to 0.14, I just pasted the wrong 'command I was running on the server'.
Ill deploy 0.15 and test before we maybe reopen this as to ensure it was not fixed somehow on 0.15

Ok. I suspect there's some kind of corruption inside docker that is messing up reading the values from /proc/stat. The only way to really know what's going on is to run a debugging binary with a bunch of printing of the raw data.

Not related to node_exporter, but I'm seeing something similar with steal time on a (cloudvps) node.

Idle time (should be at most 100):

# num=-1; while true; do new=$(awk '/^cpu2 /{print $5}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
99
95
56
19
95
93
92
95
...

^-- looks okay

Steal time:

# num=-1; while true; do new=$(awk '/^cpu2 /{print $9}' /proc/stat); \
    test $num -ne -1 && echo $((new-num)); num=$new; sleep 1; done 
-832457433
-396408302
-753175773
-515330792
-792816603
-951379924
-832457433
-554971622
...

^-- off the charts

# uname -a
Linux hostname 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu xenial system, with dmidecode:

System Information
        Manufacturer: OpenStack Foundation
        Product Name: OpenStack Nova
        Version: 2014.1.3

In our case, this was observed when Zabbix reports 100% steal time continuously, while other tools like top and vmstat only report some (when diff > 0).

Interestingly, of the 4 cpu's, CPU0 appears to behave ok with only small steal-time increments.

I'm curious to know what you see on your systems.

We just found a node which experiences the same issue. It's running on EC2 and reports a decreasing steal counter. Applying rate on a decreasing counter will then result in the report 5B CPU seconds spent in steal mode per second ;-)

$ for i in {1..10}; do cat /proc/stat | grep cpu0; sleep 1; done                                                                                                                                                                                                      cpu0 22536 17835 22383 42574061 31924 0 584 1262648371943 0 0
cpu0 22536 17835 22383 42574160 31924 0 584 1258406803117 0 0
cpu0 22536 17835 22383 42574258 31924 0 584 1253808466819 0 0
cpu0 22536 17835 22383 42574357 31924 0 584 1249011926370 0 0
cpu0 22536 17835 22383 42574455 31924 0 584 1242827956866 0 0
cpu0 22536 17835 22383 42574553 31924 0 584 1237278240645 0 0
cpu0 22536 17835 22383 42574651 31924 0 584 1230975348650 0 0
cpu0 22536 17835 22383 42574749 31924 0 584 1224038203373 0 0
cpu0 22536 17835 22383 42574847 31924 0 584 1218765972963 0 0
cpu0 22536 17835 22383 42574945 31924 0 584 1212938770930 0 0

Kernel: 4.10.15

According to https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/ this affects guest kernel versions 4.8, 4.9 and 4.10.

It also appears that similar bugs existed before in Kernel 3.x, but was fixed in 4.x < 4.8.

Note that not all guests handle this condition in the same way: 3.2
guests still get the overflow in /proc/stat, but their scheduler
continues to work as expected. 3.16 guests OTOH go nuts once steal time
overflows and stop accumulating system & user time, while entering an
erratic state where steal time in /proc/stat is decreasing on every
clock tick.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785557;msg=61

Amazing find, I did not have time to correctly add more information but we still see the issue, thanks for reopening/finding this.

Oh nice :)
Interesting question how to solve this from our side. At least we should keep this issue open for tracking but we could go as far as logging a warning for known-to-be-bad kernel versions?

@discordianfish We had a chat on IRC and mostly agreed that tracking kernel bugs is mostly out of scope for the node_exporter. However we want to document this case in the README under known issues.

One thing that came up in conversation was the option to split out the steal mode from the normal metric, like we do with guest and guest_nice.


What do people think about this?

Is there double counting in node_cpu due to steal? That'd be the only reason to split it out in my mind.

I did some investigation, and I came to the conclusion that "steal" is valid as a label. I don't know if there's a lot we can do in the node_exporter, as this is a kernel bug.

@grobie @discordianfish Do you think there is anything we should do about this given it's a kernel bug, not a node_exporter bug?

Hi @SuperQ , I don't think node_exporter should do anything to get around this issue. Since the issue indicates an abnormal situation, it is exactly the expected behavior for node_exporter to catch and report it.

While we could choose to emit warnings (log and/or new metric) if run on problematic kernels, I don't think that would be sustainable.

The issue only happens (occasionally) after a VM live migration. We can:

  1. Upgrade to a newer kernel;
  2. Or pay more attention to the status of the VM during migrations (so that we can act quickly after the issue pops up).

Sadly, as far as I know, the only way to fix the /proc/stat counters is to restart the VM.

Yes I think we can't do much here, so let's close this.

Was this page helpful?
0 / 5 - 0 ratings