Node_exporter: cpu collector: cpufreq/scaling_cur_freq not created by intel_pstate governor

Created on 16 Aug 2017 · 20 Comments · Source: prometheus/node_exporter

@SuperQ @rtreffer

Host operating system: output of `uname -a`

Linux hostname 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of `node_exporter -version`

node_exporter, version 0.14.0 (branch: zoneinfo_collector, revision: 97643f5dcb1f5b637977e8f99892dd55d0b34cac)
  build user:       trangoni@sargas
  build date:       20170814-10:37:55
  go version:       go1.8.3

Are you running node_exporter in Docker?

What did you do that produced an error?

# /usr/sbin/node_exporter --collectors.enabled 'cpu'
INFO[0000] Starting node_exporter (version=0.14.0, branch=zoneinfo_collector, revision=97643f5dcb1f5b637977e8f99892dd55d0b34cac)  source="node_exporter.go:137"
INFO[0000] Build context (go=go1.8.3, user=trangoni@sargas, date=20170814-10:37:55)  source="node_exporter.go:138"
INFO[0000] Enabled collectors:                           source="node_exporter.go:157"
INFO[0000]  - cpu                                        source="node_exporter.go:159"
INFO[0000] Listening on :9100                            source="node_exporter.go:183"
ERRO[0003] ERROR: cpu collector failed after 0.003984s: open /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory  source="node_exporter.go:94"

What did you expect to see?

A valid node_cpu_frequency_hertz metric, or at least that node_exporter handles the missing file without logging errors.

Here is what the cpufreq directory looks like for 'cpu0' on a CentOS 7.3 Haswell server:

# for file in /sys/bus/cpu/devices/cpu0/cpufreq/*; do echo "$file: $(cat $file)" ; done
/sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq: 1200195
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq: 3300000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency: 4294967295
/sys/bus/cpu/devices/cpu0/cpufreq/related_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors: performance powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver: intel_pstate
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor: powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq: 3300000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed: <unsupported>

See also Red Hat Bugzilla #1085525; it seems this won't be fixed upstream.

cpuinfo_cur_freq works as expected, but it is only readable by root:
-r-------- 1 root root 4096 Aug 16 12:36 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq

What did you see instead?

No cpu_freq or thermal_throttle metrics at all. Instead, node_exporter logs an error about the missing scaling_cur_freq file.

accepted bug

mjtrangoni

All 20 comments

Well, it seems to be fixed upstream with this commit:

3.18-rc2:

commit c034b02e213d271b98c45c4a7b54af8f69aaac1e
Author: Dirk Brandewie <[email protected]>
Date:   Mon Oct 13 08:37:40 2014 -0700

    cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers

Fortunately, this is fixed in RHEL 7.4 too:

# rpm -q kernel-3.10.0-693.el7.x86_64 --changelog | grep scaling_cur_freq
- [cpufreq] expose scaling_cur_freq sysfs file for set_policy() drivers (Oleksandr Natalenko) [1382608]

With that kernel, scaling_cur_freq is present:

# for file in /sys/bus/cpu/devices/cpu0/cpufreq/*; do echo "$file: $(cat $file)" ; done
/sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq: 1486007
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq: 1900000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency: 4294967295
/sys/bus/cpu/devices/cpu0/cpufreq/related_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors: performance powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: 1486007
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver: intel_pstate
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor: performance
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq: 1900000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed: <unsupported>

We will have to wait for CentOS 7.4.
Feel free to consider handling this case or closing the issue.

mjtrangoni on 16 Aug 2017

That's a pretty annoying kernel bug. I think the node_exporter handling in this case is ok. It knows that there should be a file to read because the directory is there. It produces an error because the file permissions are messed up.

Thanks for documenting the bug so others can find the fix.

SuperQ on 16 Aug 2017

@SuperQ Just to clarify: This is not a permission problem. The file scaling_cur_freq simply does not exist and the cpu loop returns prematurely - even though the other files do exist and could be used to provide interesting metrics.

IMHO it would be nice if this case were handled more gracefully, as it causes lots of missing metrics on RHEL/CentOS 7 systems that are not already running the very latest 7.4 version (which is not even released in the case of CentOS).
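As an illustration of what "more gracefully" could mean, here is a minimal Go sketch that reads a set of cpufreq files and skips any that do not exist, instead of returning on the first missing one. This is not the node_exporter code; the function names `readKHz` and `cpufreqValues` are hypothetical, and the sketch assumes the files hold a single integer in kHz, as in the listings above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readKHz (hypothetical helper) reads one cpufreq sysfs file holding a
// frequency in kHz. The bool is false, with a nil error, when the file
// simply does not exist; any other failure is returned as an error.
func readKHz(path string) (uint64, bool, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return 0, false, nil
	}
	if err != nil {
		return 0, false, err
	}
	khz, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		return 0, false, err
	}
	return khz, true, nil
}

// cpufreqValues (hypothetical helper) collects whichever of the named
// files exist in dir. Missing files are skipped; other errors abort.
func cpufreqValues(dir string, names []string) (map[string]uint64, error) {
	out := make(map[string]uint64)
	for _, name := range names {
		v, ok, err := readKHz(filepath.Join(dir, name))
		if err != nil {
			return nil, fmt.Errorf("reading %s: %w", name, err)
		}
		if ok {
			out[name] = v
		}
	}
	return out, nil
}

func main() {
	dir := "/sys/bus/cpu/devices/cpu0/cpufreq"
	names := []string{"scaling_cur_freq", "scaling_min_freq", "scaling_max_freq"}
	vals, err := cpufreqValues(dir, names)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cpufreq:", err)
		os.Exit(1)
	}
	for _, name := range names {
		if v, ok := vals[name]; ok {
			fmt.Printf("%s: %d Hz\n", name, v*1000) // kHz to Hz
		} else {
			fmt.Printf("%s: not present, skipped\n", name)
		}
	}
}
```

On a kernel without scaling_cur_freq, this would still report scaling_min_freq and scaling_max_freq. Whether skipping is the right policy is exactly what the rest of this thread debates.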

knweiss on 23 Aug 2017

@knweiss Hmm, that's a tough call. I would prefer to fail the collector rather than return partial results in this case. But, it may be worth changing this policy.

SuperQ on 23 Aug 2017

If a metric doesn't exist due to the software version not supporting it, it's fine not to return it. What's not fine is sometimes returning it and sometimes not, and that sort of failure is where you need to be careful with partial results.

brian-brazil on 24 Aug 2017

👍1

Yes, the design idea here is that if you get metrics from a specific collector, it will return "all" of the metrics you would expect from that collector. Having partial results breaks this design idea as @brian-brazil points out.

SuperQ on 24 Aug 2017

@brian-brazil @SuperQ I have one question though: Does the partial results rule apply to a) the cpu collector as a whole or b) to each of its major sections (UpdateStat metrics, cpufreq metrics, thermal_throttle metrics)?

If the partial results rule applies to the entire cpu collector it would be broken on RHEL 7.3 because it only returns the UpdateStat() metrics because of the missing scaling_cur_freq file. I.e. a partial result.

Also, please take a look at the cpu loop in updateCPUfreq(). If one of the directories cpufreq or thermal_throttle is missing, the code logs a debug message (but does not return an error!) and continues. This causes partial results by definition a).

OTOH: If the partial results rule applies to each of the cpu collector's major sections, then the thermal_throttle section in the cpu loop (its files exist even on RHEL 7.3!) probably should not be omitted entirely just because of the early return in the cpufreq section, which is caused by the missing scaling_cur_freq file and is executed first.

knweiss on 24 Aug 2017

Yea, it's a bit of a mess. My view is b, the major sections. We could split out the cpufreq and thermal_throttle sections as separate collectors. These are only useful for hardware users, and not VM users.
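To make view b) concrete, here is a rough Go sketch in which each major section runs independently, so that a failure in one section does not suppress the others. This is not how node_exporter is structured; the `section` type, `runSections`, and the stub section bodies are all hypothetical, and `errors.Join` requires Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingFile stands in for the real "no such file or directory" error.
var errMissingFile = errors.New("open scaling_cur_freq: no such file or directory")

// section is a hypothetical unit of work inside one collector.
type section struct {
	name   string
	update func() error
}

// runSections runs every section even if an earlier one fails and
// returns all failures joined together (nil if none failed).
func runSections(sections []section) error {
	var errs []error
	for _, s := range sections {
		if err := s.update(); err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", s.name, err))
		}
	}
	return errors.Join(errs...)
}

func main() {
	sections := []section{
		{"stat", func() error { fmt.Println("stat metrics collected"); return nil }},
		{"cpufreq", func() error { return errMissingFile }},
		{"thermal_throttle", func() error { fmt.Println("thermal_throttle metrics collected"); return nil }},
	}
	if err := runSections(sections); err != nil {
		fmt.Println("sections with errors:", err)
	}
}
```

Splitting cpufreq and thermal_throttle into separate collectors, as suggested above, would achieve the same isolation at the collector level.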

SuperQ on 24 Aug 2017

I'd be fine with that.

grobie on 26 Aug 2017

I saw a fix was merged in via https://github.com/prometheus/node_exporter/pull/657

However, I am running node-exporter v0.15.1 on RHEL 7.3 and still see the same issue ...

kubernetes.host:ip-10-103-105-37.eu-central-1.compute.internal kubernetes.container_name:node-exporter @timestamp:December 20th 2017, 15:09:21.262 log:time=\"2017-12-20T15:09:21Z\" level=error msg=\"ERROR: cpu collector failed after 0.000752s: open /host/sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory\" source=\"collector.go:123\"\n

It looks like this error causes the pod to go into SandboxChanged and restart.

Note: The file doesn't exist on the node, so that part of the message is accurate.

swade1987 on 20 Dec 2017

node_exporter: v0.15.1 running out of systemd
os: centos
machine: r4.large aws-ec2

I'm also seeing these issues with the 0.15.1 release of node_exporter outside of Kubernetes:

Jan 16 08:53:19 elasticsearch-01 node_exporter[9723]: time="2018-01-16T08:53:19-06:00" level=error msg="ERROR: cpu collector failed after 0.000357s: open /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory" source="collector.go:123"

uname:

Linux elasticsearch-01 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Running from EC2 r4.large, and can confirm that the file doesn't exist.

Is there a way to ignore these errors or ignore the given file?

jbkc85 on 16 Jan 2018

# find /sys/bus/cpu/devices/*/cpufreq/ -ls
13172 0 drwxr-xr-x 2 root root    0 Jun  1  2017 /sys/bus/cpu/devices/cpu0/cpufreq/
13180 0 -rw-r--r-- 1 root root 4096 Jun  1  2017 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
13175 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency
13181 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver
13184 0 -r-------- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq
13182 0 -r--r--r-- 1 root root 4096 Jun  1  2017 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors
13174 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq
13173 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq
13177 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq
13178 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus
13176 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq
13179 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/related_cpus
13183 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed
13185 0 drwxr-xr-x 2 root root    0 Jun  1  2017 /sys/bus/cpu/devices/cpu1/cpufreq/
13193 0 -rw-r--r-- 1 root root 4096 Jun  1  2017 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_governor
13188 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_transition_latency
13194 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_driver
13197 0 -r-------- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_cur_freq
13195 0 -r--r--r-- 1 root root 4096 Jun  1  2017 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_available_governors
13187 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_max_freq
13186 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_min_freq
13190 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_max_freq
13191 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/affected_cpus
13189 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_min_freq
13192 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/related_cpus
13196 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_setspeed

jbkc85 on 16 Jan 2018

#657 is completely unrelated to this. As stated above, this is a kernel bug. You can either upgrade your kernel, disable the cpu collector, or just live with this error, since it should still return results for the other metrics. Are there any more important metrics missing?

discordianfish on 16 Jan 2018

What is the affected Linux version range? Does this mean 3.18 is the oldest supported Linux version? If this is the case, we should document this restriction.

matthiasr on 25 Jan 2018

copy-and-paste: I guess we should handle both scaling_XXX and cpuinfo_XXX. There are still a lot of users running 3.x with old distros.

SuperQ on 25 Jan 2018

@matthiasr What about the issue that the cpuinfo_cur_freq file is only readable by root? In theory it should work with CAP_DAC_READ_SEARCH. Slightly less dangerous than running as root.

SuperQ on 26 Jan 2018

Yeah, that's also a problem. I think the collector should try its best to get all the information it can, and we can document the limitations ("on Linux < 3.18, run as root or set the following capabilities"). If no fallback works, the log message could then give actionable advice.
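A minimal Go sketch of that fallback idea: try scaling_cur_freq first, then cpuinfo_cur_freq, and attach actionable advice if the fallback fails on permissions. The function name `currentFreqKHz` and the wording of the advice are hypothetical, and the claim that CAP_DAC_READ_SEARCH is sufficient is taken from the comment above ("in theory"), not verified here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// currentFreqKHz (hypothetical helper) returns the current frequency in
// kHz from the first readable candidate file in dir.
func currentFreqKHz(dir string) (uint64, error) {
	var lastErr error
	for _, name := range []string{"scaling_cur_freq", "cpuinfo_cur_freq"} {
		data, err := os.ReadFile(filepath.Join(dir, name))
		if err != nil {
			lastErr = err // remember why this candidate failed, try the next
			continue
		}
		return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	}
	if errors.Is(lastErr, fs.ErrPermission) {
		// Advice wording is an assumption based on the discussion above.
		return 0, fmt.Errorf("%w (scaling_cur_freq is unavailable and cpuinfo_cur_freq is readable only by root: "+
			"run as root or grant CAP_DAC_READ_SEARCH)", lastErr)
	}
	return 0, lastErr
}

func main() {
	khz, err := currentFreqKHz("/sys/bus/cpu/devices/cpu0/cpufreq")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cpufreq:", err)
		os.Exit(1)
	}
	fmt.Printf("cpu0 current frequency: %d Hz\n", khz*1000) // kHz to Hz
}
```

One design question this sketch leaves open is whether the two files are equivalent enough to feed the same metric; in the RHEL 7.4 listing earlier in the thread they report the same value.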

matthiasr on 26 Jan 2018

👍1

I think this is still an issue, re-opening.

SuperQ on 12 Jun 2018

I've started working on making the parsing of these files more robust. See https://github.com/prometheus/procfs/pull/94.

In addition, I tracked down more precisely when Red Hat backported the fix to RHEL 7:

* Tue Nov 01 2016 Rafael Aquini <[email protected]> [3.10.0-518.el7]
- [cpufreq] expose scaling_cur_freq sysfs file for set_policy() drivers (Oleksandr Natalenko) [1382608]
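For anyone who wants to check a host against these version numbers, here is a small Go sketch that classifies a `uname -r` string. It assumes, based only on the changelog entry above and the 3.18-rc2 commit quoted earlier, that upstream kernels 3.18 and newer and RHEL 7 kernels with a build number of 518 or higher have the file; the function name `hasScalingCurFreqFix` is hypothetical, and backports in other distributions are not covered.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// hasScalingCurFreqFix (hypothetical helper) guesses from a `uname -r`
// string whether the kernel contains the scaling_cur_freq fix.
// Assumptions: upstream >= 3.18, or RHEL 7 (3.10.0-N.el7) with N >= 518.
func hasScalingCurFreqFix(release string) (bool, error) {
	// "3.10.0-514.26.2.el7.x86_64" -> ["3" "10" "0" "514" "26" "2" "el7" "x86_64"]
	fields := strings.FieldsFunc(release, func(r rune) bool {
		return r == '.' || r == '-'
	})
	if len(fields) < 2 {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	major, err := strconv.Atoi(fields[0])
	if err != nil {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	minor, err := strconv.Atoi(fields[1])
	if err != nil {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	if major > 3 || (major == 3 && minor >= 18) {
		return true, nil
	}
	if major == 3 && minor == 10 && strings.Contains(release, ".el7") && len(fields) >= 4 {
		build, err := strconv.Atoi(fields[3])
		if err != nil {
			return false, fmt.Errorf("unrecognized RHEL build in %q", release)
		}
		return build >= 518, nil
	}
	return false, nil
}

func main() {
	for _, rel := range []string{
		"3.10.0-514.26.2.el7.x86_64", // kernel from the original report
		"3.10.0-693.el7.x86_64",      // RHEL 7.4 kernel mentioned above
	} {
		ok, err := hasScalingCurFreqFix(rel)
		fmt.Println(rel, ok, err)
	}
}
```

Checking for the file directly is more reliable than parsing version strings; the sketch is only meant to make the version ranges discussed in this thread explicit.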

SuperQ on 13 Jun 2018

This should be fixed by #1117

SuperQ on 18 Oct 2018
