@SuperQ @rtreffer
uname -a:
Linux hostname 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
node_exporter -version:
node_exporter, version 0.14.0 (branch: zoneinfo_collector, revision: 97643f5dcb1f5b637977e8f99892dd55d0b34cac)
build user: trangoni@sargas
build date: 20170814-10:37:55
go version: go1.8.3
Running node_exporter in Docker: No
# /usr/sbin/node_exporter --collectors.enabled 'cpu'
INFO[0000] Starting node_exporter (version=0.14.0, branch=zoneinfo_collector, revision=97643f5dcb1f5b637977e8f99892dd55d0b34cac) source="node_exporter.go:137"
INFO[0000] Build context (go=go1.8.3, user=trangoni@sargas, date=20170814-10:37:55) source="node_exporter.go:138"
INFO[0000] Enabled collectors: source="node_exporter.go:157"
INFO[0000] - cpu source="node_exporter.go:159"
INFO[0000] Listening on :9100 source="node_exporter.go:183"
ERRO[0003] ERROR: cpu collector failed after 0.003984s: open /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory source="node_exporter.go:94"
Expected: the node_cpu_frequency_hertz metric or, at least, that node_exporter handles the missing file without logging errors.
Here is what the cpufreq directory looks like for 'cpu0' on a CentOS 7.3 Haswell server:
# for file in /sys/bus/cpu/devices/cpu0/cpufreq/*; do echo "$file: $(cat $file)" ; done
/sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq: 1200195
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq: 3300000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency: 4294967295
/sys/bus/cpu/devices/cpu0/cpufreq/related_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors: performance powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver: intel_pstate
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor: powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq: 3300000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed: <unsupported>
See also Red Hat Bugzilla #1085525; it seems it won't be fixed upstream.
cpuinfo_cur_freq works as expected, but it is readable only by root:
-r-------- 1 root root 4096 Aug 16 12:36 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq
No cpu_freq or thermal_throttle metrics at all. Instead, node_exporter logs errors about the missing scaling_cur_freq file.
Well, it seems to be fixed upstream by this commit,
3.18-rc2:
commit c034b02e213d271b98c45c4a7b54af8f69aaac1e
Author: Dirk Brandewie
Date: Mon Oct 13 08:37:40 2014 -0700
cpufreq: expose scaling_cur_freq sysfs file for set_policy() drivers
And fortunately, this is fixed in RHEL 7.4 too,
# rpm -q kernel-3.10.0-693.el7.x86_64 --changelog | grep scaling_cur_freq
- [cpufreq] expose scaling_cur_freq sysfs file for set_policy() drivers (Oleksandr Natalenko) [1382608]
See,
# for file in /sys/bus/cpu/devices/cpu0/cpufreq/*; do echo "$file: $(cat $file)" ; done
/sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq: 1486007
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq: 1900000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency: 4294967295
/sys/bus/cpu/devices/cpu0/cpufreq/related_cpus: 0
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors: performance powersave
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: 1486007
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver: intel_pstate
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor: performance
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq: 1900000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq: 1200000
/sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed: <unsupported>
We will have to wait for CentOS 7.4.
Feel free to consider handling this case or closing the issue.
That's a pretty annoying kernel bug. I think the node_exporter handling in this case is ok. It knows that there should be a file to read because the directory is there. It produces an error because the file permissions are messed up.
Thanks for documenting the bug so others can find the fix.
@SuperQ Just to clarify: This is not a permission problem. The file scaling_cur_freq simply does not exist and the cpu loop returns prematurely - even though the other files do exist and could be used to provide interesting metrics.
IMHO it would be nice if this case were handled more gracefully, as it causes lots of missing metrics on RHEL/CentOS 7 systems that are not already running the very latest 7.4 version (which is not even released in the case of CentOS).
@knweiss Hmm, that's a tough call. I would prefer to fail the collector rather than return partial results in this case. But, it may be worth changing this policy.
If a metric doesn't exist due to the software version not supporting it, it's fine not to return it. What's not fine is sometimes returning it and sometimes not, and that sort of failure is where you need to be careful with partial results.
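To make that distinction concrete, here is a minimal sketch of treating a file the kernel does not provide as "metric not available" while still failing on any other read or parse error. It is not the node_exporter code; `parseKHz` and `readOptionalKHz` are hypothetical names, and the sketch assumes each cpufreq file holds a single integer in kHz.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseKHz parses the single integer (in kHz) that a cpufreq sysfs file contains.
func parseKHz(content string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(content), 10, 64)
}

// readOptionalKHz reads one cpufreq file. A missing file is not an error:
// ok is false and err is nil, so the caller can skip just that metric.
// Any other failure (permissions, unparsable content) is still reported.
func readOptionalKHz(path string) (khz uint64, ok bool, err error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return 0, false, nil
	}
	if err != nil {
		return 0, false, err
	}
	khz, err = parseKHz(string(data))
	if err != nil {
		return 0, false, fmt.Errorf("parse %s: %w", path, err)
	}
	return khz, true, nil
}

func main() {
	dirs, _ := filepath.Glob("/sys/bus/cpu/devices/cpu[0-9]*/cpufreq")
	for _, dir := range dirs {
		for _, name := range []string{"scaling_cur_freq", "scaling_min_freq", "scaling_max_freq"} {
			khz, ok, err := readOptionalKHz(filepath.Join(dir, name))
			switch {
			case err != nil:
				fmt.Fprintln(os.Stderr, "error:", err)
			case !ok:
				fmt.Printf("%s/%s: not present, skipping\n", dir, name)
			default:
				// sysfs reports kHz; a *_hertz metric needs Hz.
				fmt.Printf("%s/%s: %d Hz\n", dir, name, khz*1000)
			}
		}
	}
}
```

On a given kernel the file is either always there or never there, so skipping it this way should not produce the "sometimes returned, sometimes not" behaviour described above.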
Yes, the design idea here is that if you get metrics from a specific collector, it will return "all" of the metrics you would expect from that collector. Having partial results breaks this design idea as @brian-brazil points out.
@brian-brazil @SuperQ I have one question though: Does the partial results rule apply to a) the cpu collector as a whole or b) to each of its major sections (UpdateStat metrics, cpufreq metrics, thermal_throttle metrics)?
If the partial results rule applies to the entire cpu collector it would be broken on RHEL 7.3 because it only returns the UpdateStat() metrics because of the missing scaling_cur_freq file. I.e. a partial result.
Also, please take a look at the cpu loop in updateCPUfreq(). If one of the directories cpufreq or thermal_throttle is missing, the code logs a debug message (but does not return an error!) and continues. This causes partial results by definition a).
OTOH: If the partial results rule applies to each of the cpu collector's major sections, the thermal_throttle section of the cpu loop (its files exist even on RHEL 7.3!) probably should not be skipped entirely just because the cpufreq section, which is executed first, returns early on the missing scaling_cur_freq file.
Yea, it's a bit of a mess. My view is b, the major sections. We could split out the cpufreq and thermal_throttle sections as separate collectors. These are only useful for hardware users, and not VM users.
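A rough sketch of view b), where each major section succeeds or fails on its own and a failure in cpufreq does not prevent thermal_throttle from running. The `section` type, `runSections`, and `errMissingFile` are made up for illustration and do not correspond to the actual collector code.

```go
package main

import (
	"errors"
	"fmt"
)

// errMissingFile stands in for the "no such file or directory" error from the logs.
var errMissingFile = errors.New("open scaling_cur_freq: no such file or directory")

// section is a hypothetical unit of the cpu collector
// (for example stat, cpufreq, thermal_throttle).
type section struct {
	name   string
	update func() error
}

// runSections runs every section and records failures by section name.
// A failing section does not stop the sections that come after it.
func runSections(sections []section) map[string]error {
	failed := make(map[string]error)
	for _, s := range sections {
		if err := s.update(); err != nil {
			failed[s.name] = err
		}
	}
	return failed
}

func main() {
	var collected []string
	sections := []section{
		{"stat", func() error { collected = append(collected, "stat"); return nil }},
		{"cpufreq", func() error { return errMissingFile }},
		{"thermal_throttle", func() error { collected = append(collected, "thermal_throttle"); return nil }},
	}
	failed := runSections(sections)
	fmt.Println("collected:", collected)
	for name, err := range failed {
		fmt.Printf("section %s failed: %v\n", name, err)
	}
}
```

Splitting the sections into separate collectors, as suggested above, would give the same isolation and would also let VM users disable the hardware-only parts.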
I'd be fine with that.
I saw a fix was merged in via https://github.com/prometheus/node_exporter/pull/657
However, I am running node-exporter v0.15.1 on RHEL 7.3 and still see the same issue ...
kubernetes.host:ip-10-103-105-37.eu-central-1.compute.internal kubernetes.container_name:node-exporter @timestamp:December 20th 2017, 15:09:21.262 log:time=\"2017-12-20T15:09:21Z\" level=error msg=\"ERROR: cpu collector failed after 0.000752s: open /host/sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory\" source=\"collector.go:123\"\n
It looks like this error causes the pod to go into SandboxChanged and restart.
Note: The file doesn't exist on the node, so that part is accurate.
node_exporter: v0.15.1 running out of systemd
os: centos
machine: r4.large aws-ec2
I'm also seeing these issues with the 0.15.1 release of node_exporter outside of Kubernetes:
Jan 16 08:53:19 elasticsearch-01 node_exporter[9723]: time="2018-01-16T08:53:19-06:00" level=error msg="ERROR: cpu collector failed after 0.000357s: open /sys/bus/cpu/devices/cpu0/cpufreq/scaling_cur_freq: no such file or directory" source="collector.go:123"
uname:
Linux elasticsearch-01 3.10.0-514.21.1.el7.x86_64 #1 SMP Thu May 25 17:04:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Running from EC2 r4.large, and can confirm that the file doesn't exist.
Is there a way to ignore these errors or ignore the given file?
# find /sys/bus/cpu/devices/*/cpufreq/ -ls
13172 0 drwxr-xr-x 2 root root 0 Jun 1 2017 /sys/bus/cpu/devices/cpu0/cpufreq/
13180 0 -rw-r--r-- 1 root root 4096 Jun 1 2017 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_governor
13175 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_transition_latency
13181 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_driver
13184 0 -r-------- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_cur_freq
13182 0 -r--r--r-- 1 root root 4096 Jun 1 2017 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_available_governors
13174 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_max_freq
13173 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/cpuinfo_min_freq
13177 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_max_freq
13178 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/affected_cpus
13176 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_min_freq
13179 0 -r--r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/related_cpus
13183 0 -rw-r--r-- 1 root root 4096 Jan 16 08:50 /sys/bus/cpu/devices/cpu0/cpufreq/scaling_setspeed
13185 0 drwxr-xr-x 2 root root 0 Jun 1 2017 /sys/bus/cpu/devices/cpu1/cpufreq/
13193 0 -rw-r--r-- 1 root root 4096 Jun 1 2017 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_governor
13188 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_transition_latency
13194 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_driver
13197 0 -r-------- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_cur_freq
13195 0 -r--r--r-- 1 root root 4096 Jun 1 2017 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_available_governors
13187 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_max_freq
13186 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/cpuinfo_min_freq
13190 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_max_freq
13191 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/affected_cpus
13189 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_min_freq
13192 0 -r--r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/related_cpus
13196 0 -rw-r--r-- 1 root root 4096 Jan 16 08:53 /sys/bus/cpu/devices/cpu1/cpufreq/scaling_setspeed
What is the affected Linux version range? Does this mean 3.18 is the oldest supported Linux version? If this is the case, we should document this restriction.
copy-and-paste: I guess we should handle both scaling_XXX and cpuinfo_XXX. There are still a lot of users running 3.x with old distros.
@matthiasr What about the issue that the cpuinfo_cur_freq file is only readable by root? In theory it should work with CAP_DAC_READ_SEARCH. Slightly less dangerous than running as root.
Yeah, that's also a problem. I think the collector should try its best to get all the information it can, and we can document the limitations ("on Linux < 3.18, run as root or set the following capabilities"). If no fallback works, the log message could then give actionable advice.
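The fallback idea could look roughly like the sketch below: try scaling_cur_freq, fall back to cpuinfo_cur_freq, and if neither is usable return an error that tells the operator what to do. The names `currentFreqKHz`, `readFunc`, and `fakeDir` are invented for this example, and the wording of the advice is only a suggestion.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readFunc matches os.ReadFile so a fake can be substituted in demos and tests.
type readFunc func(path string) ([]byte, error)

// currentFreqKHz tries scaling_cur_freq first and falls back to cpuinfo_cur_freq.
// It returns the value in kHz and the name of the file it came from.
func currentFreqKHz(dir string, read readFunc) (uint64, string, error) {
	var reasons []string
	for _, name := range []string{"scaling_cur_freq", "cpuinfo_cur_freq"} {
		data, err := read(filepath.Join(dir, name))
		if err == nil {
			khz, perr := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
			if perr == nil {
				return khz, name, nil
			}
			err = perr
		}
		switch {
		case errors.Is(err, fs.ErrNotExist):
			reasons = append(reasons, name+" does not exist")
		case errors.Is(err, fs.ErrPermission):
			reasons = append(reasons, name+" is not readable by this user")
		default:
			reasons = append(reasons, name+": "+err.Error())
		}
	}
	return 0, "", fmt.Errorf("no current CPU frequency in %s (%s); on Linux < 3.18, run as root or grant CAP_DAC_READ_SEARCH",
		dir, strings.Join(reasons, "; "))
}

// fakeDir simulates a cpufreq directory for the demo below.
type fakeDir struct {
	files    map[string]string // readable files and their contents
	rootOnly map[string]bool   // files that exist but deny access
}

func (d fakeDir) read(path string) ([]byte, error) {
	name := filepath.Base(path)
	if d.rootOnly[name] {
		return nil, &fs.PathError{Op: "open", Path: path, Err: fs.ErrPermission}
	}
	if content, ok := d.files[name]; ok {
		return []byte(content), nil
	}
	return nil, &fs.PathError{Op: "open", Path: path, Err: fs.ErrNotExist}
}

func main() {
	dir := "/sys/bus/cpu/devices/cpu0/cpufreq"
	// Simulated RHEL 7.3 as a non-root user: scaling_cur_freq is missing
	// and cpuinfo_cur_freq is root-only.
	unprivileged := fakeDir{rootOnly: map[string]bool{"cpuinfo_cur_freq": true}}
	if _, _, err := currentFreqKHz(dir, unprivileged.read); err != nil {
		fmt.Println("error:", err)
	}
	// The same call against the real sysfs tree of this machine.
	khz, source, err := currentFreqKHz(dir, os.ReadFile)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d kHz (from %s)\n", khz, source)
}
```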
I think this is still an issue, re-opening.
I've started working on making the parsing of these files more robust. See https://github.com/prometheus/procfs/pull/94.
In addition, I tracked down more precisely when Red Hat backported the fix to RHEL 7:
* Tue Nov 01 2016 Rafael Aquini [3.10.0-518.el7]
- [cpufreq] expose scaling_cur_freq sysfs file for set_policy() drivers (Oleksandr Natalenko) [1382608]
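Taken together with the upstream commit in 3.18-rc2, the affected range can be sketched as a check on the kernel release string. This is only a heuristic built from the two data points in this thread (upstream 3.18 and 3.10.0-518.el7), and it only concerns set_policy() drivers such as intel_pstate; `exposesScalingCurFreq` is a hypothetical helper, and other distributions may carry their own backports.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// releaseRE matches "MAJOR.MINOR[.PATCH]" optionally followed by "-BUILD",
// e.g. "3.10.0-514.26.2.el7.x86_64" or "4.4.0-116-generic".
var releaseRE = regexp.MustCompile(`^(\d+)\.(\d+)(?:\.(\d+))?(?:-(\d+))?`)

// exposesScalingCurFreq guesses whether a kernel with the given release string
// exposes scaling_cur_freq for set_policy() drivers.
// Assumptions: upstream >= 3.18 has the fix, and RHEL/CentOS 7 kernels
// have it from build 3.10.0-518.el7 onwards.
func exposesScalingCurFreq(release string) (bool, error) {
	m := releaseRE.FindStringSubmatch(release)
	if m == nil {
		return false, fmt.Errorf("unrecognized kernel release %q", release)
	}
	major, err := strconv.Atoi(m[1])
	if err != nil {
		return false, err
	}
	minor, err := strconv.Atoi(m[2])
	if err != nil {
		return false, err
	}
	if major > 3 || (major == 3 && minor >= 18) {
		return true, nil
	}
	// RHEL/CentOS 7 backport: compare the build number after the dash.
	if strings.Contains(release, ".el7") && major == 3 && minor == 10 && m[4] != "" {
		build, err := strconv.Atoi(m[4])
		if err != nil {
			return false, err
		}
		return build >= 518, nil
	}
	return false, nil
}

func main() {
	for _, release := range []string{
		"3.10.0-514.26.2.el7.x86_64", // RHEL 7.3 kernel from the original report
		"3.10.0-693.el7.x86_64",      // RHEL 7.4 kernel
	} {
		ok, err := exposesScalingCurFreq(release)
		fmt.Println(release, ok, err)
	}
}
```

Checking whether the file exists at runtime is more reliable than parsing versions; a check like this is mainly useful for documenting the restriction or for wording a log message.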
This should be fixed by #1117