Telegraf: Add cpu_speed (in mhz) to cpu or system measurement

Created on 8 Jun 2018 · 9 Comments · Source: influxdata/telegraf

Add cpu_speed to cpu or system measurement

We are currently in the process of converting from Ganglia to Telegraf. (yeah!) Unfortunately, we have some existing dependence on the Ganglia cpu_speed metric. This is not found in Telegraf.

Proposal:

Add a cpu_speed field or equivalent to the cpu or system measurement. This would be in MHz.

Use case:

This helps mostly in capacity management, for example when mapping the CPU MHz of an application group that is targeted for migration to new hosts. We can get the CPU speed in other ways, of course, but having it directly and natively in Telegraf would be optimal.

area/system, feature request

All 9 comments

A quick look suggests that we could use the cpu.Info() function from gopsutil to pull in some additional cpu fields:

type InfoStat struct {
    CPU        int32    `json:"cpu"`
    VendorID   string   `json:"vendorId"`
    Family     string   `json:"family"`
    Model      string   `json:"model"`
    Stepping   int32    `json:"stepping"`
    PhysicalID string   `json:"physicalId"`
    CoreID     string   `json:"coreId"`
    Cores      int32    `json:"cores"`
    ModelName  string   `json:"modelName"`
    Mhz        float64  `json:"mhz"`
    CacheSize  int32    `json:"cacheSize"`
    Flags      []string `json:"flags"`
    Microcode  string   `json:"microcode"`
}

This reads and parses /proc/cpuinfo on Linux.

Well, it depends on what we're really looking for here. Do we want the maximum speed or the current speed? What about the max or min limits?

I was looking for just the "CPU MHz" field from the Linux command 'lscpu', which appears to be the same as the "cpu MHz" field of each CPU core from 'cat /proc/cpuinfo'. This doesn't change for me and matches the CPU description, like "Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz". I'm on a VMware infrastructure, though.

I was not looking for instantaneous frequency or even max boost frequency, just the frequency that corresponds to the CPU description--which when combined with the CPU core count can help give some comparative sense of capacity among VMs and environments.

This field is the instantaneous frequency of the processor, but there are also min and max values. Here is what they look like on my laptop:

CPU MHz:             499.877
CPU max MHz:         3400.0000
CPU min MHz:         400.0000

Even though min/max don't change, I can see the usefulness of collecting them across a fleet of systems. I think the main thing we should decide is whether collecting this data should be opt-in, or whether it is light enough that we should just add it. I think we can add these 3 fields to the standard fields collected by the cpu plugin, since it should be a fairly light amount of extra load.

What about the limits? Limits might be useful on embedded (or other) systems which adjust the limits to conserve power.
Dunno if gopsutil provides them all in one spot, but they can all be obtained from /sys/devices/system/cpu/cpu*/cpufreq/

Basically all the fields and their relationships with each other are:
cpuinfo_min_freq <= scaling_min_freq <= scaling_cur_freq <= scaling_max_freq <= cpuinfo_max_freq

This feature request should probably also be reconciled with this PR:
https://github.com/influxdata/telegraf/pull/4215

I tried in the past to read lscpu and to input the data into InfluxDB using the exec input plugin.

The values were always higher than what I would get by running the same command from the command line because by the time telegraf gets to run the plugin, the CPU or kernel already increased the frequency.

I would say that the plugin makes little sense, unless it is proven to provide reliable values.

I'm running an E3-1220 v2 on Ubuntu 18.04.

I see what you mean. In use cases like mine (an HPC machine with XX cores), one could assume that the noise introduced by Telegraf itself affects just a few (?) cores. Anyway, I'm going to write an exec() collector for /sys/devices/system/cpu/cpuXXX/cpufreq/cpuinfo_cur_freq and keep it running for some weeks on a few compute nodes to see the real-life results.

Here is a proof-of-concept-grade collector meant to be used as an exec input in Telegraf:

https://github.com/jose-d/telegraf-collectors/blob/master/cpufreq-monitor/give_stats.py

In the end I collect the data from

/sys/devices/system/cpu/cpuNN/cpufreq/scaling_cur_freq, since it is readable by a non-root user (on CentOS 7).

Screenshot from Grafana:

(It actually shows why this monitoring is useful for me: detecting suboptimal usage of CPU resources by $users.)

[Image: Screenshot_2020-07-15 node details - Grafana]
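For readers who prefer a compiled collector, the same idea can be sketched in Go. This is not a port of the linked Python script, just an illustration of reading scaling_cur_freq per CPU and printing InfluxDB line protocol on stdout. The measurement name "cpufreq", the tag "cpu", and the field "cur_mhz" are assumptions chosen for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// formatLine renders one line-protocol point without a timestamp.
// The measurement, tag, and field names are made up for this sketch.
func formatLine(cpu string, khz uint64) string {
	return fmt.Sprintf("cpufreq,cpu=%s cur_mhz=%.3f", cpu, float64(khz)/1000)
}

func main() {
	paths, err := filepath.Glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			continue // skip CPUs whose file cannot be read
		}
		khz, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			continue // skip unexpected content
		}
		// p ends in .../cpuNN/cpufreq/scaling_cur_freq, so the CPU
		// directory name is two levels above the file.
		cpu := filepath.Base(filepath.Dir(filepath.Dir(p)))
		fmt.Println(formatLine(cpu, khz))
	}
}
```

A binary like this could be listed under `commands` in an `[[inputs.exec]]` section with `data_format = "influx"`; since no timestamp is printed, Telegraf stamps each point at collection time.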
