Beats: Inconsistent CPU Percentage Calculation (Process vs System)

Created on 6 Jun 2017 · 12 comments · Source: elastic/beats

per @andrewkroh

The process times are collected using GetProcessTimes. The descriptions of the out params say that the times reported are summed across cores (so you can get greater than 100% usage). The code used by Metricbeat is here.

The overall system CPU time is collected using GetSystemTimes. The behavior is similar to GetProcessTimes. The documentation states, "On a multiprocessor system, the values returned are the sum of the designated times across all processors."
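
Both calls are reachable from Go via the golang.org/x/sys/windows package. Here is a minimal, Windows-only sketch (not Metricbeat's actual code) of reading the raw times that feed these calculations:

  package main

  import (
      "fmt"

      "golang.org/x/sys/windows"
  )

  func main() {
      // Per-process times; kernel and user are summed across all cores.
      var creation, exit, kernel, user windows.Filetime
      if err := windows.GetProcessTimes(windows.CurrentProcess(), &creation, &exit, &kernel, &user); err != nil {
          panic(err)
      }
      fmt.Println("process kernel+user ns:", kernel.Nanoseconds()+user.Nanoseconds())

      // System-wide times; also summed across all processors.
      // Note: the kernel time returned by GetSystemTimes includes the idle time.
      var idle, sysKernel, sysUser windows.Filetime
      if err := windows.GetSystemTimes(&idle, &sysKernel, &sysUser); err != nil {
          panic(err)
      }
      fmt.Println("system idle ns:", idle.Nanoseconds())
  }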

[There is] a difference between overall CPU usage and process CPU usage. In the overall CPU usage calculation, the total time value is calculated by summing the parts (i.e. idle + kernel + user). In the process CPU calculation, the total time is measured using the difference in wall-clock times between samples. Assuming you want 100% to be the max, using wall-clock time causes the percentage to be wrong for multi-core systems and inconsistent with the overall CPU percentage value.
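
To make the discrepancy concrete, consider a process that fully occupies 2 of 4 cores for one second between samples. A sketch of the two formulas as described above (illustrative numbers, not Metricbeat's actual code):

  package main

  import "fmt"

  func main() {
      // A process fully occupying 2 of 4 cores over a 1-second interval.
      const (
          procDelta = 2.0 // process kernel+user time, summed across cores (s)
          wallDelta = 1.0 // wall-clock time between samples (s)
          busyDelta = 2.0 // system kernel+user (non-idle) time, all cores (s)
          idleDelta = 2.0 // system idle time, all cores (s)
      )

      // Overall CPU: total = idle + kernel + user, so 100% is the max.
      systemPct := busyDelta / (busyDelta + idleDelta) // => 0.50 (50%)

      // Process CPU: total = wall-clock delta, so it can exceed 100%.
      processPct := procDelta / wallDelta // => 2.00 (200%)

      fmt.Printf("system: %.0f%%, process: %.0f%%\n", systemPct*100, processPct*100)
  }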

I think we need a change to make percentages be consistent so that they can be compared. We need to decide if we want 100% to be max or if we want 100% * number_of_cores to be the max.

Metricbeat bug v6.0.0-alpha2


All 12 comments

Thanks @andrewkroh for the great analysis and find. If we make it 100% * number_of_cores that would be more consistent with what we have on Linux, right? I'd vote for that, in that case, so that people can compare the values across systems.

I changed this to bug so we tackle it for 6.0-GA.

I'd vote for 100% being the max. I think it would be easier to interpret the data because it doesn't require any knowledge of the number of cores (a value that's not reported, AFAIK, unless you correlate it with the cores metricset data). Using 100% as max would normalize the data so you can compare values across all systems regardless of the core count. The downside is that you lose some sense of the magnitude (2 cores maxed out vs. 22 cores maxed out).

Regardless of the final decision, I think it would be useful to include the number of cores as a metric in any CPU-related metricsets.

After the discussion we had yesterday, my vote would be to support both use cases "natively". Perhaps in the process metricset we could have two metrics:

  • system.process.cpu.total.pct - this one can go over 100%, like in top
  • system.process.cpu.total.normalized.pct. Defined as cpu.total.pct / number_of_cores. Its max is 100%.

My understanding is that this would be backwards compatible since in the current version system.process.cpu.total.pct can go over 100% on all platforms.
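
A small sketch of how the normalized value would be derived (the helper name is hypothetical, not a Beats API):

  package main

  import (
      "fmt"
      "runtime"
  )

  // normalizedPct is a hypothetical helper: it divides the raw percentage
  // (which may exceed 1.0 on multi-core systems) by the number of logical cores.
  func normalizedPct(totalPct float64) float64 {
      return totalPct / float64(runtime.NumCPU())
  }

  func main() {
      // e.g. a raw value of 1.8 (180%) on a 4-core machine prints 0.45.
      fmt.Printf("%.2f\n", normalizedPct(1.8))
  }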

With the above, the cpu metricset (system-wide) should use the same conventions, but there are two issues:

  • It would be a backwards-compatibility (BWC) change for the cpu.total.pct field, at least on Windows (is it on all platforms?)
  • There are multiple CPU times in that metricset data.json. Adding two values for each would cause a significant increase in disk space.

Perhaps we could add the normalized values behind an option, like we do for ticks?

Regardless, I think we should also export system.cpu.number_of_cores as a new metric.

Any chance this can be backported to 5.x?

Hmm, perhaps we can backport the non-BWC bits. Let's first have a concrete PR and we can discuss on it.

it would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)

Yeah, this would affect all platforms.

I just noticed that we use norm in the load metricset for the normalized load values. It would be inconsistent to use normalized. Should we

  1. change load to use normalized,
  2. use norm in cpu, core, and process,
  3. or be inconsistent and not change load?

Perhaps we could add the normalized values behind an option, like we do for ticks?

Instead of adding additional include_normalized or normalized.enable options, I propose we let the user specify a list so that they can pick and choose what to include. This would deprecate the cpu_ticks option.

  load.metrics: [averages, normalized_averages]
  cpu.metrics:  [percentages, normalized_percentages, ticks]
  core.metrics: [percentages, ticks]
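
A rough sketch of how a metricset could honor such a list (the config shape, field names, and helper here are illustrative, not an actual Beats API):

  package main

  import (
      "fmt"
      "runtime"
  )

  // metricsConfig is a hypothetical stand-in for the proposed option,
  // e.g. the value of `cpu.metrics` from the config file.
  type metricsConfig struct {
      Metrics []string
  }

  func (c metricsConfig) has(name string) bool {
      for _, m := range c.Metrics {
          if m == name {
              return true
          }
      }
      return false
  }

  func main() {
      cfg := metricsConfig{Metrics: []string{"percentages", "normalized_percentages"}}
      totalPct := 1.8 // raw value, summed across cores

      // Only attach the fields the user asked for.
      event := map[string]interface{}{}
      if cfg.has("percentages") {
          event["total.pct"] = totalPct
      }
      if cfg.has("normalized_percentages") {
          event["total.norm.pct"] = totalPct / float64(runtime.NumCPU())
      }
      fmt.Println(event)
  }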

@ruflin pointed out that norm is in the guidelines: https://www.elastic.co/guide/en/beats/libbeat/current/event-conventions.html#abbreviations

Not a fan of the abbreviation, but I think we should use norm consistently in this case.

I think this can be closed.
