Beats: Inconsistent CPU Percentage Calculation (Process vs System)

Created on 6 Jun 2017 · 12 comments · Source: elastic/beats

per @andrewkroh

The process times are collected using GetProcessTimes. The descriptions of the out params say that the times reported are summed across cores (so you can get greater than 100% usage). The code used by Metricbeat is here.

The overall system CPU time is collected using GetSystemTimes. The behavior is similar to GetProcessTimes. The documentation states, "On a multiprocessor system, the values returned are the sum of the designated times across all processors."
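
Both calls are reachable from Go via the golang.org/x/sys/windows package. Here is a minimal, Windows-only sketch (not Metricbeat's actual code) of reading the raw times that feed these calculations:

  package main

  import (
      "fmt"

      "golang.org/x/sys/windows"
  )

  func main() {
      // Per-process times; kernel and user are summed across all cores.
      var creation, exit, kernel, user windows.Filetime
      if err := windows.GetProcessTimes(windows.CurrentProcess(), &creation, &exit, &kernel, &user); err != nil {
          panic(err)
      }
      fmt.Println("process kernel+user ns:", kernel.Nanoseconds()+user.Nanoseconds())

      // System-wide times; also summed across all processors.
      // Note: the kernel time returned by GetSystemTimes includes the idle time.
      var idle, sysKernel, sysUser windows.Filetime
      if err := windows.GetSystemTimes(&idle, &sysKernel, &sysUser); err != nil {
          panic(err)
      }
      fmt.Println("system idle ns:", idle.Nanoseconds())
  }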

[There is] a difference between overall CPU usage and process CPU usage. In the overall CPU usage calculation, the total time value is calculated by summing the parts (i.e. idle + kernel + user). In the process CPU calculation, the total time is measured using the difference in wall-clock times between samples. Assuming you want 100% to be the max, using wall-clock time causes the percentage to be wrong for multi-core systems and inconsistent with the overall CPU percentage value.
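
To make the discrepancy concrete, consider a process that fully occupies 2 of 4 cores for one second between samples. A sketch of the two formulas as described above (illustrative numbers, not Metricbeat's actual code):

  package main

  import "fmt"

  func main() {
      // A process fully occupying 2 of 4 cores over a 1-second interval.
      const (
          procDelta = 2.0 // process kernel+user time, summed across cores (s)
          wallDelta = 1.0 // wall-clock time between samples (s)
          busyDelta = 2.0 // system kernel+user (non-idle) time, all cores (s)
          idleDelta = 2.0 // system idle time, all cores (s)
      )

      // Overall CPU: total = idle + kernel + user, so 100% is the max.
      systemPct := busyDelta / (busyDelta + idleDelta) // => 0.50 (50%)

      // Process CPU: total = wall-clock delta, so it can exceed 100%.
      processPct := procDelta / wallDelta // => 2.00 (200%)

      fmt.Printf("system: %.0f%%, process: %.0f%%\n", systemPct*100, processPct*100)
  }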

I think we need a change to make percentages be consistent so that they can be compared. We need to decide if we want 100% to be max or if we want 100% * number_of_cores to be the max.

Metricbeat bug v6.0.0-alpha2


All 12 comments

Thanks @andrewkroh for the great analysis and find. If we make it 100% * number_of_cores that would be more consistent with what we have on Linux, right? I'd vote for that, in that case, so that people can compare the values across systems.

I changed this to bug so we tackle it for 6.0-GA.

I'd vote for 100% being the max. I think it would be easier to interpret the data because it doesn't require any knowledge of the number of cores (a value that's not reported, AFAIK, unless you correlate it with the cores metricset data). Using 100% as max would normalize the data so you can compare values across all systems regardless of the core count. The downside is that you lose some sense of the magnitude (2 cores maxed out vs. 22 cores maxed out).

Regardless of the final decision, I think it would be useful to include the number of cores as a metric in any CPU-related metricsets.

After the discussion we had yesterday, my vote would be to support both use cases "natively". Perhaps in the process metricset we could have two metrics:

  • system.process.cpu.total.pct - this one can go over 100%, like in top
  • system.process.cpu.total.normalized.pct. Defined as cpu.total.pct / number_of_cores. Its max is 100%.

My understanding is that this would be backwards compatible since in the current version system.process.cpu.total.pct can go over 100% on all platforms.
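
A small sketch of how the normalized value would be derived (the helper name is hypothetical, not a Beats API):

  package main

  import (
      "fmt"
      "runtime"
  )

  // normalizedPct is a hypothetical helper: it divides the raw percentage
  // (which may exceed 1.0 on multi-core systems) by the number of logical cores.
  func normalizedPct(totalPct float64) float64 {
      return totalPct / float64(runtime.NumCPU())
  }

  func main() {
      // e.g. a raw value of 1.8 (180%) on a 4-core machine prints 0.45.
      fmt.Printf("%.2f\n", normalizedPct(1.8))
  }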

With the above, the cpu metricset (system-wide) should use the same conventions, but there are two issues:

  • It would be a backwards-compatibility (BWC) change for the cpu.total.pct field, at least on Windows (is it on all platforms?)
  • There are multiple CPU times in that metricset data.json. Adding two values for each would cause a significant increase in disk space.

Perhaps we could add the normalized values behind an option, like we do for ticks?

Regardless, I think we should also export system.cpu.number_of_cores as a new metric.

Any chance this can be backported to 5.x?

Hmm, perhaps we can backport the non-BWC bits. Let's first have a concrete PR and we can discuss on it.

it would be a BWC change for the cpu.total.pct field, at least on Windows (is it on all platforms?)

Yeah, this would affect all platforms.

I just noticed that we use norm in the load metricset for the normalized load values. It would be inconsistent to use normalized. Should we

  1. change load to use normalized,
  2. use norm in cpu, core, and process,
  3. or be inconsistent and not change load?

Perhaps we could add the normalized values behind an option, like we do for ticks?

Instead of adding additional include_normalized or normalized.enable options, I propose we let the user specify a list so that they can pick and choose what to include. This would deprecate the cpu_ticks option.

  load.metrics: [averages, normalized_averages]
  cpu.metrics:  [percentages, normalized_percentages, ticks]
  core.metrics: [percentages, ticks]
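
A rough sketch of how a metricset could honor such a list (the config shape, field names, and helper here are illustrative, not an actual Beats API):

  package main

  import (
      "fmt"
      "runtime"
  )

  // metricsConfig is a hypothetical stand-in for the proposed option,
  // e.g. the value of `cpu.metrics` from the config file.
  type metricsConfig struct {
      Metrics []string
  }

  func (c metricsConfig) has(name string) bool {
      for _, m := range c.Metrics {
          if m == name {
              return true
          }
      }
      return false
  }

  func main() {
      cfg := metricsConfig{Metrics: []string{"percentages", "normalized_percentages"}}
      totalPct := 1.8 // raw value, summed across cores

      // Only attach the fields the user asked for.
      event := map[string]interface{}{}
      if cfg.has("percentages") {
          event["total.pct"] = totalPct
      }
      if cfg.has("normalized_percentages") {
          event["total.norm.pct"] = totalPct / float64(runtime.NumCPU())
      }
      fmt.Println(event)
  }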

@ruflin pointed out that norm is in the guidelines: https://www.elastic.co/guide/en/beats/libbeat/current/event-conventions.html#abbreviations

Not a fan of the abbreviation, but I think we should use norm consistently in this case.

I think this can be closed.
