Kibana: APM: Fix and finalize memory and CPU chart metric series

Created on 18 Dec 2018 · 17 Comments · Source: elastic/kibana

UPDATE (from comment thread as of 11 Jan 2019 12:08 ET)

For the memory chart, we're ditching the process memory metrics altogether for now, and going with an "average + worst-case scenario" pair of metrics for this release. Originally, that was going to be "average available + minimum available" for each bucket, which tells users the overall health of all hosts and notifies them if there are outliers. Showing the "minimum" value means we need to switch to % instead of raw GBs for this chart, because the min value in GB will be meaningless without knowing the total GB available on that host, which we can't show.

Unfortunately, having "available memory, percentage" sitting next to "cpu used, percentage" makes for a confusing experience, since the peaks and troughs of the two graphs mean exactly the opposite of each other. For this reason, we're switching the memory chart to show "average used % + maximum used %", which still satisfies the "average + worst-case scenario" requirement but aligns the chart better with its CPU neighbor.

tl;dr the memory chart will be a line graph showing:

  • Average memory used, %
  • Maximum memory used, %

The labels for these series will simply be "System Average" and "System Max", and the chart title will be "Memory usage".
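As a rough sketch, those two series could presumably be computed in a single Elasticsearch query using the scripted-percentage approach discussed later in this thread. This is untested, and the interval and field names are assumptions based on the metrics listed in this issue:

{
  "size": 0,
  "query": {
    "query_string": {
      "query": "processor.event:metric"
    }
  },
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "memoryUsedAvg": {
          "avg": {
            "script": "(doc['system.memory.total'].value - doc['system.memory.actual.free'].value) * 100.0 / doc['system.memory.total'].value"
          }
        },
        "memoryUsedMax": {
          "max": {
            "script": "(doc['system.memory.total'].value - doc['system.memory.actual.free'].value) * 100.0 / doc['system.memory.total'].value"
          }
        }
      }
    }
  }
}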


Original issue:

With the release of elastic/kibana#26608 we now have a metrics tab with memory and CPU charts. Based on the original mock ups and design documents, these charts were meant to display the following metrics:

CPU usage graph
System Avg        - system.cpu.total.norm.pct (avg)
System Max        - system.cpu.total.norm.pct (max)
Process Avg       - system.process.cpu.total.norm.pct (avg)
Process Max       - system.process.cpu.total.norm.pct (max)
Memory usage graph
System available memory   - system.memory.actual.free (avg)
System total memory       - system.memory.total (avg)
Process memory size       - system.process.memory.size (avg)
Process RSS               - system.process.memory.rss.bytes (avg)
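For reference, the system CPU series above correspond roughly to an aggregation like the following (a sketch only; the interval and @timestamp field are assumptions, and the process series would use the system.process.* fields in the same way):

{
  "size": 0,
  "query": {
    "query_string": {
      "query": "processor.event:metric"
    }
  },
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "systemCpuAvg": { "avg": { "field": "system.cpu.total.norm.pct" } },
        "systemCpuMax": { "max": { "field": "system.cpu.total.norm.pct" } }
      }
    }
  }
}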

Since then, we've discovered a number of issues with this set of metrics.

Over-sampled host problem

One problem is that simply averaging all the values in a bucket gives extra weight to hosts that send metric data to the APM server more frequently within a given time-series bucket.

e.g.

12:01:05 Host A (0.20)
12:01:09 Host B (0.21)
12:01:22 Host C (0.64)
12:01:38 Host D (0.41)
12:01:49 Host A (0.21)
12:01:58 Host B (0.20)

If we simply average all the values here in this per-minute bucket, the average would be 0.312 ((0.20 + 0.21 + 0.64 + 0.41 + 0.21 + 0.20) / 6), but that weights A and B more than it should. If we first take an "average-per-host" within each bucket, we would get a more accurate overall value: 0.365 ((0.205 + 0.205 + 0.64 + 0.41) / 4).
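One way to express that per-host averaging in Elasticsearch would be a terms aggregation on the host name inside each time bucket, followed by an avg_bucket pipeline over the per-host averages. A sketch, assuming host.hostname is the field that identifies a host and using the CPU metric as the example:

{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "perHost": {
          "terms": { "field": "host.hostname", "size": 1000 },
          "aggs": {
            "hostAvg": { "avg": { "field": "system.cpu.total.norm.pct" } }
          }
        },
        "avgAcrossHosts": {
          "avg_bucket": { "buckets_path": "perHost>hostAvg" }
        }
      }
    }
  }
}

With the example above, avgAcrossHosts would come out to 0.365 rather than 0.312.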

Outlier problem

This on its own doesn't help users know if a single outlier host is in bad shape, though ... e.g. if you have 100 hosts and 99 have a per-host-per-bucket average value of 0.5 and a single host has an average value of 0.01, the overall average for that bucket will still be 0.495, hiding the problem with that single host.

To help with outliers, we could sum the per-host averages within each bucket. I don't actually know if this makes a huge difference because in the above example, the sum will still not change much if only one host drops off, but the change may show up slightly better in some cases. (Citation needed lol).
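If we wanted to try that, it would presumably be a one-line change to the pipeline step in the sketch above, swapping avg_bucket for sum_bucket:

        "sumAcrossHosts": {
          "sum_bucket": { "buckets_path": "perHost>hostAvg" }
        }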

Container problem

If we sum, multiple containers running on the same machine could report the same values under different virtual host names, so the sum would once again over-sample those container hosts.

Proposed solution, memory chart

For the memory chart, we would change to using these metrics:

  • Total process memory size (sum)
    Total memory used by all hosts in the bucket (after calculating average per host)
  • Minimum system available memory (min)
    Lowest average available memory value within a bucket, helps a user see when a problem with available memory occurs on at least one host
  • Average system available memory (avg)
    Average available memory (after averaging per host) within a bucket, helps a user understand how much of an outlier the min value is

From @roncohen:

the idea would be to show you immediately if any of the systems have low available memory, as well as how much memory in total your service is using

Question: if we are summing all hosts for process memory size, that value won't have much to do with the "available memory" values, so showing them on the same chart with the same Y axis might be confusing?
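For completeness, a sketch of the two available-memory series from this proposal, reusing the per-host structure from the earlier CPU sketch (field names assumed, untested):

{
  "size": 0,
  "aggs": {
    "timeseries": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1m"
      },
      "aggs": {
        "perHost": {
          "terms": { "field": "host.hostname", "size": 1000 },
          "aggs": {
            "freeAvg": { "avg": { "field": "system.memory.actual.free" } }
          }
        },
        "minAvailable": { "min_bucket": { "buckets_path": "perHost>freeAvg" } },
        "avgAvailable": { "avg_bucket": { "buckets_path": "perHost>freeAvg" } }
      }
    }
  }
}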

Proposed solution, CPU chart

Not needed, this chart is fine as-is for now.

Labels: apm, blocker, v6.6.0

All 17 comments

Question: if we are summing all hosts for process memory size, that value won't have much to do with the "available memory" values, so showing them on the same chart with the same Y axis might be confusing?

agreed. That's going to look strange.

Another idea: for memory, I wonder if we could calculate the percentage used for system and for process per document. We could then, for each time bucket, calculate the max and avg, like we have for CPU:

for each document, if we calculate:
(system.memory.total-system.memory.actual.free)/system.memory.total we get the memory used in percent. If we then do the max and avg per time bucket, we end up with system memory usage in percent (max) and system memory usage in percent (avg).

And we can do the same for process. Hopefully that'll give us two graphs with percentages and lines that are useful in all the scenarios?

EDIT: i haven't tested it

so it turns out that we can't reliably get _process_ memory metrics for all agents. Let's completely disregard the _process memory metrics_ for now. We should focus only on _system_ memory metrics, in addition to the CPU metrics.

If the above suggestion is too risky/complicated for now, I suggest we simply remove

  • Total process memory size (sum)

and stick to

  • Minimum system available memory (min) and
  • Average system available memory (avg)

from the original proposal.

For CPU I didn't come across a problem, so let's stick to that unless you've discovered something.

@roncohen the way we are getting CPU metrics today doesn't take into account the "per-bucket-per-host" averaging before it grabs the average values. I assume you want us to change that, but leave everything else the same as far as which metrics we are showing in that graph?

Then for memory, you want to only show available (min and average, so 2 series on that graph), but not total system memory used, correct? I wonder if we should move to a line graph instead of an area graph for that, then, also?

@jasonrhodes the "per-bucket-per-host" averaging unfortunately will not solve our problem if the user has containers because there the hostname is typically a random string (container ID) anyway (even if they run on the same physical host). That means a host running more containers than other hosts will be over-represented in the average.

For non-container setups, I think the "Over-sampled host problem" will have minimal negative impact, because it's very rare to have different sampling periods per host, and outside of that we're talking about one measurement more or less as far as I can tell. Finally, the average will obscure any outliers anyway. So in the interest of time, I suggest we just leave it as it is without accounting for "per-bucket-per-host".

Then for memory, you want to only show available (min and average, so 2 series on that graph), but not total system memory used, correct?

Correct! I think a line graph here would be fine instead of an area graph

@roncohen perfect, thanks!

++ on line graph

Pinging @elastic/apm-ui

@roncohen If we still want relative memory usage we can do this:

{
  "size": 0,
  "query": {
    "query_string": {
      "query": "processor.event:metric"
    }
  },
  "aggs": {
    "memoryUsedRelative": {
      "avg": {
        "script": "(doc['system.memory.total'].value - doc['system.memory.actual.free'].value) * 100.0/doc['system.memory.total'].value"
      }
    }
  }
}

@sqren cool! Do you agree that would be useful?

I think it's a little hard to say without knowing the use case we are optimizing for.

< 10 hosts
If the user only has 5 hosts we can display all of them on the chart individually. In this case we should probably just show the average free memory per host per bucket.

> 10 hosts
If a user has 1000 hosts, we cannot show all of them. Instead we can show an aggregated visualisation like memory free "average" or "minimum". With "average" it is very easy for a single host to hide. With "minimum" the host with least available memory will show up, but it is not possible to see which one it is, or if there are other hosts with this problem. To solve that we'll need a list view.

If there are more than 10 hosts, we can also just show the top 5 hosts according to some dimension (avg memory free over the entire time period).
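That could presumably be done with a terms aggregation ordered by a metric sub-aggregation, e.g. the five hosts with the least free memory over the selected time range (host.hostname is an assumed field name):

{
  "size": 0,
  "aggs": {
    "topHosts": {
      "terms": {
        "field": "host.hostname",
        "size": 5,
        "order": { "freeAvg": "asc" }
      },
      "aggs": {
        "freeAvg": { "avg": { "field": "system.memory.actual.free" } }
      }
    }
  }
}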

the "per-bucket-per-host" averaging unfortunately will not solve our problem if the user has containers because there the hostname is typically a random string (container ID) anyway (even if they run on the same physical host). That means a host running more containers than other hosts will be over-represented in the average.

I'm not sure if this means that we cannot show the hosts individually, or calculate memory per host.
If it does, I'm not sure how our current solution provides value to users with many hosts.

I think it's a little hard to say without knowing the usecase we are optimizing for.

My understanding is that we optimize for the > 10 host scenario, only show average and minimum available memory (both as percents), and recognize that in this first iteration, a user with a single host running out of memory will at least _know_ there is a problem, but won't yet have an easy way from our metrics UI to determine which host it is. Is that how you understand it, @roncohen ?

@jasonrhodes exactly.

I suggest we go with the percentage based memory graph for now. We should show avg and min.

We're optimizing for a compromise between users with fewer than ~10 hosts and users with more than ~10 hosts. It's not going to be ideal for everyone in this first iteration, but it could still provide lots of value.

If you see that the min is significantly less than the avg it could mean something is wrong and you can then go and investigate in more detail.

And it's true that for container-based setups, the hosts that run many containers will be overrepresented in the avg line. It's not ideal, but let's get this in the hands of users and then iterate.

cc @makwarth

@roncohen @sqren @makwarth I was working on this today and realized that Ron has been asking for "available memory" stats, with average and min, but that this graph is going to be sitting next to the CPU graph where the numbers go in the opposite direction and we show "CPU usage" stats with average and max. Seems like we should align these so the two graphs have the same meaning (i.e. an up spike should be good for both or bad for both) -- should we do memory usage with average and max instead?

should we do memory usage with average and max instead?

That would have to be max used, instead of min free as-is now (probably also what you are suggesting).
Makes sense to me 👍

That would have to be max used, instead of min free as-is now

Yep, that's what I meant by "memory usage", vs what we have now which is "available memory".

These are easy changes, but take a bit, so I'm going to hold off until I hear from Ron and/or Rasmus on a final go-ahead. I'll pick something else up in the meantime.

that sounds good to me 👍
