Kibana: [Metrics UI] Enhanced host details - Processes - Additional Features

Created on 20 Nov 2020 · 26 comments · Source: elastic/kibana

Some acceptance criteria from https://github.com/elastic/kibana/issues/80307 still need to be met:

  • De-couple the queries for the data needed for the expanded view and only make that request for an individually expanded accordion on demand, rather than loading it as part of the list query for all processes.

Note: We are NOT going to implement the sparkline visualizations from the design, to avoid performance bucketing problems.

  • ~Implement the View Trace in APM button~ Now handled by https://github.com/elastic/kibana/issues/84849
    > APM correlation
    >
    > host.name, process.pid, @timestamp can be used to correlate a process with an APM service, that can place an APM agent language icon beside the process name on the table.
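    A hypothetical sketch of that correlation lookup, assuming the APM data is queryable through an `apm-*` index pattern and carries ECS `host.name` and `process.pid` fields (the index pattern and field availability are assumptions, not confirmed in this thread):

    ```json
    GET apm-*/_search
    {
      "size": 1,
      "query": {
        "bool": {
          "filter": [
            { "term": { "host.name": "my-host" } },
            { "term": { "process.pid": 1234 } },
            { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } }
          ]
        }
      },
      "_source": ["service.name", "agent.name"]
    }
    ```

    If a match comes back, `agent.name` could drive the APM agent language icon next to the process name.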

Discussion about how to link to APM is partially here -> https://github.com/elastic/kibana/issues/80307#issuecomment-716916730

  • ~Group queries by process.command_line instead of system.process.cmdline for proper ECS adherence (https://github.com/elastic/beats/pull/22325)~ Holding off on this; seems to be buggy
Labels: Metrics UI, logs-metrics-ui, enhancement

Most helpful comment

@jasonrhodes correct, @alex-fedotyev and I discussed this and came to the conclusion that linking to a service is better than to a particular trace.

All 26 comments

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@elastic/apm @sqren @sorantis @nehaduggal can any of you help us understand the "View Trace in APM" requirement in this ticket? How can we most easily/safely link to APM from a running process with host.name, process.pid, and @timestamp? And would we be linking to a _trace_ or would it more likely be to a service page? Thanks!

UPDATE: I just saw this comment -> https://github.com/elastic/kibana/issues/80307#issuecomment-716916730

It looks like there is still some ambiguity about where exactly to link?

After meeting with @hbharding, @sorantis, @simianhacker, @Zacqary, @phillipb, and @katefarrar about this work, we decided to make a few performance and UX optimizations.

Step 1 (these will be added to the AC for this ticket):

  • Remove the sparkline visualizations from the process list view
  • De-couple the queries for the data needed for the expanded view and only make that request for an individually expanded accordion on demand, rather than loading it as part of the list query for all processes.

Step 2 (this will be moved to a new, separate ticket):

  • Consider whether we need to query for all processes or whether a Top N Processes by X view would be more useful for users and more performant. We still need to settle on a good value or set of values for N (10? 20?) and a way to easily switch X (CPU, memory).

> It looks like there is still some ambiguity about where exactly to link?

In that discussion we talked about linking to a service given a host.name and process.pid. I gave the following suggestion:

# link:
/app/apm/link-to/service?host=my-host&pid=my-pid 

# redirects to:
/app/apm/my-java-service/overview

If we want to link to a specific trace some additional info is needed. Obviously if you have access to a trace.id that would solve it since we already have /app/apm/link-to/trace/${trace.id}. Without trace.id I'm not sure how we can do it. So linking to a service might do it for now.

@jasonrhodes The sparkline provides a quick overview of recent process behavior and the plan was to standardize the use of sparklines for other table views, e.g. containers, pods, processes, and (in the future) functions. We're already using it in other places, like APM. I'd not consider removing it until we have a good understanding of the actual performance impact.

> After meeting with @hbharding, @sorantis, @simianhacker, @Zacqary, @phillipb, and @katefarrar about this work, we decided to make a few performance and UX optimizations.

Can we start by getting a performance baseline of the current query vs new ones?

  1. Current query with sparklines
  2. Current query without sparklines
  3. Query for TOP 10/20/30 with sparklines
  4. Query for TOP 10/20/30 without sparklines

@sorantis if we don't get rid of the sparklines we are sort of back at square one on what to do with this ticket. The performance complexity that the sparklines introduce is exponential because of max bucket size problems. I think choosing to standardize on those may end up creating a ton of issues across the app, if we aren't very careful.

@sgrodzicki yeah we can do some query comparisons. I was hoping we could do something quick and easy now and do something more holistic after that, but it sounds like we need more information. I'll work with @Zacqary to look at those query numbers and we'll report back.

@sqren I wasn't sure if that "link-to" link exists already? If it does, that looks like it would be perfect for us.

As for trace vs service, it feels like we want to link to service. @sorantis can you confirm?

@jasonrhodes correct, @alex-fedotyev and I discussed this and came to the conclusion that linking to a service is better than to a particular trace.

OK so for the query performance issues, let's do what @sgrodzicki suggested before we make any changes to the UI/queries. To do this, I'd like to see a full example query for each of the following 4 scenarios attached to this ticket, so we can reference them later.

  1. Current query, as-is
  2. Current query, without the data needed for sparklines
  3. Query for TOP 10/20/30 (with the sparkline data)
  4. Query for TOP 10/20/30 (without the sparkline data)

Then we can profile those queries against the dev-next cluster and see the differences in timing. That's still just one somewhat arbitrary set of data, but it's at least a start so we can understand which decision makes sense for now, as well as for moving forward.

Let me know if that doesn't make sense.

> @sqren I wasn't sure if that "link-to" link exists already? If it does, that looks like it would be perfect for us.

It doesn't. But it should be trivial to implement if this is what you need.

1. Current Request (from /api/infra/metrics_api)

This request is made via the new Metrics API. It is called multiple times with the afterKey until there are no more results available. Once we have all the results, they are sorted in memory based on the requested field. The fields for the meta key are extracted from the last bucket.

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-15m",
              "lte": "now"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.cmdline"
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupings": {
      "composite": {
        "size": 9,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "system.process.cmdline"
              }
            }
          }
        ]
      },
      "aggs": {
        "histogram": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "30s",
            "offset": "0s",
            "extended_bounds": {
              "min": "now-15m",
              "max": "now"
            }
          },
          "aggregations": {
            "cpu": {
              "avg": {
                "field": "system.process.cpu.total.norm.pct"
              }
            },
            "memory": {
              "avg": {
                "field": "system.process.memory.rss.pct"
              }
            },
            "meta": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "@timestamp": {
                      "order": "desc"
                    }
                  }
                ],
                "_source": [
                  "system.process.cpu.start_time",
                  "system.process.state",
                  "user.name",
                  "process.pid"
                ]
              }
            }
          }
        }
      }
    }
  }
}
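The afterKey pagination described above works by echoing the `after_key` from each response back as the `after` parameter of the next request. A sketch of a follow-up page (the `after` value is illustrative, and the sub-aggregations are omitted for brevity):

```json
GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m", "lte": "now" } } },
        { "exists": { "field": "system.process.cmdline" } },
        { "term": { "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c" } }
      ]
    }
  },
  "aggs": {
    "groupings": {
      "composite": {
        "size": 9,
        "after": { "groupBy0": "/usr/bin/previous-page-last-cmdline" },
        "sources": [
          { "groupBy0": { "terms": { "field": "system.process.cmdline" } } }
        ]
      }
    }
  }
}
```

Each page repeats the full histogram and meta sub-aggregations, which is part of why the request is expensive when a host has many distinct command lines.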
2. Current query (without data for sparklines)

This query doesn't really exist but I imagine this is what it would look like. It would need to be a custom query that uses the same mechanism to retrieve all the results and sort in memory just like the Metrics API request.

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1m",
              "lte": "now"
            }
          }
        },
        {
          "exists": {
            "field": "system.process.cmdline"
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "groupings": {
      "composite": {
        "size": 9,
        "sources": [
          {
            "groupBy0": {
              "terms": {
                "field": "system.process.cmdline"
              }
            }
          }
        ]
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.norm.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.cpu.start_time",
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        }
      }
    }
  }
}

3. Top N for Process with Sparklines

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-15m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "processes": {
      "terms": {
        "field": "system.process.cmdline",
        "size": 20,
        "order": {
          "cpu": "desc"
        }
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "time": {
          "max": {
            "field": "system.process.cpu.start_time"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        },
        "sparklines": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m",
            "extended_bounds": {
              "max": "now",
              "min": "now-15m"
            }
          },
          "aggs": {
            "cpu": {
              "avg": {
                "field": "system.process.cpu.total.pct"
              }
            },
            "memory": {
              "avg": {
                "field": "system.process.memory.rss.pct"
              }
            }
          }
        }
      }
    }
  }
}

4. Top N for Process without Sparklines

GET metricbeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-1m",
              "lte": "now"
            }
          }
        },
        {
          "term": {
            "host.name": "gke-dev-next-oblt-dev-next-oblt-pool-404d7f0c-lr5c"
          }
        }
      ]
    }
  },
  "aggs": {
    "processes": {
      "terms": {
        "field": "system.process.cmdline",
        "size": 20,
        "order": {
          "cpu": "desc"
        }
      },
      "aggs": {
        "cpu": {
          "avg": {
            "field": "system.process.cpu.total.pct"
          }
        },
        "memory": {
          "avg": {
            "field": "system.process.memory.rss.pct"
          }
        },
        "time": {
          "max": {
            "field": "system.process.cpu.start_time"
          }
        },
        "meta": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "@timestamp": {
                  "order": "desc"
                }
              }
            ],
            "_source": [
              "system.process.state",
              "user.name",
              "process.pid"
            ]
          }
        }
      }
    }
  }
}
> 1. Current query, as-is

Tested this on the gke-* hosts on dev-next. Took 40 seconds the first time, but as I continued retrying it to get an average, the query time decreased by about 10 seconds each time until finally settling around 15-18 seconds.

> 2. Current query, without the data needed for sparklines

For this test, I reduced the timerange to look back at 1 minute of data instead of 15 minutes.

These took between 5-10 seconds, though this was after running the first test, so it may still have been benefiting from whatever caching caused the sparkline-compatible query to drastically speed up after multiple refreshes.

Will post results of the TOP queries after testing them.

I updated the ranges to reflect what we are doing in production. Everything with the date histogram should be the last 15 minutes and anything without should be the last 1 minute. To be fair, we should probably make 2 requests: one for the summary data over the last 1 minute and one for the sparklines over the last 15 minutes.
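A sketch of what that two-request split could look like using `_msearch` (one 1-minute summary body and one 15-minute sparkline body; the host name is shortened to `my-host` and the aggregations are trimmed to the CPU average only for readability):

```json
GET metricbeat-*/_msearch
{}
{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-1m","lte":"now"}}},{"term":{"host.name":"my-host"}}]}},"aggs":{"processes":{"terms":{"field":"system.process.cmdline","size":20,"order":{"cpu":"desc"}},"aggs":{"cpu":{"avg":{"field":"system.process.cpu.total.pct"}}}}}}
{}
{"size":0,"query":{"bool":{"filter":[{"range":{"@timestamp":{"gte":"now-15m","lte":"now"}}},{"term":{"host.name":"my-host"}}]}},"aggs":{"processes":{"terms":{"field":"system.process.cmdline","size":20},"aggs":{"sparklines":{"date_histogram":{"field":"@timestamp","fixed_interval":"1m"},"aggs":{"cpu":{"avg":{"field":"system.process.cpu.total.pct"}}}}}}}}
```

The two searches run in one round trip, and the heavier sparkline body no longer holds up the summary data.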

@Zacqary On the subsequent requests for #1, did you change the after key? That would make it more realistic because they are paginating.

@simianhacker I didn't, I just kept refreshing the page because I wanted to get an average, but then it turned out it sped up.

This was done through the Inventory view piping it through the Metrics API instead of using the Dev Tools to make a query directly, measuring the XHR time.

> 3. Top N for Process with Sparklines

Ran this in the Dev Tools. This took about 10 seconds initially but only 100ms on subsequent requests.

> 4. Top N for Process without Sparklines

Between 50-100ms in the Dev Tools, but I'm not sure if that's accurate since it's probably taking advantage of Query 3's cache.

EDIT: I ran Query 4 again using a timerange from several weeks ago and it took 790ms, so that's probably a more accurate non-cached reading.

@Zacqary I think you can break the cache by changing the host name

@simianhacker Tried that; only playing with the date broke the cache

Updated the AC just now to make sure we group by process.command_line instead of system.process.cmdline due to some migration changes in Metricbeat

@Zacqary in case you didn't try yet, you can also use request_cache=false or set profile to true. They all break different caching mechanisms I think.
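For reference, the shard request cache can be bypassed per request with a query parameter, e.g. (body abbreviated to a minimal valid query):

```json
GET metricbeat-*/_search?request_cache=false
{
  "size": 0,
  "query": { "match_all": {} }
}
```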

You could perhaps also use top_metrics instead of top_hits. It should support keywords now as well and might be faster.
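A sketch of the `meta` sub-aggregation from the queries above rewritten with `top_metrics` instead of `top_hits` (a sketch only; whether `top_metrics` supports each of these field types in the deployed Elasticsearch version would need verifying):

```json
"meta": {
  "top_metrics": {
    "metrics": [
      { "field": "system.process.state" },
      { "field": "user.name" },
      { "field": "process.pid" }
    ],
    "sort": { "@timestamp": "desc" }
  }
}
```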

We're also using sparklines in one of our new pages. I had to separately fetch the sparklines data because I was running into the too_many_buckets exception.

OK so 5-10 seconds for "current query as-is, without sparklines" is still not great, especially compared to the Top N queries without sparklines (which I agree should definitely be queried separately, if we're doing them).

Is there a real reason we want to preserve the pagination for this UX, or are we just sticking to it because we had it originally? If there isn't a great reason to preserve it, I think we should move forward with Top 10 (or Top 15) and just let clicking on the headers sort with a new query. Then we should log a separate ticket to do the sparklines query separately and add those back into the design. We can always rethink the sorting later as well. We'll need help from @hbharding to figure out how to make it clear to users that they're seeing Top 10 by Memory or Top 10 by CPU, but I don't think we should get too hung up on the sorting.

cc @sorantis @simianhacker @sgrodzicki Thoughts?

I'm focusing on the View Trace in APM functionality now, and I want to clarify what we're expecting. Should this button still say "Trace" if it's linking to a service, and not a trace? Or is there a way that we can fetch a relevant trace.id with another ES query?

Also, I'm testing on edge-oblt and querying APM for one of the hostnames in the Inventory view doesn't return any results. Do we always want the "View in APM" button to be present, even if clicking it will return no results like this? Is there a way we can check to make sure clicking the button will actually link to something?

@sqren Should we consider the View in APM button dependent on #84814?

@Zacqary Yes - unless you have another way of linking to APM, you are dependent on #84814. With only 8 working days left before FF it might be hard to squeeze into 7.11 (with all the other stuff we are also trying to cram in) but we'll do our best.

Okay, I've split the APM button into a separate issue in that case: https://github.com/elastic/kibana/issues/84849
