Kibana: [Observability] Homepage experience (Milestone 1)

Created on 3 Jun 2020 · 16 Comments · Source: elastic/kibana

Summary

As a continuation of https://github.com/elastic/kibana/issues/66931, we're looking to add a new view that will serve as the overview page when users have existing data available for any of Logs, Metrics, APM or Uptime.

Design proposal

▶️ Figma prototype

01 Observability - Overview - Full

Chart panels

The overview page will consist of a number of sections per area of Observability, each containing a number of chart visualizations that will be based on a high-level data query e.g. the number of log events by log source.

Logs

The proposed chart panel for logs will be a log rate histogram grouped by log source. We will look for available indices matching the default setup for the Logs app. The list of lookups will be expanded as we investigate which other indices would be interesting to auto-detect and visualize, for example from third-party log vendors where we know the data is ECS compatible.

Data query

The log rate visualization already exists in the Log rate tab in the Logs app.

Screenshot 2020-06-04 at 09 51 34

We will use the configured log indices in the Log settings.

Screenshot 2020-06-04 at 09 53 18

_TODO: Perhaps include an example ES query to get the same data_
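
One possible shape for such a query, as a minimal sketch rather than the Logs app's actual implementation. The helper name `buildLogRateQuery` is hypothetical, and the field names `@timestamp` and `event.dataset` are assumptions based on ECS defaults; the real log source field may differ.

```typescript
// Sketch only: the helper name and field names are assumptions,
// not taken from the Logs app source.
function buildLogRateQuery(start: string, end: string, bucketSize: string) {
  return {
    size: 0, // aggregations only, no hits
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: start, lte: end } } }],
      },
    },
    aggs: {
      log_rate: {
        // One bucket per interval, e.g. "30m"
        date_histogram: { field: "@timestamp", fixed_interval: bucketSize },
        aggs: {
          // Split each time bucket by log source (assumed field)
          by_source: { terms: { field: "event.dataset", size: 10 } },
        },
      },
    },
  };
}

// Example body for a one-day range with 30-minute buckets.
const logRateBody = buildLogRateQuery(
  "2020-06-01T00:00:00Z",
  "2020-06-02T00:00:00Z",
  "30m"
);
```

The resulting body would be sent to the `_search` endpoint of whichever indices are configured in the Log settings.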

Metrics

Screenshot 2020-06-23 at 19 38 02

The metrics section will consist of a chart panel based on system metrics aggregated for hosts only. Kubernetes and container metrics will be looked at in future iterations.

The different aggregates will show:

  • Number of hosts
  • CPU usage (used vs. available)
  • Memory usage (used vs. available)
  • Disk used (used vs. allocated)
  • Inbound traffic MB/s
  • Outbound traffic MB/s

The progress bar visualization will indicate used vs. capacity.

Data query

_TODO: Show an ES query example to get the aforementioned data_
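
As a rough starting point, the host count and the used vs. available ratios could be fetched with a single aggregation request. This is a sketch under assumptions: `buildHostMetricsQuery` is a hypothetical helper, and the field names follow Metricbeat's system module as far as I know it, so they should be checked against the actual mappings. The traffic metrics are left out because they typically come from cumulative counters, which need a derivative over time rather than a plain average.

```typescript
// Sketch only: helper name and field names are assumptions
// (Metricbeat system module conventions), not confirmed by this issue.
function buildHostMetricsQuery(start: string, end: string) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: start, lte: end } } }],
      },
    },
    aggs: {
      // Number of distinct hosts reporting in the time range
      host_count: { cardinality: { field: "host.name" } },
      // Ratios between 0 and 1; multiply by 100 for a percentage
      cpu_used: { avg: { field: "system.cpu.total.norm.pct" } },
      memory_used: { avg: { field: "system.memory.actual.used.pct" } },
      disk_used: { avg: { field: "system.filesystem.used.pct" } },
    },
  };
}

const hostMetricsBody = buildHostMetricsQuery(
  "2020-06-01T00:00:00Z",
  "2020-06-02T00:00:00Z"
);
```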

APM

The APM data panel will show the number of services, transactions and error rate.

Data query

  • Aggregate total number of services
  • Aggregate total number of processor.event: transaction
  • Aggregate error rate across aggregate of processor.event: transaction
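
The counts listed above could be collected in one request, sketched here under assumptions. `buildApmOverviewQuery` is a hypothetical helper, `service.name` is assumed to identify a service, and counting `processor.event: error` documents is only one possible input for the error rate. How the error rate is derived from these counts is not specified in this issue, so the sketch returns raw counts only.

```typescript
// Sketch only: helper name, the service.name field and the use of
// processor.event: error are assumptions, not confirmed by this issue.
function buildApmOverviewQuery(start: string, end: string) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: start, lte: end } } }],
      },
    },
    aggs: {
      // Total number of distinct services
      services: { cardinality: { field: "service.name" } },
      // Named filter buckets: doc_count per event type
      events: {
        filters: {
          filters: {
            transactions: { term: { "processor.event": "transaction" } },
            errors: { term: { "processor.event": "error" } },
          },
        },
      },
    },
  };
}

const apmOverviewBody = buildApmOverviewQuery(
  "2020-06-01T00:00:00Z",
  "2020-06-02T00:00:00Z"
);
```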

Uptime

The Uptime panel will show the number of pings over time grouped by up / down status. The stats will show the total number of monitors and show the number of up and down.

Data query

The uptime monitors visualization already exists in Uptime.

Screenshot 2020-06-04 at 11 25 28

  • Aggregate number of pings grouped by up and down
  • Aggregate total number of monitors
  • Aggregate number of monitors reporting "up"
  • Aggregate number of monitors reporting "down"

_TODO: Show example of ES query_
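
A possible starting point for the pings histogram and the total monitor count, as a sketch only. `buildUptimeOverviewQuery` is a hypothetical helper, and `monitor.id` and `monitor.status` are assumed field names. Counting monitors that are currently up or down would additionally require the latest ping per monitor, which this sketch does not cover.

```typescript
// Sketch only: helper name and field names are assumptions.
function buildUptimeOverviewQuery(start: string, end: string, bucketSize: string) {
  return {
    size: 0, // aggregations only
    query: {
      bool: {
        filter: [{ range: { "@timestamp": { gte: start, lte: end } } }],
      },
    },
    aggs: {
      // Total number of distinct monitors
      monitors: { cardinality: { field: "monitor.id" } },
      // Pings over time, split by up / down status
      pings_over_time: {
        date_histogram: { field: "@timestamp", fixed_interval: bucketSize },
        aggs: {
          by_status: { terms: { field: "monitor.status" } },
        },
      },
    },
  };
}

const uptimeOverviewBody = buildUptimeOverviewQuery(
  "2020-06-01T00:00:00Z",
  "2020-06-02T00:00:00Z",
  "30m"
);
```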

Alerts and alerts activity

Alerts chart

The alerts section will consist of two panels. The first, Alerts distribution, will show the number of alerts triggered, grouped by type.

Alerts activity

The second panel will focus on showing recent activity. Each alert will link to its alert detail page and show its tags and the total number of alert instances within the selected time range.

Resources

As another section of content, we will provide users with options to go straight to the documentation, discussion forum or training resources.

News feed

The news feed will consist of Observability related blog posts and other industry-related stories.

Kibana news feeds can be set up by providing a .yml feed in the Newsfeed repository and using the Kibana news feed services to show the content.

Observability Landing - Milestone 1 Observability apm v7.9.0

All 16 comments

Pinging @elastic/apm-ui (Team:apm)

Design update - 9 June 2020

We received some feedback on some specific areas of the design, so I've updated the examples above. Here's a quick changelog:

view-in-app

  • Updated the Metrics data panels with a KPI style for the traffic metrics as well and added the number of hosts as per feedback from @sorantis and @cyrille-leclerc
  • Replaced the section "add data" links with a "view in app" option that will link to each individual app for further investigation.

I've also put together a quick responsive layout example of how we want to let the data panels grow while retaining a fixed width for the alert column. Allowing the primary data panels (logs, metrics etc.) to grow means inspecting the visualizations becomes easier on larger screens, whereas the alert visualization and activity feed don't necessarily benefit all that much from growing larger.

00 Overview - Responsive layout guideline

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will be considered in future iterations.

It's worth noting that the initial scope for Metrics is Hosts only. Kubernetes and containers will be considered in future iterations.

Thanks @sorantis, I've made a note of it in the Metrics section of the description, along with more specifics about each metric. I also updated the traffic metrics to no longer show a progress bar visualization, because we'll simply show the aggregated traffic metrics (not a typical used vs. allocated comparison). The progress bar was left over from copying the same stat component as the others; I forgot to remove it.

@formgeist what do you think about adding a tiny graph underneath the traffic metrics instead of a progress bar?

@formgeist what do you think about adding a tiny graph underneath the traffic metrics instead of a progress bar?

We should be able to graph a time-series chart underneath, so here's an example of using a sparkline for the traffic metrics.

Metrics

Thoughts?

  • Aggregate total number of processor.event: transaction
  • Aggregate error rate across aggregate of processor.event: transaction

As we're also showing the rate of errors, maybe a more useful metric than the total number of transactions would be the rate of transactions per minute. This metric is less dependent on a secondary context (the selected time frame) and thus easier to understand, as it stands on its own without having to check what the time range is. This gets especially complicated if a time range in the past is used, where you'd have to do some arithmetic to know how many hours are in that time range.

The same considerations apply to the log rate widget.
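
To illustrate the arithmetic involved, here is a minimal sketch of turning a total count into a per-minute rate. The function name `ratePerMinute` and its millisecond timestamp parameters are hypothetical, chosen for illustration only.

```typescript
// Sketch only: converts a total count over a time range into an average
// per-minute rate. Name and parameters are hypothetical.
function ratePerMinute(totalCount: number, startMs: number, endMs: number): number {
  const minutes = (endMs - startMs) / 60000;
  if (minutes <= 0) {
    throw new RangeError("endMs must be after startMs");
  }
  return totalCount / minutes;
}

// 7200 transactions over a two-hour range average 60 per minute.
const exampleRate = ratePerMinute(
  7200,
  Date.parse("2020-07-01T00:00:00Z"),
  Date.parse("2020-07-01T02:00:00Z")
);
```

The same total would read very differently over a 15-minute or a 7-day range, which is why the normalized value is easier to interpret without checking the time picker.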

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute

I agree, this would be easier to understand. This is also aligned with what we already show in APM.

@felixbarny @sqren I think both suggestions are very reasonable - let's make sure to change the data contracts with the Logs team to be able to provide log rate per second/minute instead of the aggregate count. @cauemarcondes Will you open a new issue for this with the Logs UI team?

maybe a more useful metric than the total number of transactions would be the rate of transactions per minute [...] The same considerations apply to the log rate widget.

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

By using the log rate per minute instead of the count we will show two very similar metrics in two places. If we use the count, we show both total volume and rate (which, more is better, right? right?).

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

As I understand it, the visualization we've been referencing in the design is the Log entries visualization.

Screenshot 2020-07-01 at 15 40 46

The challenge I see is that the bucket size is not dynamic in the current logs visualization, it's fixed to 15 minute buckets. Not sure about the reasoning behind that decision? And if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for the user to correlate the two charts if they want to. Maybe that's just because I'm not all that familiar with the topic of logs and rate.

it's fixed to 15 minute buckets. Not sure about the reasoning behind that decision?

I think it's related to how the ML job process the log entries, _but don't quote me on that_. @weltenwort can probably give you the right answer.

if we add the Transaction rate for APM, which will be dynamic down to per minute, it'll be hard for the user to correlate the two charts if they want to.

The query I'm writing for the dashboard will use whatever startTime, endTime and bucketSize are passed as parameters. I assume other plugins will use the same parameters, so the graphs should all be equivalent for the provided time range.

Edit: Ongoing work for the query https://github.com/elastic/kibana/pull/70413

Yeah what @afgomez said -- you can't really use the existing chart as a reference because it's tied completely to ML, and we are building something for the overview page that doesn't use ML at all for this rate.

I think it's related to how the ML job process the log entries, but don't quote me on that. @weltenwort can probably give you the right answer.

Off-topic: We also ran into this for APM. We went a little overboard and interpolate the ML values when the buckets are smaller than 15 minutes so they fit with our APM data - I don't think this is necessary but it's nice now that we have it.

it's fixed to 15 minute bucket

I also thought that was the case, but it turns out the bucket size is dynamic (in this case the bucket size is 5265 minutes):

86279206-f28f5b00-bbd9-11ea-9270-3482d9c199f0

So perhaps the text that says "Bucket span: 15 minutes" should be updated to avoid confusion?

The date histogram shows already a "Log rate per bucket size". From a user perspective, isn't that enough to get an idea of the average rate?

Especially if there's a lot of variability in the chart, it's not always that easy to know what the average is. If, for example, you'd want to compare the average log rate before vs after a release it will be really helpful to have that on the chart vs the user having to calculate that based on all data points in the chart.

Is there a use case that I'm missing? Is the log rate per minute (vs per bucket size) such an interesting metric that deserves to exist on its own?

I think it's even a benefit in terms of consistency if both the single metric and the date histogram chart refer to the exact same metric. I've seen this as a common practice in other dashboarding tools where you'd have certain metrics, like avg, min, max, in the legend for a graph next to the color and the label for the line. That's basically condensing all the values in the chart to a single value.

The challenge I see is that the bucket size is not dynamic in the current logs visualization, it's fixed to 15 minute buckets.

I think that ideally, the metric should be the same for the overall metric count and the metric shown in the date histogram chart. Maybe it's just me but I prefer to have normalized values that don't change as you change the date range. For example, instead of showing the number of total logs per bucket, we may normalize it to log rate per minute, no matter if the bucket size is 1m, 15m, or 5265m.
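
A minimal sketch of that normalization, assuming each histogram bucket carries a document count and that the bucket size is known in minutes. The `CountBucket` and `RateBucket` types and the `toRatePerMinute` function are hypothetical names for illustration.

```typescript
// Sketch only: types and function name are hypothetical.
interface CountBucket {
  key: number; // bucket start, epoch milliseconds
  docCount: number; // number of log entries in the bucket
}

interface RateBucket {
  key: number;
  ratePerMinute: number;
}

// Divides each bucket's count by the bucket size so the plotted value
// keeps the same unit (entries per minute) for any bucket size.
function toRatePerMinute(buckets: CountBucket[], bucketSizeMinutes: number): RateBucket[] {
  if (bucketSizeMinutes <= 0) {
    throw new RangeError("bucketSizeMinutes must be positive");
  }
  return buckets.map((bucket) => ({
    key: bucket.key,
    ratePerMinute: bucket.docCount / bucketSizeMinutes,
  }));
}

// 150 entries in a 15m bucket and 52650 entries in a 5265m bucket
// both normalize to the same rate of 10 entries per minute.
const fromSmallBucket = toRatePerMinute([{ key: 0, docCount: 150 }], 15);
const fromLargeBucket = toRatePerMinute([{ key: 0, docCount: 52650 }], 5265);
```

With this normalization, changing the date range changes the bucket size but not the scale of the values shown.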

We had a Zoom call to discuss the above feedback and next steps. We decided to continue with showing the log rate at a fixed rate (per minute). @afgomez will handle the changes in https://github.com/elastic/kibana/pull/70413
