When monitoring is enabled, it changes the way telemetry collectors fire in a problematic way: all of the various collectors fire frequently (every 10 seconds or so) rather than only when the telemetry payload needs to be sent. This is excessive, and the performance impact on the Elasticsearch cluster is non-trivial.
This is roughly how telemetry works when monitoring is disabled, which is the desirable behavior all of the time:
This is roughly how telemetry works when monitoring is enabled, which is bad:
Telemetry should be decoupled entirely from monitoring. Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload. The same process as any other plugin.
Note that the monitoring agent would still send its monitoring data to the monitoring cluster on the background interval as always. The only impact to monitoring here is that it is less tightly coupled to telemetry and as a result the telemetry collectors aren't fired alongside the monitoring agent collection.
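To make the proposal concrete, here is a rough sketch of what a monitoring-owned telemetry collector could look like, assuming a legacy `collectorSet`-style registration; the function name and query details are illustrative, not the actual implementation:

```js
// Rough sketch (not the actual implementation): the monitoring plugin registers a
// usage collector like any other plugin. Its fetch handler only runs when the
// telemetry service assembles the daily payload, so nothing fires on the 10s
// monitoring interval.
export function registerMonitoringUsageCollector(server) {
  // Assumes a legacy collectorSet-style usage service is available on `server`.
  const { collectorSet } = server.usage;

  collectorSet.register(
    collectorSet.makeUsageCollector({
      type: 'monitoring',
      fetch: async (callCluster) => {
        // Hypothetical query: pull whatever the telemetry payload needs from the
        // monitoring indices at payload time instead of collecting every 10s.
        const resp = await callCluster('search', {
          index: '.monitoring-kibana-*',
          size: 0,
          body: { query: { range: { timestamp: { gte: 'now-1d' } } } },
        });
        return { monitoredKibanaDocs: resp.hits.total };
      },
    })
  );
}
```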
I consider the current behavior a bug given the detrimental effect it can have on the Elasticsearch cluster.
Pinging @elastic/kibana-platform
Pinging @elastic/stack-monitoring
@pickypg @tsullivan Please take a look at this and let me know if I'm missing important considerations
> Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload.
This means that any _monitoring_ cluster won't have that data and requires that _every_ cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.
Perhaps a simpler solution is to just dramatically slow down the rate that monitoring collects telemetry (maybe once daily or hourly?) _and_ have the monitoring-telemetry reader allow for a wider historical time range for its latest report.
Even better, the stack monitoring team could incorporate some of that data into the UI.
@Bamieh
> This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.
This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's _not_ how it works today?
> Even better, the stack monitoring team could incorporate some of that data into the UI.
We've talked about this for a long time, but I'm starting to feel like a) the ship is really sailing away on this, and b) sticking a LOT of data in a user's monitoring cluster is not a good forcing function for us to plan how to make the data useful for them.
> If you haven't opted in to telemetry for a Kibana install [of a remote cluster], that Kibana install shouldn't be collecting telemetry
++
> dramatically slow down the rate that monitoring collects telemetry
Seriously, I don't think we should have any Kibana telemetry data in Monitoring. The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.
We still need to address the Kibana Metricbeat collector, which reads from the Kibana stats API every 10 seconds and triggers usage collector fetches. Perhaps it's time to make that a separate API. However, in the scenario where we take telemetry data out of Monitoring, I'm not sure what the consumer of that API would be.
I'll offer some more thoughts later, but one thing to consider with this ask is that we're migrating from internal collection to external collection of monitoring data and have this available for users today. As a result, we plan to deprecate and remove internal shippers in each stack product, including Kibana. We will be removing all code around Kibana sending data directly to a monitoring cluster.
> This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?
Correct. Kibana Monitoring has an all-or-nothing approach to what it collects, so all of this data would be included. That said, if telemetry is not enabled (still opt-in) then it won't send it (and it's wholly a waste).
> The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.
Perhaps, but this means that we'd have less telemetry detail about every Kibana instance except the one that happens to be sending the telemetry, for deployments that have more than one instance -- particularly for anyone that deployed multiple instances for isolation before we introduced Kibana Spaces, since we would not see their data when we live pulled.
We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.
When monitoring is enabled, is the data sent to `.monitoring-*` every 10 sec no matter whether there is a remote monitoring cluster or not, and whether we use Metricbeat, the HTTP exporter, or the local exporter?
> When monitoring is enabled, is the data sent to `.monitoring-*` every 10 sec no matter whether there is a remote monitoring cluster or not, and whether we use Metricbeat, the HTTP exporter, or the local exporter?
Correct. The exporters are only relevant to where that `.monitoring-kibana*` index gets created / written to, though. What matters to Kibana is: is collection enabled? Yes? Then it's going to fetch the data every interval (technically configurable, but defaults to 10s).
To @chrisronline's point, I think this issue title should be changed. Monitoring in Kibana doesn't really "interact" with telemetry; it just so happens there is a usage service in Kibana that allows clients to register with it, and the service will call `fetch` on all the collectors every 10 seconds. Whether that is happening because Monitoring is enabled and internal collection is kicking off, or not (meaning an external Metricbeat is calling an API every 10 seconds and triggering data collection each time), it ends up being a lot of queries, and that is the issue.
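To make that pattern concrete, here is a rough sketch of a usage service like the one described, with hypothetical class and method shapes rather than the actual Kibana implementation:

```js
// Rough sketch of the pattern described above, not the actual Kibana code.
// Every trigger (the internal 10s collection or an external Metricbeat hit on
// the stats API) fans out to every registered collector, so each tick can cost
// one or more Elasticsearch queries per collector.
class CollectorSet {
  constructor() {
    this.collectors = [];
  }

  register(collector) {
    this.collectors.push(collector);
  }

  // Called on every trigger; this fan-out is where the query volume comes from.
  async bulkFetch(callCluster) {
    return Promise.all(
      this.collectors.map(async (collector) => ({
        type: collector.type,
        result: await collector.fetch(callCluster),
      }))
    );
  }
}

// Internal collection fires this on an interval (default 10 seconds):
// setInterval(() => collectorSet.bulkFetch(callCluster), 10000);
```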
> We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.
Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.
> Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.
Currently, we only know if the ES cluster has multiple Kibana indices in it (and therefore multiple instances running) when we pull Kibana stats from the Monitoring data. So we can get that number as it is now, but the solution I'm proposing would effectively do away with our ability to do that going forward.
How does everyone feel about removing this collection behavior now and reintroducing it when:
1) we have a UI
2) fetching usage doesn't require a downstream request for each metric
I think it's especially important to remove this now before anything is built on it. The usage architecture needs to be reorganized to support more metrics at the monitoring interval. A longer interval may be worth exploring, but removal sounds more straightforward in the short term and will give us time to take a step back.
@chrisronline @tsullivan @pickypg We have decided to add another interval (probably once per day) rather than remove the collection behavior. Do you have any thoughts on what it would look like to add a separate interval? Would it be a significant restructure?
I'm happy to take on the effort, but I'm not sure what @tsullivan had in mind exactly. Do we just setup a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?
Did we ever have issues with tens of thousands of Kibana browsers consuming the telemetry service concurrently?
> Do we just setup a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?
It would be a once-per-day collection that fetches Stats collectors as well as Usage collectors.
Usage collectors can be filtered out here: https://github.com/elastic/kibana/blob/0835cd30cab40f6f73a32a95834e871213ac9b0c/x-pack/plugins/monitoring/server/kibana_monitoring/bulk_uploader.js#L68 with `c => c instanceof UsageCollector`.
Here's where other code does that kind of filter for getting a set of only the Usage collectors: https://github.com/elastic/kibana/blob/4aa2a4c05f128f09e113d29f8a31557c1875717f/src/legacy/server/usage/classes/collector_set.js#L104
Maybe it doesn't have to be a separate interval, but maybe the interval could check on a cached flag to determine if it's time to do a full fetch, and it would only do that once per day.
Trying to do `c instanceof UsageCollector` from the x-pack side might be hard, because you'd have to import the class over from core. Maybe usage collectors just need an automatic flag that can be checked, similar to the `ignoreForInternalUploader` trick: `c => c.ignoreForInternalUploader !== true && c.isUsageCollector !== true`
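A minimal sketch of that flag-based filter, combined with the once-per-day idea above; `isUsageCollector` is the proposed (hypothetical) property and the uploader shape is assumed, not taken from `bulk_uploader.js`:

```js
// Sketch only: a bulk-uploader-style filter that skips usage collectors on the
// frequent interval and only includes them when a once-per-day check says so.
// `isUsageCollector` and `filterCollectorsForThisTick` are hypothetical names.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;
let lastUsageCollection = 0;

function filterCollectorsForThisTick(collectors, now = Date.now()) {
  const shouldCollectUsage = now - lastUsageCollection >= ONE_DAY_MS;
  if (shouldCollectUsage) {
    lastUsageCollection = now;
  }

  return collectors.filter((c) => {
    if (c.ignoreForInternalUploader === true) return false; // existing trick
    if (c.isUsageCollector === true) return shouldCollectUsage; // usage: at most daily
    return true; // stats collectors: every tick (default 10s)
  });
}
```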
Okay I think I have a plan for this and want to run it by everyone. This solution will ensure that Kibana monitoring collection will only poll for usage collector data at the same interval as the normal telemetry service.

- Add a flag (`isUsageCollector`) to each usage collector, as an actual property similar to `ignoreForInternalUploader`, so monitoring can know it's a usage collector (see why we have to do it this way)
- Modify the Kibana monitoring collection code to _ONLY_ collect from usage collectors at the same rate as default telemetry (this will be kept in sync by exposing the interval through the `xpack_main` plugin). However, the monitoring code will pull from stats collectors at the current interval (defaults to `10s`) to ensure the monitoring service and UIs continue to function properly. This will mean that some `.monitoring-kibana-*` documents will contain a `usage` field, and some will not.
- Update the `.monitoring-kibana` index template to include mappings for the `usage` field (it currently does not). Doing this will require a version dance, as we want to ensure that folks don't run into a situation where Kibana assumes this mapping exists but it doesn't yet in ES. This shouldn't happen, as we explicitly recommend the monitoring cluster be at least the same version as all monitored clusters.
- Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the `usage` field, to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry). A sketch of such a query is below.

I have this code working locally and will put up the PR if folks are happy with this approach.
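A rough sketch of what that last step's filtered query could look like; the index pattern, `type` value, and time range are illustrative assumptions, not taken from the actual monitoring code:

```js
// Illustrative only: require the `usage` field so documents written on the
// frequent stats-only interval are skipped when telemetry reads from monitoring.
// Index pattern, `type` value, and time range here are assumptions.
const usageDocsQuery = {
  index: '.monitoring-kibana-*',
  size: 1,
  body: {
    sort: [{ timestamp: { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          { term: { type: 'kibana_stats' } },
          { exists: { field: 'usage' } },
          { range: { timestamp: { gte: 'now-1d' } } },
        ],
      },
    },
  },
};
```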
> Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data

Is this why mappings need to be included for the `usage` field?
@tsullivan Correct. We need some way to create a query that ignores documents that do not contain this field.
Is `usage` an object field? Could it just be mapped as `object`? We don't need the inner data to be mapped, correct?
@tsullivan Theoretically, yes. I tried it and was unsuccessful in getting the `exists` query to work unless I mapped a field inside of the object, but it's entirely possible there is some setting that could make this work. Doesn't matter to me either way, but we will need some mapping change.
@chrisronline an `exists: usage` query should work if `usage` is just mapped as `type: 'object'`. I ran a quick test. LMK if you'd like to take it offline.

I'm asking this to avoid having to map any part of the combined data that gets returned from fetching all the collectors.
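A minimal sketch of the mapping change being discussed, assuming the field is named `usage` exactly as above; per the exchange here, whether a bare `exists` filter matches may depend on at least one field under `usage` actually being indexed, so this would need to be verified against a real monitoring index:

```js
// Sketch of the mapping side of this: declare `usage` as a plain object in the
// .monitoring-kibana index template so an exists filter can target it, without
// mapping any of the combined collector output underneath it. The template name
// and whether exists matches with no inner fields mapped are assumptions to verify.
const monitoringKibanaTemplatePatch = {
  mappings: {
    properties: {
      usage: { type: 'object' }, // inner fields intentionally left unmapped
    },
  },
};
```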
Pinging @elastic/kibana-stack-services
Pinging @elastic/pulse (Team:Pulse)