When monitoring is enabled, it changes the way telemetry collectors fire in a problematic way: all of the various collectors fire frequently (every 10 seconds or so) rather than only when the telemetry payload needs to be sent. This is excessive, and the performance impact on the Elasticsearch cluster is non-trivial.
This is roughly how telemetry works when monitoring is disabled, which is the desirable behavior all of the time:
This is roughly how telemetry works when monitoring is enabled, which is bad:
Telemetry should be decoupled entirely from monitoring. Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload. The same process as any other plugin.
Note that the monitoring agent would still send its monitoring data to the monitoring cluster on the background interval as always. The only impact to monitoring here is that it is less tightly coupled to telemetry and as a result the telemetry collectors aren't fired alongside the monitoring agent collection.
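To make the proposal concrete, here is a rough sketch of what a monitoring-owned telemetry collector could look like, assuming a legacy `collectorSet`-style registration; the function name and query details are illustrative, not the actual implementation:

```js
// Rough sketch (not the actual implementation): the monitoring plugin registers a
// usage collector like any other plugin. Its fetch handler only runs when the
// telemetry service assembles the daily payload, so nothing fires on the 10s
// monitoring interval.
export function registerMonitoringUsageCollector(server) {
  // Assumes a legacy collectorSet-style usage service is available on `server`.
  const { collectorSet } = server.usage;

  collectorSet.register(
    collectorSet.makeUsageCollector({
      type: 'monitoring',
      fetch: async (callCluster) => {
        // Hypothetical query: pull whatever the telemetry payload needs from the
        // monitoring indices at payload time instead of collecting every 10s.
        const resp = await callCluster('search', {
          index: '.monitoring-kibana-*',
          size: 0,
          body: { query: { range: { timestamp: { gte: 'now-1d' } } } },
        });
        return { monitoredKibanaDocs: resp.hits.total };
      },
    })
  );
}
```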
I consider the current behavior a bug given the detrimental effect it can have on the Elasticsearch cluster.
Pinging @elastic/kibana-platform
Pinging @elastic/stack-monitoring
@pickypg @tsullivan Please take a look at this and let me know if I'm missing important considerations
> Monitoring should have a telemetry collector just like any other plugin, and when telemetry is requested as part of the daily payload to the Elastic telemetry service, that collector should grab the necessary data from the monitoring cluster and include it in the telemetry payload.
This means that any _monitoring_ cluster won't have that data and requires that _every_ cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.
Perhaps a simpler solution is to just dramatically slow down the rate that monitoring collects telemetry (maybe once daily or hourly?) _and_ have the monitoring-telemetry reader allow for a wider historical time range for its latest report.
Even better, the stack monitoring team could incorporate some of that data into the UI.
@Bamieh
> This means that any monitoring cluster won't have that data and requires that every cluster has opted into telemetry in order to collect the data that previously would have been in monitoring.
This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's _not_ how it works today?
> Even better, the stack monitoring team could incorporate some of that data into the UI.
We've talked about this for a long time, but I'm starting to feel like a) the ship is really sailing away on this, and b) sticking a LOT of data in a user's monitoring cluster is not a good forcing function for us to plan how to make the data useful for them.
> If you haven't opted in to telemetry for a Kibana install [of a remote cluster], that Kibana install shouldn't be collecting telemetry
++
> dramatically slow down the rate that monitoring collects telemetry
Seriously, I don't think we should have any Kibana telemetry data in Monitoring. The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.
We still need to address the Kibana Metricbeat collector, which reads from the Kibana stats API every 10 seconds and triggers usage collector fetches. Perhaps it's time to make that a separate API. However, in the scenario where we take telemetry data out of Monitoring, I'm not sure what the consumer of that API would be.
I'll offer some more thoughts later, but one thing to consider with this ask is that we're migrating from internal collection to external collection of monitoring data and have this available for users today. As a result, we plan to deprecate and remove internal shippers in each stack product, including Kibana. We will be removing all code around Kibana sending data directly to a monitoring cluster.
> This seems like the right behavior to me. If you haven't opted in to telemetry for a Kibana install, that Kibana install shouldn't be collecting telemetry. Are you saying that's not how it works today?
Correct. Kibana Monitoring has an all-or-nothing approach to what it collects, so all of this data would be included. That said, if telemetry is not enabled (still opt-in) then it won't send it (and it's wholly a waste).
> The way Kibana's telemetry data gets collected could work just as it does when Monitoring is disabled: live pull.
Perhaps, but this means that we'd have less telemetry detail about every Kibana instance except the one that happens to be sending the telemetry, for deployments that have more than one instance -- particularly for anyone that deployed multiple instances for isolation before we introduced Kibana Spaces, since we would not see their data when we live pulled.
We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.
When monitoring is enabled, is the data sent to `.monitoring-*` every 10 sec no matter whether there is a remote monitoring cluster or not, and whether we use Metricbeat, the HTTP exporter, or the local exporter?
> When monitoring is enabled, is the data sent to `.monitoring-*` every 10 sec no matter whether there is a remote monitoring cluster or not, and whether we use Metricbeat, the HTTP exporter, or the local exporter?
Correct. The exporters are only relevant to where that `.monitoring-kibana*` index gets created / written to, though. What matters to Kibana is: is collection enabled? Yes? Then it's going to fetch the data every interval (technically configurable, but defaults to 10s).
To @chrisronline's point, I think this issue title should be changed. Monitoring in Kibana doesn't really "interact" with telemetry; it just so happens there is a usage service in Kibana that allows clients to register with it, and the service will call `fetch` on all the collectors every 10 seconds. Whether that is happening because Monitoring is enabled and internal collection is kicking off, or not (meaning an external Metricbeat is calling an API every 10 seconds and triggering data collection each time), it ends up being a lot of queries, and that is the issue.
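To make that pattern concrete, here is a rough sketch of a usage service like the one described, with hypothetical class and method shapes rather than the actual Kibana implementation:

```js
// Rough sketch of the pattern described above, not the actual Kibana code.
// Every trigger (the internal 10s collection or an external Metricbeat hit on
// the stats API) fans out to every registered collector, so each tick can cost
// one or more Elasticsearch queries per collector.
class CollectorSet {
  constructor() {
    this.collectors = [];
  }

  register(collector) {
    this.collectors.push(collector);
  }

  // Called on every trigger; this fan-out is where the query volume comes from.
  async bulkFetch(callCluster) {
    return Promise.all(
      this.collectors.map(async (collector) => ({
        type: collector.type,
        result: await collector.fetch(callCluster),
      }))
    );
  }
}

// Internal collection fires this on an interval (default 10 seconds):
// setInterval(() => collectorSet.bulkFetch(callCluster), 10000);
```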
> We still need to pull content from Monitoring, whenever possible, because otherwise Logstash / Beats / APM Server would be invisible to us. I don't see a big downside to only live pulling from Kibana in either case, but we need to be aware that it comes with a loss of detail about other Kibana instances' unique usage stats.
Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.
> Could the telemetry data tell us how common that case is (multi-kibana instances in a deployment)? If the number is small then it sounds like a good tradeoff.
Currently, we only know if the ES cluster has multiple Kibana indices in it (and therefore multiple instances running) when we pull Kibana stats from the Monitoring data. So we can get that number as it is now, but the solution I'm proposing would effectively do away with our ability to do that going forward.
How does everyone feel about removing this collection behavior now and reintroducing it when:
1) we have a UI
2) fetching usage doesn't require a downstream request for each metric
I think it's especially important to remove this now before anything is built on it. The usage architecture needs to be reorganized to support more metrics at the monitoring interval. A longer interval may be worth exploring, but removal sounds more straightforward in the short term and will give us time to take a step back.
@chrisronline @tsullivan @pickypg We have decided to add another interval (probably once per day) rather than remove the collection behavior. Do you have any thoughts on what it would look like to add a separate interval? Would it be a significant restructure?
I'm happy to take on the effort, but I'm not sure what @tsullivan had in mind exactly. Do we just setup a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?
Did we ever have issues with tens of thousands of Kibana browsers consuming the telemetry service concurrently?
> Do we just setup a separate interval on the server to collect and ship monitoring data, assuming monitoring is enabled?
It would be a once-per-day collection that fetches Stats collectors as well as Usage collectors.
Usage collectors can be filtered out here: https://github.com/elastic/kibana/blob/0835cd30cab40f6f73a32a95834e871213ac9b0c/x-pack/plugins/monitoring/server/kibana_monitoring/bulk_uploader.js#L68 with `c => c instanceof UsageCollector`.
Here's where other code does that kind of filter for getting a set of only the Usage collectors: https://github.com/elastic/kibana/blob/4aa2a4c05f128f09e113d29f8a31557c1875717f/src/legacy/server/usage/classes/collector_set.js#L104
Maybe it doesn't have to be a separate interval, but maybe the interval could check on a cached flag to determine if it's time to do a full fetch, and it would only do that once per day.
Trying to do `c instanceof UsageCollector` from the x-pack side might be hard, because you'd have to import the class over from core. Maybe usage collectors just need an automatic flag that can be checked, similar to the `ignoreForInternalUploader` trick: `c => c.ignoreForInternalUploader !== true && c.isUsageCollector !== true`
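A minimal sketch of that flag-based filter, combined with the once-per-day idea above; `isUsageCollector` is the proposed (hypothetical) property and the uploader shape is assumed, not taken from `bulk_uploader.js`:

```js
// Sketch only: a bulk-uploader-style filter that skips usage collectors on the
// frequent interval and only includes them when a once-per-day check says so.
// `isUsageCollector` and `filterCollectorsForThisTick` are hypothetical names.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;
let lastUsageCollection = 0;

function filterCollectorsForThisTick(collectors, now = Date.now()) {
  const shouldCollectUsage = now - lastUsageCollection >= ONE_DAY_MS;
  if (shouldCollectUsage) {
    lastUsageCollection = now;
  }

  return collectors.filter((c) => {
    if (c.ignoreForInternalUploader === true) return false; // existing trick
    if (c.isUsageCollector === true) return shouldCollectUsage; // usage: at most daily
    return true; // stats collectors: every tick (default 10s)
  });
}
```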
Okay I think I have a plan for this and want to run it by everyone. This solution will ensure that Kibana monitoring collection will only poll for usage collector data at the same interval as the normal telemetry service.

- Add a flag (`isUsageCollector`) to each usage collector, as an actual property similar to `ignoreForInternalUploader`, so monitoring can know it's a usage collector (see why we have to do it this way)
- Modify the Kibana monitoring collection code to _ONLY_ collect from usage collectors at the same rate as default telemetry (this will be kept in sync by exposing the interval through the `xpack_main` plugin). However, the monitoring code will pull from stats collectors at the current interval (defaults to `10s`) to ensure the monitoring service and UIs continue to function properly. This will mean that some `.monitoring-kibana-*` documents will contain a `usage` field, and some will not.
- Update the `.monitoring-kibana` index template to include mappings for the `usage` field (it currently does not). Doing this will require a version dance, as we want to ensure that folks don't run into a situation where Kibana assumes this mapping exists but it doesn't yet in ES. This shouldn't happen, as we explicitly recommend the monitoring cluster be at least the same version as all monitored clusters.
- Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the `usage` field, to ensure it only fetches documents that actually contain usage data (which will now occur at the same interval as default telemetry). A sketch of such a query is below.

I have this code working locally and will put up the PR if folks are happy with this approach.
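A rough sketch of what that last step's filtered query could look like; the index pattern, `type` value, and time range are illustrative assumptions, not taken from the actual monitoring code:

```js
// Illustrative only: require the `usage` field so documents written on the
// frequent stats-only interval are skipped when telemetry reads from monitoring.
// Index pattern, `type` value, and time range here are assumptions.
const usageDocsQuery = {
  index: '.monitoring-kibana-*',
  size: 1,
  body: {
    sort: [{ timestamp: { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          { term: { type: 'kibana_stats' } },
          { exists: { field: 'usage' } },
          { range: { timestamp: { gte: 'now-1d' } } },
        ],
      },
    },
  },
};
```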
> Update the query for collecting usage data from monitoring indices for telemetry to filter out documents that do not contain the usage field to ensure it only fetches documents that actually contain usage data

Is this why mappings need to be included for the `usage` field?
@tsullivan Correct. We need some way to create a query that ignores documents that do not contain this field.
Is `usage` an object field? Could it just be mapped as `object`? We don't need the inner data to be mapped, correct?
@tsullivan Theoretically, yes. I tried it and was unsuccessful in getting the `exists` query to work unless I mapped a field inside of the object, but it's entirely possible there is some setting that could make this work. Doesn't matter to me either way, but we will need some mapping change.
@chrisronline an `exists: usage` query should work if `usage` is just mapped as `type: 'object'`. I ran a quick test. LMK if you'd like to take it offline.

I'm asking this to avoid having to map any part of the combined data that gets returned from fetching all the collectors.
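A minimal sketch of the mapping change being discussed, assuming the field is named `usage` exactly as above; per the exchange here, whether a bare `exists` filter matches may depend on at least one field under `usage` actually being indexed, so this would need to be verified against a real monitoring index:

```js
// Sketch of the mapping side of this: declare `usage` as a plain object in the
// .monitoring-kibana index template so an exists filter can target it, without
// mapping any of the combined collector output underneath it. The template name
// and whether exists matches with no inner fields mapped are assumptions to verify.
const monitoringKibanaTemplatePatch = {
  mappings: {
    properties: {
      usage: { type: 'object' }, // inner fields intentionally left unmapped
    },
  },
};
```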
Pinging @elastic/kibana-stack-services
Pinging @elastic/pulse (Team:Pulse)