Kibana: Telemetry & Monitoring: Kibana Monitoring & BulkUploader

Created on 12 Jun 2020 · 12 Comments · Source: elastic/kibana

Lately we've noticed that when adding some cluster-level stats (#68603 and #64935), those collectors are not registered as Usage or Stats collectors in Kibana, because they are not supposed to be reported as part of the stack_stats.kibana.plugins payload.

This results in missing some information when monitoring is enabled. As far as I could understand from taking a look at the code in x-pack/plugins/monitoring, the information reported to the Monitoring cluster is collected via the code in x-pack/plugins/monitoring/server/kibana_monitoring. More specifically, in the bulk_uploader.js file.

I'm creating this issue to review the logic in BulkUploader to:

Stack Monitoring Telemetry Meta KibanaTelemetry Monitoring

All 12 comments

Pinging @elastic/stack-monitoring (Team:Monitoring)

Pinging @elastic/pulse (Team:KibanaTelemetry)

When monitoring is enabled, cluster_stats usage data is read from the .monitoring-es-* indices. This means that any usage data not collected from Kibana needs to be added to those indices (pushed by Elasticsearch and Beats) in order to retain parity between local and monitoring collection. We need to decide if this is appropriate and, if not, determine the best path forward for monitoring-shipped usage data.

@Bamieh
I ran a terms agg to see the % of usage data that's reported through local and monitoring collection:

{
  "aggs" : {
    "telemetry_collection" : {
      "terms" : { "field" : "stack_stats.xpack.monitoring.collection_enabled" }
    }
  }
}

Roughly 25% of data that's reported is through monitoring.
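As a rough sketch of how a percentage like this can be derived from a terms aggregation response: the helper name `monitoringCollectionShare` and the bucket counts below are hypothetical, chosen only for illustration. It assumes the field is mapped as a boolean, in which case Elasticsearch returns buckets keyed 1/0 with `key_as_string` set to "true"/"false". Documents that lack the field do not appear in any bucket, so they are not counted here.

```typescript
// Shape of one bucket in a terms aggregation response. For a boolean field,
// Elasticsearch returns key 1/0 plus key_as_string "true"/"false".
interface TermsBucket {
  key: number;
  key_as_string?: string;
  doc_count: number;
}

// Hypothetical helper: fraction of bucketed documents where
// collection_enabled is true. Returns 0 when there are no documents.
function monitoringCollectionShare(buckets: TermsBucket[]): number {
  const total = buckets.reduce((sum, b) => sum + b.doc_count, 0);
  if (total === 0) return 0;
  const enabled = buckets
    .filter((b) => b.key_as_string === "true" || b.key === 1)
    .reduce((sum, b) => sum + b.doc_count, 0);
  return enabled / total;
}

// Made-up counts for illustration only; these are not real telemetry numbers.
const sampleBuckets: TermsBucket[] = [
  { key: 0, key_as_string: "false", doc_count: 750 },
  { key: 1, key_as_string: "true", doc_count: 250 },
];

console.log(`${monitoringCollectionShare(sampleBuckets) * 100}% via monitoring`);
```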

And could it be that those monitored clusters are also reporting local telemetry themselves? 😅

@afharo maybe, maybe not. We can't tell, since we don't combine local collection and collection through monitoring ATM.

I mean: the monitored cluster reporting local telemetry + the monitoring cluster reporting on its behalf.
It would be nice to know that ratio because if, for instance, 90% of the clusters that are reported via monitoring also report locally collected telemetry themselves, then disabling telemetry from monitoring would affect even fewer clusters: we would only lose 2.5% of the clusters.
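The arithmetic behind that 2.5% figure can be sketched as follows. The function name `shareOfClustersLost` is hypothetical, and the 90% overlap is the illustrative assumption from the comment above, not a measured value. It also treats the 25% of reported data as if it were 25% of clusters, which is a simplification.

```typescript
// Hypothetical helper: estimates the share of all reporting clusters that
// would stop reporting if telemetry shipped via monitoring were disabled.
function shareOfClustersLost(
  viaMonitoring: number, // fraction of reports arriving via monitoring
  alsoReportLocally: number // assumed fraction of those that also report locally
): number {
  // Only monitored clusters that do NOT also report locally are lost.
  return viaMonitoring * (1 - alsoReportLocally);
}

// 25% via monitoring, with an assumed 90% overlap with local collection.
const lost = shareOfClustersLost(0.25, 0.9);
console.log(`${(lost * 100).toFixed(1)}% of clusters would be lost`);
```

If the overlap were 0% instead, the full 25% would be lost, which is why the ratio matters for the decision.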

Are there any differences when Metricbeat is used vs. the legacy collector mechanism? Has this API

There should not be a difference here. The bulk uploader (which is used by the monitoring plugin when collecting monitoring data for legacy collection) and the stats API (which is used by Metricbeat when collecting monitoring data for Metricbeat collection) should use the exact same code or, at the very least, return the same output. There is a ticket to better consolidate this, but it hasn't been worked on yet. It's worth noting that we have a collection of parity tests that ensure Metricbeat-collected monitoring documents are identical to documents collected through legacy collection.

For future-proofing: the bulk uploader is going away in 8.0. We are currently deprecating that behavior for 7.x and will completely remove it in 8.0, so it might not be worth investing much in that area of the code.

We still want to be sure we understand the telemetry story here, but I'm not sure I'm entirely up to date on it. Happy to help any way I can, though.

I see this is targeted for 7.10. Are we still on track for that release? Many production clusters have monitoring enabled and we'll want to start receiving additional telemetry for them as soon as possible. Let me know if there is anything I can do to help expedite.

AFAIK, we are discussing an RFC to possibly remove the Kibana-related telemetry from the monitoring collection entirely. If that happens, I think we can close or repurpose this issue to make that happen 🙂

++ I believe we capture data from multiple Kibana instances today when monitoring is enabled, so we'd have to understand the impact of removing it completely.

I do think the lack of telemetry data from monitoring clusters will become more visible soon as we begin to trust and use the data. If 25% of clusters really have monitoring enabled, and most production clusters have monitoring enabled (assumption), then we're really only capturing a small subset of production clusters. Should we have a sync specifically to discuss the RFC?

After the discussions in the RFC and the changes in https://github.com/elastic/kibana/pull/82638, I think we can close this issue.

There will be one outstanding item:

  • [ ] Do not start collecting until Self-Monitoring is enabled and fully started.

But since bulk_uploader is going to be removed in 8.0, maybe we can let it be for now?
