Kibana: Collect non-sensitive platform-specific config stats

Created on 4 Feb 2020 · 30 Comments · Source: elastic/kibana

We would like to understand which platform features are used and how they are configured by users. That would help us make decisions about API completeness.
The main problem with this is that the platform doesn't have access to the Telemetry plugin. Nor does the telemetry plugin have access to the platform config.

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

Questions we'd like to answer for config things:

  • How many SavedObjects do we have?
  • Which plugins were disabled by the user?
  • Are any 3rd party plugins installed? Which ones?

Data to collect, all keys should be opt-in:

  • Disabled plugins + installed 3rd party plugins
  • HTTP config (?)
  • Elasticsearch config (?)
  • Logging config (?)
Labels: New Platform, Core, KibanaTelemetry

All 30 comments

Pinging @elastic/kibana-platform (Team:Platform)

Pinging @elastic/pulse (Team:Pulse)

The main problem with this is that the platform doesn't have access to the Telemetry plugin.

The only option here would be to open a hook-like API, as we do for HTTP authentication, I think. We really need a good reason to do so, however, as these are relatively heavy changes.

Nor does the telemetry plugin have access to the platform config.

Which part of the config would the plugin actually need access to? Everything?

Which part of the config would the plugin actually need access to? Everything?

In theory yes, we might want to track any part of the config: from the Elasticsearch SSL setup to logging flags.
Maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc. OTOH, our system should allow us to easily adjust the list of metrics.

@restrry The Telemetry plugin is only responsible for sending telemetry to our cluster. The usageCollection plugin will be responsible for fetching data from the platform in this case (via the usageCollection.registerCollector method).

I feel that we can move the usageCollection to core. We do not allow disabling this plugin since we always want it to be running. Usage collection is used by monitoring, all plugins sending telemetry, and potentially pulse collectors (still under discussion).

I feel that we can move the usageCollection to core. We do not allow disabling this plugin since we always want it to be running. Usage collection is used by monitoring, all plugins sending telemetry, and potentially pulse collectors (still under discussion).

That was my feeling too. A plugin that can't/shouldn't be disabled and is used by every other plugin should probably be part of core.

Reopening the discussion as this is currently due for 7.9.

  • Regarding the usageCollection plugin

So, as it seems we did reach a consensus on that, should we move the usageCollection feature from the current plugin to a core service?

usageCollection has no declared dependencies, so it's at least technically doable, even if this is going to be quite a lot of changes as the plugin is currently used a lot.

BTW The plugin is currently owned by @elastic/pulse. Moving it to core would change the owner to the platform team (I guess?). Is everybody alright with that?

  • Regarding the data core wants to collect

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

In theory yes, we might want to track any part of the config: from the Elasticsearch SSL setup to logging flags.
Maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc.

Any further insight on what we should collect, at least for the initial version?

BTW The plugin is currently owned by @elastic/pulse. Moving it to core would change the owner to the platform team (I guess?). Is everybody alright with that?

I have no objections to that.

One item we'd like to include as well is which 3rd party plugins are installed + enabled.

I think we can implement this without moving it to core. We can create a very simple plugin where we add all the known third-party plugins to the optionalDependencies in its kibana.json. The collector will check whether they are available in the dependencies parameter and return true or false based on that. Would that work?
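A rough sketch of that collector idea (the plugin names are hypothetical placeholders; the real `dependencies` parameter shape comes from the usageCollection plugin API):

```typescript
// Hypothetical sketch: the plugin's kibana.json would list known third-party
// plugins under optionalDependencies, and the collector's fetch method would
// report which of them actually show up among the available dependencies.
const KNOWN_THIRD_PARTY_PLUGINS = ['someVendorPlugin', 'anotherVendorPlugin'];

type ThirdPartyPluginUsage = Record<string, boolean>;

// `availablePlugins` stands in for the `dependencies` parameter: optional
// dependencies only appear there when they are installed and enabled.
function collectThirdPartyPluginUsage(
  availablePlugins: Record<string, unknown>
): ThirdPartyPluginUsage {
  const usage: ThirdPartyPluginUsage = {};
  for (const name of KNOWN_THIRD_PARTY_PLUGINS) {
    usage[name] = name in availablePlugins;
  }
  return usage;
}
```

Note that this only covers plugins on the known list, which is the fragility concern raised later in the thread.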

Also, if those plugins are registered apps in Kibana, application_usage will automatically report them whenever they are used.

In theory yes, we might want to track any part of the config: from the Elasticsearch SSL setup to logging flags.

"Any" seems a bit risky. I agree with only exposing a well-defined set of properties we may want to report.

Q: Can we retrieve any of those configurations via an Elasticsearch API request?

NB: Maybe core can actually define a CoreUsageCollector service that will call usageCollector.makeUsageCollector and usageCollector.registerUsageCollector. Technically, the usageCollector can receive new _registrations_ at any given point. It's not limited to the setup lifecycle step.

Maybe we should start by defining the information we're interested in: config values, plugin usage stats, runtime metrics, etc.

Please bear in mind that the current payload (it can be seen in Management > Stack Management > Kibana > Advanced Settings > Usage Data > See an example of what we collect) is already huge, and we might be duplicating effort on something that is already reported:

i.e.: We already collect the overridden configurations in the UI Management section: https://github.com/elastic/kibana/blob/master/src/plugins/kibana_usage_collection/server/collectors/management/telemetry_management_collector.ts#L26-L43

Re the runtime metrics, maybe APM is the way to go? Personally, I think anything we may implement to measure them would end up looking pretty similar to APM's existing agents. There is a blocker at the moment, but I'd say we push the effort to get it fixed over re-implementing our own solution.

I think we can implement this without moving it to core. We can create a very simple plugin where we add all the known third-party plugins to the optionalDependencies in its kibana.json

That seems like a very fragile mechanism compared to using the information from the plugin service, IMHO. Also, we won't be able to dissociate non-installed plugins from disabled ones.

NB: Maybe core can actually define a CoreUsageCollector service that will call usageCollector.makeUsageCollector and usageCollector.registerUsageCollector

core shouldn't (I won't say can't, because that's not technically entirely true, but it should be treated as such) consume any plugin API.

Please, bear in mind the current payload (it can be seen in Management > Stack Management > Kibana > Advanced Settings > Usage Data > See an example of what we collect) is already huge

Hmm, I don't know the details of how we send data, but is the whole payload containing all the metrics always sent? Don't we have an event-based API to send custom events with custom payloads?

That seems like a very fragile mechanism compared to using the information from the plugin service, IMHO. Also, we won't be able to dissociate non-installed plugins from disabled ones.

Yeah, core has more visibility and can provide better insights :)

core shouldn't (I won't say can't, because that's not technically entirely true, but it should be treated as such) consume any plugin API.

Fair enough. I'm really OK with moving the usageCollection plugin to core. It was a suggestion as an optional approach we might want to explore :)

Hmm, I don't know the details of how we send data, but is the whole payload containing all the metrics always sent? Don't we have an event-based API to send custom events with custom payloads?

We don't have any event-based API. We only send the usage payload once every 24h (with an asterisk here, because it's not that strict). At the point of sending the telemetry payload, we call the fetch methods registered to the usageCollection plugin.
At the moment, we don't send telemetry in an event-triggered way. Some pieces of the telemetry may be event related (like ui_metric), but they are stored locally and sent to the remote telemetry cluster only when the fetch method is called (around once every 24h).
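The pull model described above can be sketched as a minimal registry (the names and shapes here are illustrative assumptions; the actual usageCollection plugin API is richer than this):

```typescript
// Sketch of a pull-based usage collection registry: collectors register a
// fetch method, and all fetch methods are invoked together when the
// telemetry payload is assembled (~once every 24h).
interface UsageCollector<T = unknown> {
  type: string;
  fetch: () => Promise<T>;
}

class UsageCollectionRegistry {
  private collectors = new Map<string, UsageCollector>();

  // Registration is not limited to a setup phase; it can happen at any point.
  registerCollector(collector: UsageCollector): void {
    this.collectors.set(collector.type, collector);
  }

  // Called once per reporting cycle: pull the current value from every collector.
  async bulkFetch(): Promise<Record<string, unknown>> {
    const payload: Record<string, unknown> = {};
    for (const [type, collector] of this.collectors) {
      payload[type] = await collector.fetch();
    }
    return payload;
  }
}
```

Nothing is pushed between reporting cycles; event-like data (such as ui_metric) has to be buffered locally until `bulkFetch` runs.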

The primary argument I can think of against moving usage collectors to Core is more of an organization / team design one than a technical one. In essence, I think it's simpler if all the code in a given subdirectory is fully owned by a single team. It eliminates communication overhead and allows a single team to make sweeping changes to that directory without coordination.

That said, usage collectors are such a small pattern / mechanism that it may be simple enough to just move them to Core and make the Platform team the owners. However, if there are future enhancements planned for this pattern, or we expect to use something else entirely in the near to mid term, then I think we should explore other options.

One way we could achieve what we need here without moving usage collectors is to expose an API from Core that _essentially_ is a usage collector, but isn't tied into the registry of usage collectors directly. Instead, the telemetry plugin could set up a usage collector to call this Core API in order to grab data from Core.

Example

Extension of the Core API

interface CoreSetup {
  // Core could expose some API that is _essentially_ a usage collector
  telemetry: {
    type: 'kibana_core';
    schema: CoreTelemetrySchema;
    fetch(): Promise<CoreTelemetry>;
  };
}

Telemetry plugin uses Core API to create a core usage collector

class TelemetryPlugin {
  setup(core, plugins) {
    // `makeUsageCollector` and `registerCollector` come from the
    // usageCollection plugin dependency, not from a global.
    const coreCollector = plugins.usageCollection.makeUsageCollector(core.telemetry);
    plugins.usageCollection.registerCollector(coreCollector);
  }
}

The primary argument I can think of against moving usage collectors to Core is more of an organization / team design one than a technical one

We shouldn't restrict ourselves to the current team/organization design when it comes to making the right decisions. Teams can be restructured and need to change with Elastic's needs 😄

I think it's simpler if all the code in given subdirectory is fully owned by a single team. It eliminates communication overhead and allows a single team to make sweeping changes to that directory without coordination.

A lot of different teams send usage data to our telemetry cluster, and the demand for data is increasing. The data we collect from Kibana allows the business to monitor solution adoption and, IMHO, any sweeping changes made to the collection mechanism and the data we collect need to be communicated and controlled. ATM, nothing's really stopping the small kibana-telemetry team from making such changes in isolation and, with the importance of usage data increasing, it's maybe time to reconsider that approach.

However, if there are future enhancements planned for this pattern, or we expect to use something else entirely in the near to mid term, then I think we should explore other options.

Code wise, we can certainly take the approach you suggest to get the job done. We do have a few improvement initiatives we're working on while helping Infra take on the task of the remote Elastic Telemetry service.

A lot of different teams send usage data to our telemetry cluster, and the demand for data is increasing. The data we collect from Kibana allows the business to monitor solution adoption and, IMHO, any sweeping changes made to the collection mechanism and the data we collect need to be communicated and controlled. ATM, nothing's really stopping the small kibana-telemetry team from making such changes in isolation and, with the importance of usage data increasing, it's maybe time to reconsider that approach.

Sounds like there is potential for a decent amount of change coming here. We aim for Core APIs to be stable-ish, so I lean towards the approach I outlined above. If we go with this option, I think the Platform team could handle all of the changes. We can have the Pulse team review the change to the usage_collection plugin to integrate with the new Core API.

The main downside I see of this approach is that we'd be exposing internal data from core that is not normally available to plugins. However, I think we can document that this API is only valid for telemetry and should not be used for any other reason. @elastic/kibana-platform any objections?

@joshdover What I was trying to say is that we need to get to a point where the usage collection and telemetry APIs _are_ stable. The improvements we're investigating are under the condition that they won't change the usage collection APIs. We're striving to get to a point where collecting data and shipping it to the remote service is as stable and reliable as using Saved Objects or any of the other Core capabilities.

I'm by no means trying to shift the work over to another team; we're more than willing to take the task on, and @afharo's already working on a draft. What I was trying to say is that telemetry has become so important that we need to be careful about the approach from here on.

That being said, we can always reconsider the approach later if needs be.
For now, my vote goes to the Extension of the Core API
I'll wait to see what others think.

Thank you @TinaHeiligers and @joshdover!

I think I agree with you both: the Usage Collection APIs (not to be confused with the telemetry plugin) are key APIs and I think they should belong to core eventually.
At the same time, I think that we, at the Kibana Telemetry team, are currently planning some improvements on the existing APIs (either extensions or internal improvements), so maybe they are not mature and stable enough to do the move for now.

+1 on the core.telemetry extension, but with two heads-ups:

  1. Beware of the data publicly exposed to other plugins (maybe, for security purposes, we can ensure that core.telemetry.getUsageCollector() can only be called once, or Kibana will crash?)
  2. The Kibana Telemetry team will aim to design the UsageCollector plugin in a way that will be easier to migrate in the future (as long as these thoughts don't affect the velocity of delivering features and improvements too much) :)

Looking forward to knowing all your thoughts :)

interface CoreSetup {
  // Core could expose some API that is _essentially_ a usage collector
  telemetry: {
    type: 'kibana_core';
    schema: CoreTelemetrySchema;
    fetch(): Promise<CoreTelemetry>;
  };
}

Let's be honest: it's definitely not a correct design, we are just cheating with pseudo inversion of control here. But it's the most pragmatic solution in terms of impact on the codebase and on the code owners, so I'm not opposed to it.

However, I'm wondering whether we will be able to achieve this option without importing (or duplicating) types from the usageCollector plugin into core. What is going to be the type of CoreSetup.telemetry?

I totally understand your point, @pgayvallet!

How about meeting somewhere in the middle? What about core providing a 'stats' API (as part of the CoreStart contract instead) that works as some form of fetch method, while we register the collector in the kibana_usage_collection plugin (that's where we've defined most of the "general-purpose" usage collectors)?

@joshdover the 7.9 deadline looks a little tight here. Do you still think that we can make it before FF?

@TinaHeiligers I think it's likely this will slip to 7.10 at this point :-/ And that's not on y'all, we've been working on other tasks.

I've run through all of Core's configuration and come up with a potential list of non-sensitive metrics that we could collect. For sensitive configuration keys (e.g. hostname, ssl), I include fields that indicate a feature is being used rather than including the actual configuration value.

There is some overlap with some of the metrics collected by src/plugins/kibana_usage_collection/server/collectors/ops_stats/ops_stats_collector.ts which is supposed to include many of the stats exposed by the MetricsService. However, when taking a look at our actual telemetry, I was not able to find many of these values in our dataset. Some clarification from @afharo would be helpful here on what is actually included and reported.

export interface CoreTelemetry {
  config: CoreConfigTelemetry;
  usage: CoreUsageTelemetry;
  environment: CoreEnvironmentTelemetry;
}
/**
 * Telemetry data on this cluster's usage of Core features
 */
export interface CoreUsageTelemetry {
  savedObjects: {
    totalCount: number;
    typesCount: number;
    /** Record of Kibana version -> number of failures that occurred on that version's upgrade attempt */
    migrationFailures: Record<string, number>;
  };
}
/**
 * Telemetry data on this Kibana node's runtime environment.
 */
export interface CoreEnvironmentTelemetry {
  memory: {
    heapTotalBytes: number;
    heapUsedBytes: number;
    /** V8 heap size limit */
    heapSizeLimit: number;
  };
  os: {
    platform: string;
    platformRelease: string;
    distro?: string;
    distroRelease?: string;
  };
}
/**
 * Telemetry data on this cluster's configuration of Core features
 */
export interface CoreConfigTelemetry {
  elasticsearch: {
    sniffOnStart: boolean;
    sniffIntervalMs?: number;
    sniffOnConnectionFault: boolean;
    numberOfHostsConfigured: number;
    preserveHost: boolean;
    requestHeadersWhitelistConfigured: boolean;
    customHeadersConfigured: boolean;
    shardTimeoutMs: number;
    requestTimeoutMs: number;
    pingTimeoutMs: number;
    startupTimeoutMs: number;
    logQueries: boolean;
    ssl: {
      verificationMode: 'none' | 'certificate' | 'full';
      certificateAuthoritiesConfigured: boolean;
      certificateConfigured: boolean;
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
      alwaysPresentCertificate: boolean;
    };
    apiVersion: string;
    healthCheck: {
      delayMs: number;
    };
  };

  http: {
    basePath: string;
    maxPayloadInBytes: number;
    rewriteBasePath: boolean;
    keepaliveTimeout: number;
    socketTimeout: number;
    compression: {
      enabled: boolean;
      referrerWhitelistConfigured: boolean;
    };
    cors: {
      enabled: boolean;
    };
    xsrf: {
      disableProtection: boolean;
      whitelistConfigured: boolean;
    };
    requestId: {
      allowFromAnyIp: boolean;
      ipAllowlistConfigured: boolean;
    };
    ssl: {
      certificateAuthoritiesConfigured: boolean;
      certificateConfigured: boolean;
      cipherSuites: Array<'TLSv1' | 'TLSv1.1' | 'TLSv1.2'>;
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
      redirectHttpFromPort?: number;
      supportedProtocols: Array<'TLSv1' | 'TLSv1.1' | 'TLSv1.2'>;
      clientAuthentication: 'none' | 'optional' | 'required';
    };
  };

  logging: {
    appendersTypesUsed: string[];
    loggersConfiguredCount: number;
  };

  plugins: {
    /** list of built-in plugins that are disabled */
    firstPartyDisabled: string[];
    /** list of third-party plugins that are installed and enabled */
    thirdParty: string[];
  };

  savedObjects: {
    maxImportPayloadBytes: number;
    maxImportExportSizeBytes: number;
    migrations: {
      batchSize: number;
      scrollDurationMs: number;
      pollIntervalMs: number;
      skip: boolean;
    };
  };

  uiSettings: {
    overridesCount: number;
  };
}
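The "report that it's configured, not the value" convention used above for sensitive keys could be implemented with a small helper along these lines (`isConfigured` and `httpSslTelemetry` are hypothetical names for illustration, not existing Kibana APIs):

```typescript
// Hypothetical helper: report only whether a sensitive value is set,
// never the value itself.
function isConfigured(value: unknown): boolean {
  if (value === undefined || value === null) return false;
  if (typeof value === 'string') return value.trim().length > 0;
  if (Array.isArray(value)) return value.length > 0;
  if (typeof value === 'object') return Object.keys(value).length > 0;
  return true;
}

// Example: mapping raw (sensitive) config into the telemetry shape above.
function httpSslTelemetry(sslConfig: { key?: string; keystore?: { path?: string } }) {
  return {
    keyConfigured: isConfigured(sslConfig.key),
    keystoreConfigured: isConfigured(sslConfig.keystore?.path),
  };
}
```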

@joshdover what are your thoughts on collecting ssl config for CoreConfigTelemetry['http'], similar to what we've done for CoreConfigTelemetry['elasticsearch']?

Perhaps something like (added a bonus cors setting too):

/**
 * Telemetry data on this cluster's configuration of Core features
 */
export interface CoreConfigTelemetry {
  http: {
    cors: {
      enabled: boolean;
    };
    ssl: {
      enabled: boolean;
      supportedProtocols: string[];
      keyConfigured: boolean;
      keystoreConfigured: boolean;
      truststoreConfigured: boolean;
    };
  };
} 

Makes sense, not sure how I skipped http.ssl

EDIT: updated my comment with the http.ssl and http.cors fields.

What about CSP? I could see it being helpful to know how many users have a custom CSP policy in place.

For SSL, when the certificate is specified, the key must also be specified. I can't think of a good name to denote both of these being configured, so perhaps that's justification for leaving them separate?

What about CSP? I could see it being helpful to know how many users have a custom CSP policy in place.

CSP is already being reported here, although it'd be nice to consolidate the fields into CoreConfigTelemetry instead of having them spread out:

https://github.com/elastic/kibana/blob/fd459dea5d884e883be6fe26747121e768a7844b/src/plugins/kibana_usage_collection/server/collectors/csp/csp_collector.ts#L45-L55

This is fantastic. We have so many different types of saved objects now; is there any value in getting a count by type?

There is some overlap with some of the metrics collected by src/plugins/kibana_usage_collection/server/collectors/ops_stats/ops_stats_collector.ts which is supposed to include many of the stats exposed by the MetricsService. However, when taking a look at our actual telemetry, I was not able to find many of these values in our dataset. Some clarification from @afharo would be helpful here on what is actually included and reported.

Hey @joshdover! Sorry for my late response. This collector is treated in a special way at the moment:

  1. For telemetry purposes, it only reports the OS information (code stripping kibana_stats out of the payload).
  2. In Monitoring, it's used by the legacy collector (AFAIK, to be deprecated by 8.0) to report the ops stats to the monitoring cluster.

We have so many different types of saved objects now, is there any value in getting count by type?

@alexfrancoeur we already report the saved objects count per type in here. Although it's limited to these 6 types only. We might want to revisit that list?

@alexfrancoeur we already report the saved objects count per type in here. Although it's limited to these 6 types only. We might want to revisit that list?

Ideally, we could generate metrics that are more dynamic so that all types are reported on; that would probably be easier to do from within Core. For saved object migrations, it would also be useful to understand the size of the .kibana/.kibana_task_manager indices.
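A dynamic per-type count could look roughly like this toy in-memory sketch; in practice it would more likely be a terms aggregation on the `type` field of the `.kibana` index:

```typescript
// Hypothetical sketch: count saved objects per type without a hardcoded
// allowlist, so newly added types are reported automatically.
interface SavedObjectDoc {
  type: string;
}

function countByType(docs: SavedObjectDoc[]): Record<string, number> {
  return docs.reduce<Record<string, number>>((acc, doc) => {
    acc[doc.type] = (acc[doc.type] ?? 0) + 1;
    return acc;
  }, {});
}
```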

http: {
  basePath: string;

Do we care about the actual basepath or just whether or not it was set? Should we rather use the following?

basePathConfigured: boolean

I'm leaving out http.cors until we implement it https://github.com/elastic/kibana/issues/16714

Do we care about the actual basepath or just whether or not it was set? Should we rather use the following?

basePathConfigured: boolean

Good catch, I think basePathConfigured is what we want.
