map() query type inside of source() (PR #21016)
The lack of any configuration options specific to the Infra UI should be treated as if the following configuration was present:
xpack.infra:
  sources:
    default:
      metricAlias: 'metricbeat-*'
      logAlias: 'filebeat-*'
      fields:
        message: 'message'
        host: 'beat.hostname'
        pod: 'kubernetes.pod.name'
        container: 'docker.container.name'
        timestamp: '@timestamp'
        tiebreaker: '_doc'
  query:
    partitionSize: 75
    partitionFactor: 1.2
xpack.infra.sources.default: This is the default source. Any additional sources defined here will be ignored for now until the UI offers facilities to switch between sources (#20662).
xpack.infra.sources.default.fields.tiebreaker: If we assume the data to be only filebeat data, the offset field might also be appropriate. Otherwise _doc would be a relatively reliable default (even more so once elastic/elasticsearch#25674 is fixed).
xpack.infra.query.partitionSize: The size of the partitions for the nodes aggregation.
xpack.infra.query.partitionFactor: To get 75 results per partition, it's necessary to request 20% more to get a complete set due to how nodes are distributed across shards.
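As a rough illustration of how the two query settings relate, the sketch below multiplies the partition size by the factor to get the number of entries to request. The `QueryConfig` type and the `requestSize` helper are hypothetical names made up for this example, and the rounding behaviour is an assumption, not something taken from the Infra UI code.

```typescript
// Hypothetical shape of the `xpack.infra.query` settings (illustration only).
interface QueryConfig {
  partitionSize: number;
  partitionFactor: number;
}

// Assumption: the factor is applied as a plain multiplier and rounded up,
// so that "20% more" than the partition size is requested.
function requestSize(query: QueryConfig): number {
  return Math.ceil(query.partitionSize * query.partitionFactor);
}

// The default values from the configuration above.
const defaultQuery: QueryConfig = { partitionSize: 75, partitionFactor: 1.2 };
const size = requestSize(defaultQuery);
console.log(size);
```

With the defaults, this requests 90 entries in order to end up with a complete set of 75.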
With the desire in mind to allow for multiple sets of source configurations, how about something like
xpack.infra:
  sources:
    default: # this would be the name of the group
      metricAlias: 'xpack-infra-default-metrics'
      logAlias: 'xpack-infra-default-logs'
      fields:
        message: 'message'
        hostname: 'beat.hostname'
        pod: 'kubernetes.pod.name'
        container: 'docker.container.name'
        timestamp: '@timestamp'
        tiebreaker: '_doc' # or 'offset'?
  query:
    partitionSize: 75
    partitionFactor: 1.2
Nit: Could we rename host: to hostname: in the above? I think these are two different things.
@weltenwort I think we should go with your proposal
ok, I'll edit the issue description to represent the current state
Another aspect I'm deliberating right now is how the configuration is communicated between client and server.
Most queries require knowledge about a specific data source configuration by the server. There are (at least) two possible ways in which they can be made available to the server:
1. All required configuration is submitted by the client to the server as part of the query arguments. That has several implications:
2. A unique identifier of the configuration set is submitted by the client to the server as part of the query arguments. Implications:
I will go with variant 2 for now and use 1 as a fallback if I encounter too many obstacles.
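A minimal sketch of what variant 2 could look like on the server side: the client only sends a source ID, and the server looks up the matching configuration set. The types and the `resolveSource` function are hypothetical and only illustrate the idea; they are not the actual Infra UI implementation.

```typescript
// Hypothetical, simplified source configuration (illustration only).
interface SourceConfiguration {
  metricAlias: string;
  logAlias: string;
}

type SourcesConfig = Record<string, SourceConfiguration>;

// Look up the configuration set for the ID submitted with the query arguments.
function resolveSource(sources: SourcesConfig, sourceId: string): SourceConfiguration {
  // Guard against IDs that only match inherited properties such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(sources, sourceId)) {
    throw new Error(`Unknown source: ${sourceId}`);
  }
  return sources[sourceId];
}

// Server-side configuration, e.g. as read from kibana.yml.
const configuredSources: SourcesConfig = {
  default: { metricAlias: 'metricbeat-*', logAlias: 'filebeat-*' },
};

// The client would only send `sourceId: 'default'` with its query.
const resolved = resolveSource(configuredSources, 'default');
console.log(resolved.logAlias);
```

The client never sees or sends index names this way, at the cost of the server needing access to the configuration for every query.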
We also need to take into account #21884
Because there have been questions from several sides about the reasons for this config structure, I'll try to lay that out below.
One strength of the Elastic Stack is the flexibility it exhibits to allow users to integrate it into their own infrastructure and adapt it to their needs. To stay true to that spirit, the Infra UI will try to provide several configuration settings. For the first phase we opted for static configuration in the kibana.yml, because it is easy to implement and powerful enough for many use cases.
When interviewing users from the target group, we were consistently told that they often need to partition the logs and metrics into separate groups that correspond to sections of their infrastructure and/or teams in their organisation. That is why the configuration is structured such that the settings that relate to the consumption of data are grouped into "sources".
For each "source" the configuration must specify which indices the logs and metrics are read from. This is done by specifying a read alias for each type of data (logs and metrics). Using aliases instead of plain index patterns enables the easy implementation of a simple configuration UI that can add indices to that source as well as easy self-enrollment of data sources (e.g. metric-/filebeat via their index template).
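To illustrate why aliases make enrollment easy, here is a sketch that builds the request body for the Elasticsearch `_aliases` API, which accepts a list of `add` actions. The helper name `buildAddToAliasBody` and the index names are made up for this example; how a future configuration UI would actually send the request is not specified here.

```typescript
// One `add` action as accepted by the Elasticsearch `_aliases` API.
interface AddAliasAction {
  add: { index: string; alias: string };
}

// Hypothetical helper: build the body for `POST /_aliases` that adds
// each given index to the read alias of a source.
function buildAddToAliasBody(alias: string, indices: string[]): { actions: AddAliasAction[] } {
  return {
    actions: indices.map((index) => ({ add: { index, alias } })),
  };
}

// Example index names are placeholders, not real defaults.
const aliasBody = buildAddToAliasBody('xpack-infra-default-logs', [
  'filebeat-2018.08.01',
  'filebeat-2018.08.02',
]);
console.log(JSON.stringify(aliasBody, null, 2));
```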
To increase the chance that users can deploy this to handle their existing data, a few salient fields like the timestamp field and fields identifying various different entities like hosts and containers can be configured.
To provide an easy "getting started" experience, all the settings are optional. If the user does not specify any source, a "default" source as described in the section "Defaults" is assumed.
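One way the "everything is optional" behaviour could be implemented is by merging whatever the user specified over the default source. The sketch below uses the default values listed at the top of this issue; the type names and the `withDefaults` function are hypothetical and only meant to show the merging idea.

```typescript
// Hypothetical types mirroring the proposed configuration structure.
interface SourceFields {
  message: string;
  host: string;
  pod: string;
  container: string;
  timestamp: string;
  tiebreaker: string;
}

interface SourceConfiguration {
  metricAlias: string;
  logAlias: string;
  fields: SourceFields;
}

// Every setting may be omitted by the user.
interface PartialSourceConfiguration {
  metricAlias?: string;
  logAlias?: string;
  fields?: Partial<SourceFields>;
}

// Defaults as described in this issue.
const DEFAULT_SOURCE: SourceConfiguration = {
  metricAlias: 'metricbeat-*',
  logAlias: 'filebeat-*',
  fields: {
    message: 'message',
    host: 'beat.hostname',
    pod: 'kubernetes.pod.name',
    container: 'docker.container.name',
    timestamp: '@timestamp',
    tiebreaker: '_doc',
  },
};

// Overlay user settings on the defaults; `fields` is merged key by key
// so that overriding one field keeps the other defaults.
function withDefaults(user: PartialSourceConfiguration = {}): SourceConfiguration {
  return {
    ...DEFAULT_SOURCE,
    ...user,
    fields: { ...DEFAULT_SOURCE.fields, ...user.fields },
  };
}

// A user who only overrides the tiebreaker still gets all other defaults.
const merged = withDefaults({ fields: { tiebreaker: 'offset' } });
console.log(merged);
```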
Should we switch to using the fields host.name and container.id as specified in the ECS instead of the beat-specific fields? That would be more consistent with other apps like APM, which becomes especially important once we implement linking between them.
I would probably stick to beat.hostname for the 6.x release and change it over to host.name in 7.0. The reason is that in this case the UI would also work with Beats data older than 6.4.