This is an overview of the Logstash integration with Elasticsearch data streams. The integration will take the form of a new Elasticsearch Data Stream output plugin under the Elastic Basic license. This new plugin will be the go-forward approach for indexing any time series dataset (logs, metrics, etc.) into Elasticsearch. Non-time-series use cases will continue to use the existing Elasticsearch output plugin.
This plugin will adopt the new indexing strategy under the {type}-{dataset}-{namespace} format, leveraging the composable templates bundled in Elasticsearch starting in 7.9.
The default data stream name will be logs-generic-default. This default enables users to easily correlate data across different data sources (e.g. with logs-* and logs-generic-*) in Elasticsearch. Given the new indexing strategy, the type, dataset, and namespace of the data stream name can all be configured separately.
As Logstash will not be fully ECS compliant until 8.0, there are caveats we need to document (or provide bootstrap checks) for users to avoid ECS conflicts.
output {
  elasticsearch_data_stream {
    hosts => "hostname" # defaults to "localhost" on port 9200
  }
}
Minimal settings to get started. Events with the data_stream.* fields will automatically be routed to the appropriate data streams. Defaults to logs-generic-default if the fields are missing.
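For illustration, a minimal sketch of the proposed behavior using a generator input (the field values here are made up):

input {
  generator {
    count => 1
    # this event carries its own data stream coordinates,
    # so it would be routed to logs-nginx-default
    add_field => {
      "[data_stream][type]"      => "logs"
      "[data_stream][dataset]"   => "nginx"
      "[data_stream][namespace]" => "default"
    }
  }
}
output {
  elasticsearch_data_stream {
    hosts => "hostname" # events without data_stream.* fields fall back to logs-generic-default
  }
}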
output {
  elasticsearch_data_stream {
    hosts => "hostname"
    timestamp => "@timestamp"
    type => "metrics"
    dataset => "foo"
    namespace => "bar"
  }
}
Beyond the base settings we can inherit from the existing ES output, there are net new data stream specific settings:
- timestamp (timestamp, required) - the timestamp used for the data stream. Defaults to @timestamp. This should be configurable in Elasticsearch 7.10.
- type (string, optional) - the data stream type (only logs or metrics is allowed) used to construct the data stream at index time. This field does not support hyphens (-). Defaults to logs.
- dataset (string, optional) - the data stream dataset used to construct the data stream at index time. This field does not support hyphens (-). Defaults to generic.
- namespace (string, optional) - the data stream namespace used to construct the data stream at index time. This field does not support hyphens (-). Defaults to default.
- auto_routing (boolean, optional) - automatically routes events by deriving the data stream name using specific event fields with the %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace} format. This setting takes precedence over the type, dataset, and namespace settings, but can fall back to them if any data_stream.* fields are absent. Defaults to true.

Additionally, there are many settings from the existing Elasticsearch output that we could consider removing with this new plugin. This is not an exhaustive list:
- document_type - this is legacy cruft; types in ES are now obsolete.
- action, doc_as_upsert, scripted_upsert, script, script_lang, script_type, script_var_name, version, version_type - prevent any update actions.
- ilm_enabled, ilm_pattern, ilm_policy, ilm_rollover_alias - ILM options are no longer necessary.
- template, template_name, template_overwrite, manage_template - template management on the LS side is no longer necessary.

Logstash often acts as an intermediary for receiving data from other systems like the Elastic Agent and Kafka. For these use cases, Logstash will by default use the data_stream.type, data_stream.dataset, and data_stream.namespace fields to derive the data stream name. This allows events from the Elastic Agent to automatically be routed to the appropriate Elasticsearch data stream when using Logstash in between. This feature can be disabled by configuring the auto_routing setting to false.
Format: %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace}
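As a sketch of that intermediary pattern (the beats input and port are illustrative assumptions; the output follows the proposal above):

input {
  beats {
    port => 5044 # Elastic Agent / Beats events arrive here with data_stream.* fields set
  }
}
output {
  elasticsearch_data_stream {
    hosts => "hostname"
    auto_routing => true # default; derive the stream name from the event's data_stream.* fields
  }
}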
Data streams are a Basic feature, so this integration will only be distributed with the default distribution of Logstash.
The primary limitation of data streams is that documents cannot be updated. Logstash users have historically used the existing Elasticsearch output plugin's capabilities to conduct document updates and achieve exactly-once delivery semantics.
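For context, data streams only accept create operations, which is what rules out updates; a sketch of the append-only pattern with the existing output (hosts and index name are illustrative):

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "logs-generic-default"
    action => "create"       # data streams reject update/delete and plain index actions
    manage_template => false # composable templates are bundled with ES 7.9+
  }
}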
- logs-generic-default is the default data stream for generic data from Logstash and the Elastic Agent. If users express feedback that it's difficult to identify Logstash-sourced data in the shared data stream, we could consider adding a from-logstash tag to the tags ECS base field for events coming from Logstash.
- When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

@acchen97 Two notes on the above issue:
- data_stream.dataset and not data_stream.name
- host - I know this has been used a lot in the past, but I wonder if we could also change the config name to something ECS compatible?

@ruflin thanks for your notes. I've reconciled the former in the original issue. For the latter, I think we can stick with hosts as it's accurately descriptive and also consistent with how Beats and Agent configure it in the ES output. I'm not sure being ECS compatible is as critical here, given this is a configuration setting rather than an event field.
@mostlyjason FYI.
will there be cloud.id/cloud.auth support?
@enotspe I would think so; I don't see why we would differ from the authentication strategies of the current elasticsearch output.
@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.
@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?
When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.
And the question is more about whether to make this a configurable default behaviour or not, i.e. should we allow the user to disable it for documents that do not come from Agent and do not contain these fields?
And as a follow-up question: if the user sets auto_routing => false and the document contains the data_stream.type, data_stream.dataset, and data_stream.namespace fields, should we overwrite the fields with the plugin-configured values?
@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.
Those fields have validation on the agent side too, to ensure safety with ES index name constraints.
Following the discussion in https://github.com/elastic/kibana/issues/75846, we implemented 20/100/100-byte length restrictions for the type, dataset, and namespace strings, respectively.
@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?
When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.
I'm still a bit hesitant on tackling this in the first version. This would only apply to data that is not sent from Agent, and I'm not sure how adding this would impact those use cases yet. Also, it's not clear to me if and how these data_stream.* fields will be used in ES queries and downstream UI components. Perhaps we can wait for user feedback before we decide whether we want to add it. It's typically easier to add features rather than remove them in the future. /cc @jsvd
It is important that we add these fields. The new indexing strategy requires these fields to be there. It is expected that all dashboards / visualizations we build, and hopefully also the ones from the community, will filter on these. It will make the queries, and with them the dashboards, much faster. If the fields are not in line with the indexing strategy, things will break apart.
As we are closing in on the release of the logstash data streams output plugin, discussion on the handling of data_stream.* in the event continues in https://github.com/logstash-plugins/logstash-output-elasticsearch_data_streams/issues/2

@acchen97 For a non-agent use case: we have a multi-tenant strategy where each tenant has its own index, such as datalake-tenant1, datalake-tenant2. We use Logstash to feed data and set the index to the correct tenant. Under the new indexing strategy and this plugin, can we support this model: logs-tenant-dataset, where tenant = the ECS field organization.id?
@Karrade7 @acchen97 Good point. In the current model with auto_routing: true, you could, for example, use mutate filter(s) to set the value of any of the data_stream.type, data_stream.dataset, or data_stream.namespace fields, as shown below.
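A minimal sketch of that approach (the tenant1 value is illustrative):

filter {
  mutate {
    # set the stream coordinates; with auto_routing => true the plugin
    # would send this event to logs-tenant1-default
    replace => {
      "[data_stream][type]"      => "logs"
      "[data_stream][dataset]"   => "tenant1"
      "[data_stream][namespace]" => "default"
    }
  }
}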
But we could also provide string interpolation for the type, dataset and namespace options - that way you could reference the value of any event field. I think this makes sense; it adds even more flexibility.
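That could look something like the following (a sketch only; interpolation support for these options is a proposal here, not implemented yet):

output {
  elasticsearch_data_stream {
    hosts     => "hostname"
    type      => "logs"
    dataset   => "%{[organization][id]}" # per-tenant dataset taken from an event field
    namespace => "default"
  }
}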
@colinsurprenant I think string interpolation and flexibility in general will be important here.
I think even with dataset there will be issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way, but the index to be named differently. A perfect example of this is non-compliant characters in index names. If dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index as "dl-cylance-{{organization.name}}". This did not work, as some organization names had upper and lowercase characters and they would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
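One way to work around that today is to sanitize the value before it reaches the output; a sketch (organization.name mirrors the example above):

filter {
  # derive a dataset from organization.name, then make it index-safe:
  # datasets allow neither uppercase letters nor hyphens
  mutate {
    replace => { "[data_stream][dataset]" => "%{[organization][name]}" }
  }
  mutate {
    lowercase => [ "[data_stream][dataset]" ]
    gsub      => [ "[data_stream][dataset]", "-", "_" ]
  }
}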
@Karrade7 I am not sure I understand your concern correctly; there are 2 things at play here:
1. The type, dataset and namespace options are used primarily to create the index name when auto_routing: false. The index name will always be {type}-{dataset}-{namespace}, except when auto_routing: true, where the event fields [data_stream][type], [data_stream][dataset], [data_stream][namespace] will be used; if one of these fields is missing, the corresponding plugin option will be used.
2. With set_data_stream_fields: true, the event fields [data_stream][type], [data_stream][dataset], [data_stream][namespace] will always be updated by the plugin to match the values used to create the index name.

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? If so, it would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.
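A sketch of the two modes described above (set_data_stream_fields is a proposed option here, not a released one):

output {
  elasticsearch_data_stream {
    hosts => "hostname"
    # with auto_routing disabled, the options alone determine the
    # target stream: metrics-foo-bar
    auto_routing => false
    type      => "metrics"
    dataset   => "foo"
    namespace => "bar"
    # proposed: rewrite the event's data_stream.* fields so they
    # match the stream the event was written to
    set_data_stream_fields => true
  }
}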
I think even with dataset there will be issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way, but the index to be named differently. A perfect example of this is non-compliant characters in index names. If dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index as "dl-cylance-{{organization.name}}". This did not work, as some organization names had upper and lowercase characters and they would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
@Karrade7 hyphens are indeed not allowed in the dataset and namespace. @ph are there any restrictions on using uppercase letters in the new indexing strategy?
Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? If so, it would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.
@colinsurprenant I believe the data_stream.* fields will need to match the data stream name. The data_stream.* fields are constant keywords and my understanding is that the field values will need to be the same across an entire index.
Why should be "type" limited only to logs and metrics? Currently we use the similar naming but we use following options: logs, metrics, monitors (typically up/down monitors, events from heartbeat, ...) and data (real application data, not logs).
@vbohata this is a good question. Ultimately nothing will really prevent someone from having a "custom" data stream type other than "logs" and "metrics", but in the short term these are the only ones that have bundled ES templates and for which some visualizations will exist. The type option might have restrictions when first released, and we might allow arbitrary types in the future; this is still being evaluated.
I'm a bit confused, is this plugin already in 7.10?
After I saw a presentation from @ruflin about data streams, I started digging into how to integrate metricbeat and filebeat with data streams...
I did some extra processing on the Beat and Logstash sides.
Here is what I'm adding to the Beat config:
processors:
  - add_fields:
      fields:
        namespace: default
        type: metrics
      target: data_stream
  - copy_fields:
      fail_on_error: false
      fields:
        - from: event.dataset
          to: data_stream.dataset
      ignore_missing: true
And on the Logstash side I'm doing this:
elasticsearch {
  id => "elasticsearch_stream"
  hosts => 'http://tiny-master:9200'
  index => "%{[data_stream][type]}-%{[data_stream][dataset]}-%{[data_stream][namespace]}"
  manage_template => false
  action => "create"
}
In Kibana, to enable all the templates and dashboards, I just add a fake agent and everything is created.
It seems to work only with metricbeat and filebeat; just some fields sometimes raise shard errors:
"failures": [
{
"shard": 0,
"index": ".ds-metrics-system.service-default-000001",
"node": "wGq5lQ7BSKOYPV9_zkPQ3Q",
"reason": {
"type": "illegal_argument_exception",
"reason": "Field [system.service.state_since] of type [keyword] does not support custom formats"
}
}
]
Maybe this is because I'm not setting all the fields correctly?
With auditbeat, data streams do not seem to work.
@cdino Nice work! You don't need to create a fake Agent; if you go to the Settings of an integration, there is an install button. One thing to keep in mind: there is a chance that we make some breaking changes to the packages compared to the modules. Your error might be related to this; it seems like the format of system.service.state_since might be different. Sounds like this should be a date field?
@cdino What you discovered is that if you know what you are doing, you don't need the new plugin ;-) 👏
Curious to hear what errors you got on the auditbeat side.
@ruflin Thanks! Yes, I will avoid using it in production for now :) but I really like this approach; it will help us a lot in the future.
I will give auditbeat another try, but it seems that there is no data stream mapping for the data_stream.[dataset] related events.
Is the plugin released or not? I did not find a repo or any details. I wanted to check the roadmap for it.