This is an overview of the Logstash integration with Elasticsearch data streams. The integration will take the form of a new Elasticsearch Data Stream output plugin under the Elastic Basic license. This new plugin will be the go-forward approach for indexing any time series dataset (logs, metrics, etc.) into Elasticsearch. Non-time-series use cases will continue to use the existing Elasticsearch output plugin.
This plugin will adopt the new indexing strategy under the {type}-{dataset}-{namespace} format, leveraging the composable templates bundled in Elasticsearch starting in 7.9.
The default data stream name will be logs-generic-default. This default enables users to easily correlate data across different data sources (e.g. with logs-* and logs-generic-*) in Elasticsearch. Given the new indexing strategy, the type, dataset, and namespace of the data stream name can all be configured separately.
As Logstash will not be fully ECS compliant until 8.0, there are caveats we need to document (or provide bootstrap checks) for users to avoid ECS conflicts.
output {
  elasticsearch_data_stream {
    hosts => "hostname" # defaults to "localhost" on port 9200
  }
}
Minimal settings to get started. Events with the data_stream.* fields will automatically be routed to the appropriate data streams. Defaults to logs-generic-default if the fields are missing.
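For illustration, a minimal sketch of the proposed behavior using a generator input (the field values here are made up):

input {
  generator {
    count => 1
    # this event carries its own data stream coordinates,
    # so it would be routed to logs-nginx-default
    add_field => {
      "[data_stream][type]"      => "logs"
      "[data_stream][dataset]"   => "nginx"
      "[data_stream][namespace]" => "default"
    }
  }
}
output {
  elasticsearch_data_stream {
    hosts => "hostname" # events without data_stream.* fields fall back to logs-generic-default
  }
}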
output {
  elasticsearch_data_stream {
    hosts => "hostname"
    timestamp => "@timestamp"
    type => "metrics"
    dataset => "foo"
    namespace => "bar"
  }
}
Beyond the base settings we can inherit from the existing ES output, there are net new data stream specific settings:
- timestamp (timestamp, required) - the timestamp used for the data stream. Defaults to @timestamp. This should be configurable in Elasticsearch 7.10.
- type (string, optional) - the data stream type (only logs or metrics is allowed) used to construct the data stream at index time. This field does not support hyphens (-). Defaults to logs.
- dataset (string, optional) - the data stream dataset used to construct the data stream at index time. This field does not support hyphens (-). Defaults to generic.
- namespace (string, optional) - the data stream namespace used to construct the data stream at index time. This field does not support hyphens (-). Defaults to default.
- auto_routing (boolean, optional) - automatically routes events by deriving the data stream name using specific event fields with the %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace} format. This setting takes precedence over the type, dataset, and namespace settings, but can fall back to them if any data_stream.* fields are absent. Defaults to true.

Additionally, there are many settings from the existing Elasticsearch output that we could consider removing with this new plugin. This is not an exhaustive list:
- document_type - this is legacy cruft; types in ES are now obsolete.
- action, doc_as_upsert, scripted_upsert, script, script_lang, script_type, script_var_name, version, version_type - prevent any update actions.
- ilm_enabled, ilm_pattern, ilm_policy, ilm_rollover_alias - ILM options are no longer necessary.
- template, template_name, template_overwrite, manage_template - template management on the LS side is no longer necessary.

Logstash often acts as an intermediary for receiving data from other systems like the Elastic Agent and Kafka. For these use cases, Logstash will by default use the data_stream.type, data_stream.dataset, and data_stream.namespace fields to derive the data stream name. This allows events from the Elastic Agent to automatically be routed to the appropriate Elasticsearch data stream when using Logstash in between. This feature can be disabled by configuring the auto_routing setting to false.
Format: %{data_stream.type}-%{data_stream.dataset}-%{data_stream.namespace}
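As a sketch of that intermediary pattern (the beats input and port are illustrative assumptions; the output follows the proposal above):

input {
  beats {
    port => 5044 # Elastic Agent / Beats events arrive here with data_stream.* fields set
  }
}
output {
  elasticsearch_data_stream {
    hosts => "hostname"
    auto_routing => true # default; derive the stream name from the event's data_stream.* fields
  }
}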
Data streams are a Basic feature, so this integration will only be distributed with the default distribution of Logstash.
The primary limitation of data streams is that documents cannot be updated. Logstash users have historically used the existing Elasticsearch output plugin's capabilities to conduct document updates and achieve exactly-once delivery semantics.
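For context, data streams only accept create operations, which is what rules out updates; a sketch of the append-only pattern with the existing output (hosts and index name are illustrative):

output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "logs-generic-default"
    action => "create"       # data streams reject update/delete and plain index actions
    manage_template => false # composable templates are bundled with ES 7.9+
  }
}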
- logs-generic-default is the default data stream for generic data from Logstash and the Elastic Agent. If users express feedback that it's difficult to identify Logstash-sourced data in the shared data stream, we could consider adding a from-logstash tag to the tags ECS base field for events coming from Logstash.
- When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.

@acchen97 Two notes on the above issue:
- data_stream.dataset and not data_stream.name
- host - I know this has been used a lot in the past, but I wonder if we could also change the config name to something ECS compatible?

@ruflin thanks for your notes. I've reconciled the former in the original issue. For the latter, I think we can stick with hosts as it's accurately descriptive and also consistent with how Beats and Agent configure it in the ES output. I'm not sure being ECS compatible is as critical here, given this is a configuration setting rather than an event field.
@mostlyjason FYI.
will there be cloud.id/cloud.auth support?
@enotspe I would think so; I don't see why we would differ from the authentication strategies of the current elasticsearch output.
@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.
@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?
When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.
And the question is more about whether to make this a configurable default behaviour or not, i.e. should we allow the user to disable it for documents that do not come from Agent and do not contain these fields?
And as a follow-up question: if the user sets auto_routing => false and the document contains the data_stream.type, data_stream.dataset, and data_stream.namespace fields, should we overwrite the fields with the plugin-configured values?
@colinsurprenant We have validation in place for the data_stream.* fields in Kibana; we should align on them. cc @jen-huang.
Those fields have validation on the agent side too, to ensure safety with ES index name constraints.
Following the discussion in https://github.com/elastic/kibana/issues/75846, we implemented 20/100/100-byte length restrictions for the type, dataset, and namespace strings, respectively.
@ph @acchen97 @jen-huang So should we be looking into this Future Considerations item right away?
When they are absent, we could have a setting that allows the data_stream.type, data_stream.dataset, and data_stream.namespace fields to be derived from the data stream name and added to the event prior to indexing.
I'm still a bit hesitant on tackling this in the first version. This would only apply to data that is not sent from Agent, and I'm not sure how adding this would impact those use cases yet. Also, it's not clear to me if and how these data_stream.* fields will be used in ES queries and downstream UI components. Perhaps we can wait for user feedback before we decide whether we want to add it. It's typically easier to add features rather than remove them in the future. /cc @jsvd
It is important that we add these fields. The new indexing strategy requires these fields to be there. It is expected that all dashboards / visualizations we build, and hopefully also the ones from the community, will filter on these. It will make the queries, and with them the dashboards, much faster. If the fields are not in line with the indexing strategy, things will break apart.
As we are closing in on the release of the logstash data streams output plugin, discussion on the handling of data_stream.* in the event continues in https://github.com/logstash-plugins/logstash-output-elasticsearch_data_streams/issues/2

@acchen97 For a non-agent use case: we have a multi-tenant strategy where each tenant has its own index, such as datalake-tenant1, datalake-tenant2. We use Logstash to feed data and set the index to the correct tenant. Under the new indexing strategy and this plugin, can we support this model: logs-tenant-dataset, where tenant = the ECS field organization.id?
@Karrade7 @acchen97 Good point. In the current model with auto_routing: true, you could, for example, use mutate filter(s) to set the value of any of the data_stream.type, data_stream.dataset, or data_stream.namespace fields, as shown below.
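A minimal sketch of that approach (the tenant1 value is illustrative):

filter {
  mutate {
    # set the stream coordinates; with auto_routing => true the plugin
    # would send this event to logs-tenant1-default
    replace => {
      "[data_stream][type]"      => "logs"
      "[data_stream][dataset]"   => "tenant1"
      "[data_stream][namespace]" => "default"
    }
  }
}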
But we could also provide string interpolation for the type, dataset and namespace options - that way you could reference the value of any event field. I think this makes sense; it adds even more flexibility.
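That could look something like the following (a sketch only; interpolation support for these options is a proposal here, not implemented yet):

output {
  elasticsearch_data_stream {
    hosts     => "hostname"
    type      => "logs"
    dataset   => "%{[organization][id]}" # per-tenant dataset taken from an event field
    namespace => "default"
  }
}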
@colinsurprenant I think string interpolation and flexibility in general will be important here.
I think even with dataset there will be issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way, but the index to be named differently. A perfect example of this is non-compliant characters in index names. If dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index as "dl-cylance-{{organization.name}}". This did not work, as some organization names had upper and lowercase characters and they would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
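One way to work around that today is to sanitize the value before it reaches the output; a sketch (organization.name mirrors the example above):

filter {
  # derive a dataset from organization.name, then make it index-safe:
  # datasets allow neither uppercase letters nor hyphens
  mutate {
    replace => { "[data_stream][dataset]" => "%{[organization][name]}" }
  }
  mutate {
    lowercase => [ "[data_stream][dataset]" ]
    gsub      => [ "[data_stream][dataset]", "-", "_" ]
  }
}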
@Karrade7 I am not sure I understand your concern correctly; there are 2 things at play here:
1. The type, dataset and namespace options are used primarily to create the index name when auto_routing: false. The index name will always be {type}-{dataset}-{namespace}, except when auto_routing: true, where the event fields [data_stream][type], [data_stream][dataset], [data_stream][namespace] will be used; if one of these fields is missing, the corresponding plugin option will be used.
2. With set_data_stream_fields: true, the event fields [data_stream][type], [data_stream][dataset], [data_stream][namespace] will always be updated by the plugin to match the values used to create the index name.

Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? If so, it would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.
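A sketch of the two modes described above (set_data_stream_fields is a proposed option here, not a released one):

output {
  elasticsearch_data_stream {
    hosts => "hostname"
    # with auto_routing disabled, the options alone determine the
    # target stream: metrics-foo-bar
    auto_routing => false
    type      => "metrics"
    dataset   => "foo"
    namespace => "bar"
    # proposed: rewrite the event's data_stream.* fields so they
    # match the stream the event was written to
    set_data_stream_fields => true
  }
}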
I think even with dataset there will be issues without it. Since dataset is not a universal standard, there will be times when you want the dataset to be set a certain way, but the index to be named differently. A perfect example of this is non-compliant characters in index names. If dataset is uppercase or contains the character "-", it won't index. I ran into this in 7.9 when I tried to use a set processor to set _index as "dl-cylance-{{organization.name}}". This did not work, as some organization names had upper and lowercase characters and they would not index at all. Just giving an example of where unexpected issues can occur and where flexibility in index naming will be useful.
@Karrade7 hyphens are indeed not allowed in the dataset and namespace. @ph are there any restrictions on using uppercase letters in the new indexing strategy?
Are you saying that when not using auto_routing you would want the dataset option to use a value different from the [data_stream][dataset] field value? If so, it would certainly be possible, but probably not advisable, because I believe downstream usage by ES and Kibana will expect the indexed documents' data_stream.type, data_stream.dataset, and data_stream.namespace fields to match the data stream index name.
@colinsurprenant I believe the data_stream.* fields will need to match the data stream name. The data_stream.* fields are constant keywords and my understanding is that the field values will need to be the same across an entire index.
Why should be "type" limited only to logs and metrics? Currently we use the similar naming but we use following options: logs, metrics, monitors (typically up/down monitors, events from heartbeat, ...) and data (real application data, not logs).
@vbohata this is a good question. Ultimately nothing will really prevent someone from having a "custom" data stream type other than "logs" and "metrics", but in the short term these are the only ones that have bundled ES templates and for which some visualizations will exist. The type option might have restrictions when first released, and we might allow arbitrary types in the future; this is still being evaluated.
I'm a bit confused, is this plugin already in 7.10?
After I saw a presentation from @ruflin about data streams, I started digging into how to integrate metricbeat and filebeat with data streams...
I did some extra processing on the Beat and Logstash sides.
Here is what I'm adding to the Beat config:
processors:
  - add_fields:
      fields:
        namespace: default
        type: metrics
      target: data_stream
  - copy_fields:
      fail_on_error: false
      fields:
        - from: event.dataset
          to: data_stream.dataset
      ignore_missing: true
And on the Logstash side I'm doing this:
elasticsearch {
  id => "elasticsearch_stream"
  hosts => 'http://tiny-master:9200'
  index => "%{[data_stream][type]}-%{[data_stream][dataset]}-%{[data_stream][namespace]}"
  manage_template => false
  action => "create"
}
In Kibana, to enable all the templates and dashboards, I just add a fake agent and everything is created.
It seems to work only with metricbeat and filebeat; just some fields sometimes raise shard errors:
"failures": [
{
"shard": 0,
"index": ".ds-metrics-system.service-default-000001",
"node": "wGq5lQ7BSKOYPV9_zkPQ3Q",
"reason": {
"type": "illegal_argument_exception",
"reason": "Field [system.service.state_since] of type [keyword] does not support custom formats"
}
}
]
Maybe this is because I'm not setting all the fields correctly?
With auditbeat, data streams do not seem to work.
@cdino Nice work! You don't need to create a fake Agent; if you go to the Settings of an integration, there is an install button. One thing to keep in mind: there is a chance that we make some breaking changes to the packages compared to the modules. Your error might be related to this; it seems like the format of system.service.state_since might be different. Sounds like this should be a date field?
@cdino What you discovered is that if you know what you are doing, you don't need the new plugin ;-) 👏
Curious to hear what errors you got on the auditbeat side.
@ruflin Thanks! Yes, I will avoid using it in production for now :) but I really like this approach; it will help us a lot in the future.
I will give auditbeat another try, but it seems that there is no data stream mapping for the data_stream.[dataset] related events.
Is the plugin released or not? I did not find a repo or any details. I wanted to check the roadmap for it.