Beats: add ability to generate UUID on documents

Created on 26 Apr 2016 · 10 comments · Source: elastic/beats

Data collected by Beats comes from various sources (log files, packets, etc.) and ultimately makes its way to Elasticsearch. As a best practice, all source data would carry some type of unique ID so that duplicate documents are avoided downstream. However, we know this is not always the case.

Therefore, adding the ability for *beats to place a UUID on documents (regardless of the output used) would greatly simplify pipelines for end users and help them completely avoid the duplicate-document problem when replaying/retrying indexing operations. The goal is for the UUID to be used as the document ID in Elasticsearch (a minimal sketch of the idea follows below). The benefits are as follows:

  • By placing UUIDs on documents at the earliest possible stage of a data pipeline, we avoid duplicates at every stage after it.
  • It also allows for simplified replay logic inside of *beats, since the worst case is that a document is updated with the exact same data.
  • Further, customer data pipelines leveraging Kafka or Logstash can use that UUID for retry, deduplication, and other processing.
Labels: enhancement, libbeat
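To make the proposal concrete, here is a minimal sketch in Go (Beats' implementation language) of stamping a UUID onto each event before it is published. The event type, the field name, and the `github.com/google/uuid` dependency are assumptions for illustration, not Beats internals:

```go
// Illustrative sketch only, not Beats code: attach a UUID to each event at
// the earliest stage of the pipeline, so every later stage can dedup on it.
package main

import (
	"fmt"

	"github.com/google/uuid" // assumed third-party UUID library
)

// event stands in for a beat's key/value event.
type event map[string]interface{}

// withUUID adds a unique ID that downstream stages (Kafka consumers,
// Logstash, the Elasticsearch output) can reuse as the document _id.
func withUUID(e event) event {
	e["uuid"] = uuid.New().String()
	return e
}

func main() {
	e := withUUID(event{"message": "hello"})
	fmt.Println(e["uuid"])
}
```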


All 10 comments

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates? If it's really that simple, we should have done it long ago :-).

In either case, I agree with all your points, we should add this.

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates?

Yep, it should, but I believe v5.0.0 might require the ID to be passed only in the URL. I'll need to check that.
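For reference, the single-document index API does take the ID in the URL path, while the bulk API accepts an `_id` in each action's metadata. A minimal sketch of the mechanism being discussed (endpoint, index name, and ID are placeholders; the action format shown matches recent Elasticsearch versions, which no longer require `_type`):

```go
// Two bulk actions with the same explicit _id: replaying the batch
// re-indexes the same document instead of creating a duplicate.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// The same event sent twice (e.g. a retry): identical _id means the
	// second action overwrites the first document rather than adding one.
	body := bytes.NewBufferString(
		`{"index":{"_index":"logs","_id":"event-uuid-1"}}` + "\n" +
			`{"message":"hello"}` + "\n" +
			`{"index":{"_index":"logs","_id":"event-uuid-1"}}` + "\n" +
			`{"message":"hello"}` + "\n")

	resp, err := http.Post("http://localhost:9200/_bulk", "application/x-ndjson", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // document count grows by one, not two
}
```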

I've been looking at the same thing when using journald and the __CURSOR field for idempotent indexing.
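A sketch of that approach (the hashing scheme and cursor value are illustrative, not something Beats ships): derive the `_id` deterministically from a stable source field such as journald's `__CURSOR`, so re-reading the same journal entries always produces the same document ID.

```go
// Deterministic IDs from a stable source field: same input, same _id,
// which makes re-indexing idempotent.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idFromCursor hashes the cursor so the resulting ID has a fixed length
// no matter how long the cursor string is.
func idFromCursor(cursor string) string {
	sum := sha256.Sum256([]byte(cursor))
	return hex.EncodeToString(sum[:])
}

func main() {
	cursor := "s=abc123;i=4596;b=def456" // hypothetical __CURSOR value
	fmt.Println(idFromCursor(cursor))    // same cursor -> same _id, every run
}
```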

+1

+1
This enhancement would let us stop running Logstash as a shipper solely to use its uuid filter, as mentioned here: https://www.elastic.co/fr/blog/just-enough-kafka-for-the-elastic-stack-part2

Now that the Elastic Stack ingest components support at-least-once delivery guarantees, having the ability to prevent duplicates by adding a unique identifier to each event at the source would be great.

We should try to ensure that the default (if applicable) is an efficient identifier from Elasticsearch's point of view.
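For context, Elasticsearch's own auto-generated IDs are Flake-style, roughly time-ordered values precisely because they index and compress better than fully random UUIDv4s. A rough sketch of an ID with that property (not Elasticsearch's actual scheme):

```go
// A "time-ordered" ID sketch: a millisecond timestamp prefix plus random
// bytes. IDs created close together share a prefix, which is friendlier
// to the index than fully random UUIDs.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// timeOrderedID returns a 16-byte hex ID: the low 6 bytes of the current
// millisecond timestamp (big-endian, so lexicographic order tracks time)
// plus 10 random bytes for uniqueness.
func timeOrderedID() string {
	b := make([]byte, 16)
	ms := uint64(time.Now().UnixMilli())
	for i := 5; i >= 0; i-- {
		b[i] = byte(ms)
		ms >>= 8
	}
	if _, err := rand.Read(b[6:]); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	fmt.Println(timeOrderedID())
	fmt.Println(timeOrderedID()) // sorts after the first
}
```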

+1

Excuse me, for a filebeat --> kafka pipeline, how do I add a random UUID field? What configuration changes are needed?

+1
Thanks!
Will we have the ability to generate a UUID on documents without relying on Logstash?

Different strategies to add document IDs have been implemented for the upcoming releases.
See the related meta issue and referenced PRs for details: https://github.com/elastic/beats/issues/14363
