Beats: add ability to generate UUID on documents

Created on 26 Apr 2016 · 10 comments · Source: elastic/beats

Data collected by Beats comes from various sources (log files, packets, etc.) and ultimately makes its way to Elasticsearch. As a best practice, all source data would carry some type of unique ID so that duplicate documents are avoided downstream. However, we know this is not always the case.

Therefore, adding the ability for *beats to place a UUID on documents (regardless of the output used) would greatly simplify pipelines for end users and help them completely avoid the duplicate-document problem when replaying/retrying indexing operations. The goal is for the UUID to be used as the document ID in Elasticsearch (a minimal sketch of the idea follows below). The benefits are as follows:

  • By placing UUIDs on documents at the earliest possible stage of a data pipeline, we avoid duplicates at every stage after it.
  • It also allows for simplified replay logic inside of *beats, since the worst case is that a document is updated with the exact same data.
  • Further, customer data pipelines leveraging Kafka or Logstash can use that UUID for retry, deduplication, and other processing.
Labels: enhancement, libbeat
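To make the proposal concrete, here is a minimal sketch in Go (Beats' implementation language) of stamping a UUID onto each event before it is published. The event type, the field name, and the `github.com/google/uuid` dependency are assumptions for illustration, not Beats internals:

```go
// Illustrative sketch only, not Beats code: attach a UUID to each event at
// the earliest stage of the pipeline, so every later stage can dedup on it.
package main

import (
	"fmt"

	"github.com/google/uuid" // assumed third-party UUID library
)

// event stands in for a beat's key/value event.
type event map[string]interface{}

// withUUID adds a unique ID that downstream stages (Kafka consumers,
// Logstash, the Elasticsearch output) can reuse as the document _id.
func withUUID(e event) event {
	e["uuid"] = uuid.New().String()
	return e
}

func main() {
	e := withUUID(event{"message": "hello"})
	fmt.Println(e["uuid"])
}
```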


All 10 comments

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates? If it's really that simple, we should have done it long ago :-).

In either case, I agree with all your points, we should add this.

@djschny would passing a UUID as the _id field to Elasticsearch get rid of duplicates?

Yep, it should, but I believe v5.0.0 might require the ID to be passed only in the URL. I'll need to check that.
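For reference, the single-document index API does take the ID in the URL path, while the bulk API accepts an `_id` in each action's metadata. A minimal sketch of the mechanism being discussed (endpoint, index name, and ID are placeholders; the action format shown matches recent Elasticsearch versions, which no longer require `_type`):

```go
// Two bulk actions with the same explicit _id: replaying the batch
// re-indexes the same document instead of creating a duplicate.
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

func main() {
	// The same event sent twice (e.g. a retry): identical _id means the
	// second action overwrites the first document rather than adding one.
	body := bytes.NewBufferString(
		`{"index":{"_index":"logs","_id":"event-uuid-1"}}` + "\n" +
			`{"message":"hello"}` + "\n" +
			`{"index":{"_index":"logs","_id":"event-uuid-1"}}` + "\n" +
			`{"message":"hello"}` + "\n")

	resp, err := http.Post("http://localhost:9200/_bulk", "application/x-ndjson", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // document count grows by one, not two
}
```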

I've been looking at the same thing when using journald and the __CURSOR field for idempotent indexing.
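A sketch of that approach (the hashing scheme and cursor value are illustrative, not something Beats ships): derive the `_id` deterministically from a stable source field such as journald's `__CURSOR`, so re-reading the same journal entries always produces the same document ID.

```go
// Deterministic IDs from a stable source field: same input, same _id,
// which makes re-indexing idempotent.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// idFromCursor hashes the cursor so the resulting ID has a fixed length
// no matter how long the cursor string is.
func idFromCursor(cursor string) string {
	sum := sha256.Sum256([]byte(cursor))
	return hex.EncodeToString(sum[:])
}

func main() {
	cursor := "s=abc123;i=4596;b=def456" // hypothetical __CURSOR value
	fmt.Println(idFromCursor(cursor))    // same cursor -> same _id, every run
}
```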

+1

+1
This enhancement would let us stop running Logstash as a shipper solely to use its uuid filter, as mentioned here: https://www.elastic.co/fr/blog/just-enough-kafka-for-the-elastic-stack-part2

Now that the Elastic Stack ingest components support at-least-once delivery guarantees, having the ability to prevent duplicates by adding a unique identifier to each event at the source would be great.

We should try to ensure that the default (if applicable) is an efficient identifier from Elasticsearch's point of view.
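For context, Elasticsearch's own auto-generated IDs are Flake-style, roughly time-ordered values precisely because they index and compress better than fully random UUIDv4s. A rough sketch of an ID with that property (not Elasticsearch's actual scheme):

```go
// A "time-ordered" ID sketch: a millisecond timestamp prefix plus random
// bytes. IDs created close together share a prefix, which is friendlier
// to the index than fully random UUIDs.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// timeOrderedID returns a 16-byte hex ID: the low 6 bytes of the current
// millisecond timestamp (big-endian, so lexicographic order tracks time)
// plus 10 random bytes for uniqueness.
func timeOrderedID() string {
	b := make([]byte, 16)
	ms := uint64(time.Now().UnixMilli())
	for i := 5; i >= 0; i-- {
		b[i] = byte(ms)
		ms >>= 8
	}
	if _, err := rand.Read(b[6:]); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	fmt.Println(timeOrderedID())
	fmt.Println(timeOrderedID()) // sorts after the first
}
```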

+1

Excuse me, for a filebeat --> kafka pipeline, how do I add a random UUID field? What configuration changes are needed?

+1
Thanks!
Will we have the ability to generate a UUID on documents without relying on Logstash?

Different strategies to add document IDs have been implemented for the upcoming releases.
See the related meta issue and referenced PRs for details: https://github.com/elastic/beats/issues/14363
