Logstash: Dead letter queue, resurrected

Created on 14 Feb 2017 · 5 comments · Source: elastic/logstash

Dead Letter Queue (DLQ) Feature

Ability to shunt poisoned or otherwise unprocessable events from the running pipeline to a new destination for further processing. This feature allows the existing pipeline to continue processing events without getting stuck on a bad event.

Phase 1 Goals:

Storing dead events

Goals from the producer side, which is a Logstash instance creating dead letter queue events.

  • Production Logstash shouldn't be stopped to consume or manage the DLQ.
  • DLQ should be local to the Logstash instance. This means we should have a file-based implementation to which events get delivered.
  • The DLQ should rely on the OS-managed flush mechanism for durability.
  • DLQ has an inbuilt rotation policy to manage its file size. Without rotation, file size can get too big and would need manual curation. Once the threshold has been reached, the DLQ would behave like a ring buffer. Older events would be overwritten to accommodate new events.
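To make the ring-buffer behavior concrete, here is a minimal sketch (not the actual Logstash implementation) of a segment-based DLQ directory that drops its oldest segment files once a size threshold is exceeded. The class name `SegmentedDlqWriter` and the `max_bytes`/`segment_bytes` parameters are invented for this illustration:

```ruby
require "fileutils"

# Hypothetical size-bounded DLQ writer: events append to segment files,
# and once the directory exceeds max_bytes the oldest segments are deleted,
# so the queue behaves like a ring buffer.
class SegmentedDlqWriter
  def initialize(dir, max_bytes:, segment_bytes:)
    @dir = dir
    @max_bytes = max_bytes        # total size budget for the DLQ directory
    @segment_bytes = segment_bytes # roll to a new segment past this size
    @seq = 0
    FileUtils.mkdir_p(dir)
  end

  # Append one serialized event, rotating segments and dropping the
  # oldest segment(s) when the directory exceeds its budget.
  def write(event_json)
    path = current_segment
    File.open(path, "a") { |f| f.puts(event_json) }
    @seq += 1 if File.size(path) >= @segment_bytes
    drop_oldest while total_size > @max_bytes && segments.size > 1
  end

  def segments
    Dir[File.join(@dir, "segment-*.log")].sort
  end

  private

  def current_segment
    File.join(@dir, format("segment-%06d.log", @seq))
  end

  def total_size
    segments.sum { |p| File.size(p) }
  end

  def drop_oldest
    File.delete(segments.first) # oldest events are discarded first
  end
end
```

Deleting whole segments rather than individual events keeps the overwrite path cheap: no rewriting of existing files is needed when the threshold is hit.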

Processing dead events

Once the DLQ is filled with events, users should be able to query, analyze and process events. What a user does with the dead event depends on the original error. For example, a mapping error in ES would create a DLQ entry. The user can then choose to remove the field causing the mapping issue and re-index the _cleaned_ event to ES.

  • A new DLQ input should be created to process events in DLQ. Multiple concurrent consumers (DLQ inputs) should be supported.
  • Users of the DLQ input should only care about the top-level DLQ path and not the underlying folder structure. This input should work even if the original Logstash that created the dead events is shut down.
  • Each DLQ input should manage its own offsets (like SinceDB). It should support disabling offset management so users can stream events from the DLQ for a _dry run_, without keeping states.
  • The DLQ input supports sequential access to read from the DLQ for phase 1.
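The SinceDB-style offset management in the list above can be sketched as follows. This is an assumption-laden illustration, not the plugin's real code; `DlqOffsets` and its `persist:` flag are made up, with `persist: false` modeling the stateless "dry run" mode:

```ruby
# Hypothetical sincedb-style offset store for a DLQ consumer.
# Each segment file maps to the last consumed offset; with persist: false
# commits stay in memory only, so a dry run leaves no state behind.
class DlqOffsets
  def initialize(path, persist: true)
    @path = path
    @persist = persist
    @offsets = File.exist?(path) ? load : {}
  end

  def get(segment)
    @offsets.fetch(segment, 0)
  end

  def commit(segment, offset)
    @offsets[segment] = offset
    save if @persist # dry run: do not write state to disk
  end

  private

  def load
    File.readlines(@path).to_h { |l| k, v = l.split("\t"); [k, v.to_i] }
  end

  def save
    File.write(@path, @offsets.map { |k, v| "#{k}\t#{v}" }.join("\n"))
  end
end
```

Keeping offsets per consumer (rather than in the DLQ itself) is what allows multiple concurrent DLQ inputs to read the same files independently.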

Sample configuration

logstash.yml

feature.dead_letter_queue: true
path.dead_letter_queue: /path/to/dlq

DLQ input

input {
   dlq {
       path => "/path/to/dlq"
       timestamp => "2017-04-04T23:40:37"
   }
}

filter {}
output {
   elasticsearch {}
}
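As a concrete illustration of the mapping-error workflow described earlier, a reprocessing pipeline might drop the offending field before re-indexing the cleaned event. The field name bad_field is hypothetical; the mutate filter's remove_field option is standard Logstash:

input {
   dlq {
       path => "/path/to/dlq"
   }
}

filter {
   mutate {
       remove_field => ["bad_field"]   # field that caused the ES mapping error
   }
}

output {
   elasticsearch {}
}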

Details here: https://github.com/logstash-plugins/logstash-input-dead_letter_queue/blob/master/lib/logstash/inputs/dead_letter_queue.rb#L5


Future Goals

  • Support retention with dead_letter_queue.size.threshold: 24h
  • Support range query in the DLQ input
  • Support a direct ES output
  • Plugins should be able to publish dead events to separate "topics/tags". This would allow for easy consumption later on, based on filtering by the type of plugin. Each plugin in Logstash will create a separate dead letter entry grouped by plugin name and stage. This physical separation on the file system allows for faster query/filtering when processing events from the DLQ. Sequential access (having all events in one single file) would work but would make processing slower on the consumer side. Rotation threshold is applied per tag, not globally.

Directory Structure (Future)

path.dlq
   |
   pipeline_id
      |
      outputs_elasticsearch
      inputs_kafka

where outputs_elasticsearch contains events DLQ'd from ES outputs from the original production LS.
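Under that future layout, a consumer could filter by plugin "topic" simply by listing directories. The layout and the helper below are assumptions based on the structure shown above, not an existing API:

```ruby
require "pathname"

# List DLQ topic directories for one pipeline, optionally filtered by a
# prefix such as "outputs_" or a full name like "outputs_elasticsearch".
# Directory layout assumed: <dlq_root>/<pipeline_id>/<stage>_<plugin>/
def dlq_topics(dlq_root, pipeline_id, filter: nil)
  base = Pathname(dlq_root).join(pipeline_id)
  topics = base.children.select(&:directory?).map { |p| p.basename.to_s }
  filter ? topics.select { |t| t.start_with?(filter) } : topics.sort
end
```

This is the payoff of the physical separation: selecting "all events dead-lettered by outputs" is a directory listing, not a scan of every event.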

enhancement


All 5 comments

/cc @acchen97 @jordansissel @colinsurprenant

Looks good @suyograo, thanks for writing this up. A few comments below:

DLQ has an inbuilt rotation policy to manage its file size. Without rotation, file size can get too big and would need manual curation. Once the threshold has been reached, the DLQ would behave like a ring buffer. Older events would be overwritten to accommodate new events.

I see one new setting for this and I'm wondering if we need two: one for when the file rotates (e.g. every day), and another for a maximum disk size allocated to the DLQ (e.g. 16 GB) that would trigger the ring buffer behavior when hit. This would help mitigate out-of-disk-space scenarios.

Plugins should be able to publish dead events to separate "topics/tags". This would allow for easy consumption later on, based on filtering by the type of plugin. Each plugin in Logstash will create a separate dead letter entry grouped by plugin name and stage. This physical separation on the file system allows for faster query/filtering when processing events from DLQ. Sequential access (having all events in one single file) would work but will make processing slower on the consumer side. Rotation threshold is applied per tag, not globally.

I'm glad we're able to separate these by topics. Segregating them by stage and plugin name I think is a good approach, and will provide enough granularity for easier reprocessing. Can we also enumerate what metadata we're storing in the DLQ? i.e. timestamp, error condition, etc.

@acchen97 Why would we rotate files by time? This sounds like something a user would probably want to configure, and I wonder if we even want to have this feature? I was hoping we could have something simpler like "The DLQ should only keep as much data as satisfies {retention policy}" where the retention policy can be something based on size of the dlq, age of things in the queue, etc. Specifics of "rotate files" are into the weeds as far as what kind of things I want users to concern themselves with.

@jordansissel since we can query by tags now, abstracting the rotation strategy from users would be ok.

In terms of the retention policy, we should have the option of constraining by both time and/or disk size. We can put the disk usage threshold as a roadmap item if we think it's out of scope of phase 1.

First version of the DLQ was merged into master/5.x (5.5)

reference PR that added the API: https://github.com/elastic/logstash/pull/6894
