Telegraf: Output buffer persistence

Created on 7 Mar 2016 · 27 Comments · Source: influxdata/telegraf

To avoid dropping data from the output buffer when the Telegraf service restarts, when connectivity with consumers is lost for an extended period, or when any other unexpected incident occurs, there should be an option to persist the output buffer on disk.

Enabling such a feature will introduce I/O dependencies for Telegraf, so it should be optional and most probably disabled by default. Persistence should be enabled on a per output plugin basis, depending on whether dropping data is critical or not.

Proposed config file sample:

[agent]
max_buffer_limit = 1000

[[outputs.influxdb]]
...
persist_buffer = true

[[outputs.graphite]]
...
persist_buffer = false

@sparrc thoughts?

areagent feature request pcore capability

kostasb

👍35 🎉1

Most helpful comment

Hey, I'm going to reopen this because I don't think #4938 addresses this issue. Slow outputs cause metrics to be dropped without blocking inputs. This ticket is asking for metric durability for outputs.

This request isn't unreasonable, it just hasn't been a high priority. It might be helpful to take a minute to summarize my thoughts on this, some of the concerns around how to address it, and what should be kept in mind when addressing it. I guess you all could use an update after 4 years.

Telegraf currently tries its best to balance keeping up with the metric flow and storing metrics that haven't been written to an output yet, up to the metric_buffer_limit; past that limit, we intentionally drop metrics to avoid OOM scenarios. At some point, it's reasonable to think Telegraf will never catch up and it should just do its best from that point on.
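
The drop-on-overflow behavior described above can be sketched roughly as follows. This is a minimal illustration, assuming the oldest metrics are the ones discarded once the limit is reached. `BoundedMetricBuffer` and its methods are hypothetical names, not Telegraf's internal API (Telegraf itself is written in Go).

```python
from collections import deque


class BoundedMetricBuffer:
    """In-memory buffer that keeps at most `limit` metrics, dropping the oldest."""

    def __init__(self, limit):
        # A deque with maxlen discards from the left when a new item is appended.
        self._metrics = deque(maxlen=limit)
        self.dropped = 0

    def add(self, metric):
        if len(self._metrics) == self._metrics.maxlen:
            self.dropped += 1  # the oldest metric is about to be discarded
        self._metrics.append(metric)

    def take_batch(self, size):
        """Remove and return up to `size` metrics, oldest first."""
        batch = []
        while self._metrics and len(batch) < size:
            batch.append(self._metrics.popleft())
        return batch


# Hypothetical usage: five metrics arrive while the output is down, limit is 3.
buf = BoundedMetricBuffer(limit=3)
for value in range(5):
    buf.add({"name": "cpu", "value": value})
```

Memory use stays bounded no matter how long the output is unavailable, at the cost of losing the oldest data. That trade-off is what this issue asks to make configurable.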

A review of the concerns at play:

- backpressure: slow outputs should slow down inputs so we don't unnecessarily ingest too much and then drop it
- slow consumers: slow outputs should not be allowed to block faster outputs. If they fall too far behind they should start discarding old (less relevant) metrics.
- out of order: metrics should generally be delivered in order, but this becomes either impossible or undesirable once you fall too far behind (the most relevant data is the latest).
- user preference: some Telegraf users have preferences for whether they want to prioritize durability or performance, and whether they're willing to sacrifice metric order to do so.
- message-tracking: Telegraf currently uses a feature called "metric tracking" internally for scenarios where certain inputs (like message queues) want some sort of durability; we don't ACK the message until it's written to the output. This also doubles as a back-pressure feature, as you can limit the number of un-ACKed metrics. Not all inputs have this turned on.
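
As a rough illustration of the metric-tracking idea (ACK only after the output write, with a cap on un-ACKed messages acting as back-pressure), here is a small sketch. `TrackingInput`, `max_undelivered`, and `on_written` are hypothetical names chosen for this example; it does not reproduce Telegraf's actual tracking code.

```python
class TrackingInput:
    """Hypothetical queue consumer that ACKs only after the output write succeeds."""

    def __init__(self, source, max_undelivered):
        self.source = list(source)          # stands in for a message queue
        self.max_undelivered = max_undelivered
        self.in_flight = {}                 # message id -> payload awaiting delivery
        self.acked = []

    def poll(self):
        """Read messages until the un-ACKed limit is reached (back-pressure)."""
        delivered = []
        while self.source and len(self.in_flight) < self.max_undelivered:
            msg_id, payload = self.source.pop(0)
            self.in_flight[msg_id] = payload
            delivered.append((msg_id, payload))
        return delivered

    def on_written(self, msg_id):
        """Called once the output confirms the write; only now is the message ACKed."""
        del self.in_flight[msg_id]
        self.acked.append(msg_id)


inp = TrackingInput([(1, "a"), (2, "b"), (3, "c")], max_undelivered=2)
first = inp.poll()      # reads two messages, then stops at the limit
stalled = inp.poll()    # nothing has been confirmed yet, so nothing is read
inp.on_written(1)       # the output confirms message 1
resumed = inp.poll()    # one slot is free again
```

Unread messages stay in the upstream queue until there is room, so in this sketch durability is delegated to the queue rather than to the agent.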

It's not entirely easy to weave _durability_ into that. There are a few potential options for what to implement:

1. best-scenario "durability": on shutdown Telegraf saves the output buffers to disk before quitting. This isn't real durability, but it might be what some users want.
2. real output durability: Telegraf writes all output-buffered messages to disk (but no durability for non-output messages). One could imagine non-trivial cost and non-trivial implementation here.
3. real full-telegraf-stack durability: Telegraf writes all incoming messages to disk, all transformations to disk, and only removes them after it's sure they're accepted downstream, forcing backpressure everywhere to ensure it doesn't over-consume metrics in flight.

This issue describes #2. I don't think #3 is generally all that useful for metric data, and I can't help thinking that #1 will cause more problems than it solves.
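
To make the first option concrete, here is a minimal sketch of "save on shutdown, reload on start". The function names and the JSON file format are assumptions for illustration only. Nothing is saved if the process is killed or the machine loses power, which is why this does not count as real durability.

```python
import json
import os
import tempfile


def save_buffer(metrics, path):
    """Write the buffered metrics to `path` atomically (temp file, then rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(metrics, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk before renaming
    os.replace(tmp_path, path)


def load_buffer(path):
    """Return the metrics saved by a previous run and delete the file."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        metrics = json.load(f)
    os.remove(path)
    return metrics
```

A graceful shutdown handler would call `save_buffer` once per output, and startup would call `load_buffer` before accepting new metrics.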

ssoroka on 28 Jul 2020

👍7

All 27 comments

Is there a plan for this feature? @sparrc

joezhoujinjing on 22 Nov 2016

nope, sorry, it will be assigned a milestone when there is

sparrc on 22 Nov 2016

I use Kafka for this and then use Telegraf to read from and write to it. Kafka is great as a store to persist data to, makes that data available to others, and lets you set custom retention policies on topics. As Kafka is 'free' under the Apache License, why rewrite something when an excellent solution already exists? Telegraf supports both input from and output to Kafka, and Kafka is a very versatile and scalable product for this kind of purpose.

biker73 on 9 Jan 2017

Kafka still doesn't solve the problem when Kafka itself becomes an issue. There should be some way to enable on-disk persistence (with some limit) so that data isn't lost in the event that an output becomes temporarily unavailable.

bfgoodrich on 24 May 2017

👍5

+1 for persistent buffers, this is a very useful feature.

kt97679 on 6 Apr 2018

Elastic just added this capability to Beats. https://github.com/elastic/beats/pull/6581

Just noting here as maybe parts of their implementation could be useful.

voiprodrigo on 13 Apr 2018

Anything planned for this?

voiprodrigo on 28 Sep 2018

Jaeyo on 28 Sep 2018

Maybe for 2.0? :)

voiprodrigo on 7 Feb 2019

Maybe. This is not high priority right now and requires a substantial amount of design work. One aspect that has changed is that Telegraf will now only acknowledge messages from a queue after they have been processed (sent from all outputs or filtered), so it should be possible to use a queue to transfer messages durably with Telegraf.

danielnelson on 7 Feb 2019

👍4

Any suggestions on picking a simple single instance message queue?

PWSys on 21 Jan 2020

@PWSys: I briefly ran some tests with the following setup:
Data --> telegraf --> RabbitMQ --> telegraf --> influxdb
using the AMQP input and output plugins.

It worked, but I decided not to use it because it adds too much complexity.
Since all your data is stored in RabbitMQ, you need to configure and operate it properly. This was quite a challenge for me since I had never used RabbitMQ before. Maybe you have more experience with it.

See RabbitMQ config and persistence.

markusr on 23 Jan 2020

@markusr Thanks for the info!

I was also looking at this, but with a single instance of Kafka instead. It can be deployed fairly simply as a container, but like you, I question the complexity and, ultimately, whether or not it will decrease the system's resiliency.

PWSys on 27 Jan 2020

Taken care of by #4938

darinfisher on 28 Jul 2020

A buffer to disk, much like the persistent queues in Logstash, would be great. I run an ISP, and when a tower goes down I rely heavily on backfill on my backbone dishes to save the day. The issue is that when the downtime is too long, I run out of memory or lose metrics.

I think there should be a disk metric buffer option that only starts writing to disk after the in-memory metric buffer is full. Writing to memory until that limit is hit can help avoid disk-related slowdowns or thrashing of the eMMC, in the case of my setup.

Looking back in the thread this does look like a feature people are looking for.

mrdavidaylward on 12 Aug 2020

I think there's a balance that could be struck here: best-effort storing of metrics that don't fit in the buffer, maybe with some kind of modified tail to read in the records from disk. inputs.tail has backpressure built into it, so it will naturally not get ahead of itself (it will avoid consuming too much and avoid dropping metrics).

Based on that, a potential solution could be:

- add agent.metric_buffer_overflow_strategy and default it to drop (current behavior)
- support agent.metric_buffer_overflow_strategy = "disk"
  - (Future support for "backpressure" could be interesting)
  - "disk" would write metrics that don't fit in the buffer to disk
  - support agent.metric_buffer_overflow_path = "/some/filesystem/path"
- add support to inputs.tail for renaming/deleting/moving files after Telegraf has processed them
  - the only snag here is how to figure out that the file is finished being written to? Last write timestamp is older than x, where x is configurable?
  - require the use of inputs.tail to process the backlog. Could maybe do this transparently based on agent config?

Will think this over and run it past the team.
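
The overflow strategy proposed above could behave roughly like the sketch below. The `strategy` and `overflow_path` parameters mirror the option names proposed in the comment (`metric_buffer_overflow_strategy`, `metric_buffer_overflow_path`); the class itself, the line-delimited JSON file format, and the choice to spill the incoming metric rather than the oldest one are assumptions made for this example, not an actual Telegraf implementation.

```python
import json
import os
import tempfile


class OverflowBuffer:
    """Memory buffer that spills metrics that do not fit to a line-delimited file."""

    def __init__(self, limit, strategy="drop", overflow_path=None):
        self.limit = limit
        self.strategy = strategy            # "drop" (current behavior) or "disk"
        self.overflow_path = overflow_path
        self.memory = []
        self.dropped = 0

    def add(self, metric):
        if len(self.memory) < self.limit:
            self.memory.append(metric)
        elif self.strategy == "disk":
            # Append one JSON document per line so the file can be tailed later.
            with open(self.overflow_path, "a") as f:
                f.write(json.dumps(metric) + "\n")
        else:
            self.dropped += 1

    def read_backlog(self):
        """Read spilled metrics back in write order and remove the file."""
        if not self.overflow_path or not os.path.exists(self.overflow_path):
            return []
        with open(self.overflow_path) as f:
            backlog = [json.loads(line) for line in f if line.strip()]
        os.remove(self.overflow_path)
        return backlog


# Hypothetical usage: a limit of 2 with four incoming metrics spills two to disk.
path = os.path.join(tempfile.mkdtemp(), "overflow.jsonl")
buf = OverflowBuffer(limit=2, strategy="disk", overflow_path=path)
for value in range(4):
    buf.add({"name": "cpu", "value": value})
```

In the proposal, reading the backlog would be delegated to inputs.tail so that its built-in backpressure applies; `read_backlog` here only stands in for that step.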

ssoroka on 13 Aug 2020

👍4

Connecting this issue: https://github.com/influxdata/telegraf/issues/2679
When something changes for an output that requires a config reload, maybe on SIGHUP, buffers are written to disk, then immediately processed with the new config. Maybe there is a new config option for this?

In addition to path, you might need some of the other options from the file output for limiting size and/or rotation. https://github.com/influxdata/telegraf/tree/master/plugins/outputs/file#configuration

Is the behavior to store metrics in a memory queue and "flush" those metrics to disk once the limit has been hit? Then continue filling the in memory queue again? When the connection is restored, the process is reversed until all files are processed? File(s) would be processed and removed once it is confirmed a successful response from the output.

I assume there would be one file per output plugin, similar to one buffer per output we have today? and some naming convention for duplicate output configs (two influxdb outputs for example).

russorat on 13 Aug 2020

👍1

@darinfisher This sounds like it's overloaded, and wouldn't ever catch up? Would be interested to see if the metric input rate is spiky.

ssoroka on 13 Aug 2020

@russorat

> Is the behavior to store metrics in a memory queue and "flush" those metrics to disk once the limit has been hit?

Sort of. Right now if the queue is full we drop the message. It'd be easy enough to redirect that message to disk.

> Then continue filling the in memory queue again?

This would always be the default.

> When the connection is restored, the process is reversed until all files are processed?

I explain this below.

> File(s) would be processed and removed once it is confirmed a successful response from the output.

Yes.

We can easily write to disk when an output buffer is full (maybe even via outputs.file). inputs.tail supports backpressure, so it could be used to read the files back in without dropping metrics. Essentially these things would always run and only metrics that don't fit in memory would be routed to this disk-buffered loop.

Some challenges:

- trying to do it without running out of disk space.
- multiple output plugins can have a copy of the same metric. Some could succeed while others fail. This takes fixed memory in process, but you require one copy per output on disk (and on reload) if multiple outputs fail. This might be okay. If you have 20 outputs that all go down, expect the metrics on disk to take up 20x more space than they did in memory.

Can't think of anything else, so I guess that means it will work great.

ssoroka on 13 Aug 2020

I'm leaning more towards making this an input plugin problem. It's up to the input plugin to either apply backpressure to its source, or store locally. So from the rest of Telegraf, it's applying the back pressure to the input plugin and then the input plugin hides whether it is actually applying a back pressure or just writing to disk and it will later on read back from disk.

Making it a problem on the output side seems more complicated than letting the input side take care of it.

barbaranelson on 14 Aug 2020

👍3

I mean, running two Telegrafs, one writing to file only and the other reading and sending, would work. lol, one of those "woke up at 2 am and thought this was a good idea" moments. Although it is kind of a smart stupid idea ;) besides all the flaws..

mrdavidaylward on 14 Aug 2020

A couple of things to keep in mind:

- There are some challenges around multiple outputs that make dealing with this at the input side problematic: one output could deliver and another output fails.
- Ideally we should avoid processors running on the same metric twice (though we sort of have this already with aggregators); while generally not harmful, it's a waste of resources. Actually, this can be a huge problem for math type processors that expect to run only once (x * 10 is not idempotent). Running them twice will give erroneous results (opened new bug #7993).
- Under backpressure, when one output stops, all outputs stop receiving metrics from all inputs until the downed output comes back up. Generally this trait is undesirable, and bugs have been opened in the past to eliminate this kind of stop-processing-and-wait behavior. With backpressure turned on for an input, that behavior is unavoidable for that input and all its downstream outputs. An output buffer mitigates this: rather than defaulting right to backpressure and forcing everything to slow down, you spool to disk and the flow keeps going (at least until the disk is full).
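
The idempotency concern can be seen with a tiny example. `scale_by_ten` is a hypothetical stand-in for a math-type processor; the point is only that replaying an already-processed metric through it changes the result.

```python
def scale_by_ten(metric):
    """Hypothetical math processor: multiplies the field value by 10."""
    return {**metric, "value": metric["value"] * 10}


original = {"name": "temperature", "value": 2.5}
once = scale_by_ten(original)   # the intended result
twice = scale_by_ten(once)      # what a replayed metric would look like
```

If metrics are spooled after processors have run and then re-enter the pipeline through an input, they take the `twice` path. This is one reason the output side is the safer place to persist them.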

ssoroka on 16 Aug 2020

👍3

Ideally, this would be a per-output memory/disk buffer, so that when issues occur writing to one output the system remains operational for all other outputs, and the buffer writes to disk until that particular output becomes operational again. I would think there would be some disk high watermark that, once reached, would cause messages for that output to be dropped instead of written to the disk buffer. I would also think a global max memory threshold and a max disk threshold would make sense for the combined per-output buffers. Once the memory buffer is full, it would spill over to the disk buffer.
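
A sketch of the per-output idea follows, under these assumptions: each output has its own spool, the memory tier fills first, the disk tier is capped by a byte-based high watermark, and anything past that is dropped. `OutputSpool` is a hypothetical name, and the "disk" tier is simulated with a list to keep the example short.

```python
class OutputSpool:
    """Per-output buffer: memory first, then disk, then drop past the watermark."""

    def __init__(self, memory_limit, disk_limit_bytes):
        self.memory_limit = memory_limit
        self.disk_limit_bytes = disk_limit_bytes
        self.memory = []
        self.disk = []          # stands in for a spool file on disk
        self.disk_bytes = 0
        self.dropped = 0

    def add(self, line):
        size = len(line.encode("utf-8"))
        if len(self.memory) < self.memory_limit:
            self.memory.append(line)
        elif self.disk_bytes + size <= self.disk_limit_bytes:
            self.disk.append(line)
            self.disk_bytes += size
        else:
            self.dropped += 1   # disk high watermark reached

    def drain(self):
        """Return everything buffered, memory (older) first, then disk."""
        lines = self.memory + self.disk
        self.memory, self.disk, self.disk_bytes = [], [], 0
        return lines


spools = {"influxdb": OutputSpool(2, 20), "graphite": OutputSpool(2, 20)}
healthy = {"graphite"}          # assumption: the influxdb output is unreachable
sent = []
for i in range(6):
    line = f"cpu v={i}"         # 7 bytes per line
    for name, spool in spools.items():
        spool.add(line)         # every output gets its own copy of the metric
        if name in healthy:
            sent.extend(spool.drain())
```

The healthy output keeps flowing while only the failing output's spool grows. Because every output holds its own copy, disk usage scales with the number of outputs that are down at the same time.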

bfg111 on 17 Aug 2020

👍1

I think this is a very useful feature to avoid data loss for critical data and still keep a simple and robust data pipeline. Is there any plan to include this?

I have the following scenario: a low-spec hardware appliance at the edge collecting metrics in InfluxDB that needs to push the data to a central server. The network connection is intermittent and the hardware appliance may be restarted. It would be good to have an option in Telegraf to retain unsent data after a loss of communication or a restart of the device. The data are critical and must be retained. Also, due to the low device specs (60GB disk, 4GB RAM, 2 CPU cores), it cannot run Apache Kafka. One would need to investigate other options (RabbitMQ or other) to get this capability, and it would be nice to avoid adding more components into the mix.

rightkick on 1 Oct 2020

👍1

Ended up adding RabbitMQ into the mix to ensure data persistence for the Telegraf buffer.

rightkick on 16 Nov 2020

I agree it's important, especially when using an external store DB such as influxcloud. A local DC can have connectivity issues, so why lose metrics or manage external components?