I was playing recently with metrics exposed by fluentbit and I have a few notes:
Help usually is not very important (though it would be nice to have), but type is very important metadata that is required for metrics processing.
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.51843309241e+09
sounds reasonable to have, I will make sure to fix it.
I will add it.
This is optional according to the Prometheus spec:
metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
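A line that matches this grammar, with labels and the optional timestamp (milliseconds since epoch), is the sample from the Prometheus docs:
http_requests_total{method="post",code="200"} 1027 1395066363000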
I didn't know about optional timestamp, thanks for the pointer.
In fact, timestamps should only be used in very, very rare cases. fluent-bit should certainly not add timestamps. Agreeing on help/type and startup time.
I think it would also make sense to add buffer/queue depth as @Quentin-M suggested in #233. While you can compare ingress/egress to see whether it can keep up, you can't verify that it's not slowly backlogging, so a queue depth/buffer gauge is required.
I'd also add a message processing duration histogram (https://prometheus.io/docs/concepts/metric_types/#histogram), which will help to spot intermittent issues with the outputs (e.g. it's keeping up, but processing for a few messages takes way longer than for others, etc.).
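As a rough sketch of how those two could look in the exposition format (the metric names and values here are hypothetical, not something fluent-bit exposes today):
# TYPE fluentbit_output_buffer_depth gauge
fluentbit_output_buffer_depth{name="stdout.0"} 42
# TYPE fluentbit_output_proc_duration_seconds histogram
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="0.005"} 120
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="0.05"} 150
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="+Inf"} 153
fluentbit_output_proc_duration_seconds_sum{name="stdout.0"} 1.37
fluentbit_output_proc_duration_seconds_count{name="stdout.0"} 153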
I ran into an issue due to the missing metric types. The Datadog agent will ignore metrics that do not have a type associated with them. It would be great to get those added.
+1 for the queue depth and processing time metrics as well. I've seen cases where huge surges of log events can cause issues.
Please add Hostname/Nodename to the metrics for an easier way to tell which node (in Kubernetes) actually sent the logs. This makes it easier to quickly detect which node might be spamming the logs.
@Zenlil I don't agree with that approach.
Fluentbit shouldn't know where it runs. It's the responsibility of the metric collector to inject the correct information about the fluentbit instance.
Has there been any progress on this? Specifically adding #HELP/#TYPE lines. Fluent Bit metrics are ignored by Stackdriver Monitoring without them.
@Zenlil Agreeing with @loburm, this should come from the kubernetes SD. If you use the example prometheus config for kubernetes, you will have node labels on your metrics.
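For reference, a minimal sketch of such a scrape config (the job name and target label are just examples) that copies the node a pod runs on into a node label:
scrape_configs:
  - job_name: fluent-bit
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the node the pod is scheduled on into a "node" label
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node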
@edsiper Do you have an estimate for when the help/type lines will be added?
Regarding point 1, is it correct to say that every Prometheus metric currently exposed is of type counter in regard to this classification?
Yes, it's correct.
Actually I think that this issue can be closed, as points 1 and 2 have been resolved by https://github.com/fluent/fluent-bit/pull/1456 and are part of the 1.3 release.
thanks @loburm . this is already in place:
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="cpu.0"} 6894 1576073944888
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="cpu.0"} 18 1576073944888
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0 1576073944888
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 6511 1576073944888
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 17 1576073944888
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1576073944888
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0 1576073944888
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1576073926
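Since everything except process_start_time_seconds is a counter, the usual way to consume these in Prometheus is with rate(); for example (the 5m window is just an illustration):
# per-output throughput in records/s over the last 5 minutes
rate(fluentbit_output_proc_records_total[5m])
# outputs currently producing errors
rate(fluentbit_output_errors_total[5m]) > 0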