I was playing recently with metrics exposed by fluentbit and I have a few notes:
Help usually is not very important (though it would be nice to have), but type is very important metadata that is required for metrics processing.
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.51843309241e+09
sounds reasonable to have, I will make sure to fix it.
I will add it.
This is optional according to the Prometheus spec:
metric_name [
"{" label_name "=" `"` label_value `"` { "," label_name "=" `"` label_value `"` } [ "," ] "}"
] value [ timestamp ]
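A line that matches this grammar, with labels and the optional timestamp (milliseconds since epoch), is the sample from the Prometheus docs:
http_requests_total{method="post",code="200"} 1027 1395066363000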
I didn't know about optional timestamp, thanks for the pointer.
In fact, timestamps should only be used in very, very rare cases. fluent-bit should certainly not add timestamps. Agreeing on help/type and startup time.
I think it would also make sense to add buffer/queue depth as @Quentin-M suggested in #233. While you can compare ingress/egress to see whether it can keep up, you can't verify that it's not slowly backlogging, so a queue depth/buffer gauge is required.
I'd also add a message processing duration histogram (https://prometheus.io/docs/concepts/metric_types/#histogram), which will help to spot intermittent issues with the outputs (e.g. it's keeping up, but processing for a few messages takes way longer than for others, etc.).
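As a rough sketch of how those two could look in the exposition format (the metric names and values here are hypothetical, not something fluent-bit exposes today):
# TYPE fluentbit_output_buffer_depth gauge
fluentbit_output_buffer_depth{name="stdout.0"} 42
# TYPE fluentbit_output_proc_duration_seconds histogram
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="0.005"} 120
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="0.05"} 150
fluentbit_output_proc_duration_seconds_bucket{name="stdout.0",le="+Inf"} 153
fluentbit_output_proc_duration_seconds_sum{name="stdout.0"} 1.37
fluentbit_output_proc_duration_seconds_count{name="stdout.0"} 153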
I ran into an issue due to the missing metric types. The Datadog agent will ignore metrics that do not have a type associated with them. It would be great to get those added.
+1 for the queue depth and processing time metrics as well. I've seen cases where huge surges of log events can cause issues.
Please add Hostname/Nodename to the metrics for an easier way to tell which node (in Kubernetes) actually sent the logs. This makes it easier to quickly detect which node might be spamming the logs.
@Zenlil I don't agree with that approach.
Fluentbit shouldn't know where it runs. It's the responsibility of the metric collector to inject the correct information about the fluentbit instance.
Has there been any progress on this? Specifically adding #HELP/#TYPE lines. Fluent Bit metrics are ignored by Stackdriver Monitoring without them.
@Zenlil Agreeing with @loburm, this should come from the kubernetes SD. If you use the example prometheus config for kubernetes, you will have node labels on your metrics.
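For reference, a minimal sketch of such a scrape config (the job name and target label are just examples) that copies the node a pod runs on into a node label:
scrape_configs:
  - job_name: fluent-bit
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # copy the node the pod is scheduled on into a "node" label
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node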
@edsiper Do you have an estimate for when the help/type lines will be added?
Regarding point 1, is it correct to say that every Prometheus metric currently exposed is of type counter in regard to this classification?
Yes, it's correct.
Actually I think that this issue can be closed, as points 1 and 2 have been resolved by https://github.com/fluent/fluent-bit/pull/1456 and are part of the 1.3 release.
thanks @loburm . this is already in place:
# TYPE fluentbit_input_bytes_total counter
fluentbit_input_bytes_total{name="cpu.0"} 6894 1576073944888
# TYPE fluentbit_input_records_total counter
fluentbit_input_records_total{name="cpu.0"} 18 1576073944888
# TYPE fluentbit_output_errors_total counter
fluentbit_output_errors_total{name="stdout.0"} 0 1576073944888
# TYPE fluentbit_output_proc_bytes_total counter
fluentbit_output_proc_bytes_total{name="stdout.0"} 6511 1576073944888
# TYPE fluentbit_output_proc_records_total counter
fluentbit_output_proc_records_total{name="stdout.0"} 17 1576073944888
# TYPE fluentbit_output_retries_failed_total counter
fluentbit_output_retries_failed_total{name="stdout.0"} 0 1576073944888
# TYPE fluentbit_output_retries_total counter
fluentbit_output_retries_total{name="stdout.0"} 0 1576073944888
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1576073926
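Since everything except process_start_time_seconds is a counter, the usual way to consume these in Prometheus is with rate(); for example (the 5m window is just an illustration):
# per-output throughput in records/s over the last 5 minutes
rate(fluentbit_output_proc_records_total[5m])
# outputs currently producing errors
rate(fluentbit_output_errors_total[5m]) > 0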