Most of the telemetry metrics are missing when they are fetched by Prometheus because: metrics are only instantiated once the corresponding event first happens, and when prometheus_retention_time expires they also get removed.
This is bad practice for Prometheus. Prometheus expects all metrics to be present all the time, even if they are constantly 0, because if they are missing it's hard to write proper alerts.
For example, if I wanted an alert telling me that the cluster has a leader, it could look something like this:
consul_raft_state_leader > 0 and consul_raft_state_candidate == 0
This probably isn't the correct way to check this; I haven't tested it much, because I no longer see these metrics due to the retention time. Even if I set the retention to a higher value, not all of these values will be available when a server starts. For example, consul_raft_state_leader will be missing on hosts which start up and don't become the leader; it is only instantiated when a node becomes the leader. The same is true for all of the metrics.
Why not instantiate all metrics up front, drop prometheus_retention_time, and keep them for the lifetime of the process?
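To make the suggestion concrete, here is a minimal, editor-supplied sketch of what "instantiate all metrics up front" looks like with the official Prometheus Go client (client_golang). The metric names mirror Consul's, but this is not Consul's code and Consul does not currently work this way:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// The gauges are declared and registered once at startup, so they are always
// present in the /metrics output (initially 0), even on nodes where the event
// they track has never happened.
var (
	raftStateLeader = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "consul_raft_state_leader",
		Help: "1 if this server is the Raft leader, 0 otherwise.",
	})
	raftStateCandidate = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "consul_raft_state_candidate",
		Help: "1 if this server is currently a Raft candidate, 0 otherwise.",
	})
)

func main() {
	prometheus.MustRegister(raftStateLeader, raftStateCandidate)

	// Elsewhere the gauges are simply updated when state changes, e.g.
	// raftStateLeader.Set(1) on winning an election and .Set(0) on losing it.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9102", nil)
}
```

With this pattern, an alert like the one above works immediately after startup, because both series always exist.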
Just run Consul in server mode, enable telemetry, and set prometheus_retention_time > 0.
Client info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 2
build:
prerelease =
revision = 0bddfa23
version = 1.4.0
consul:
acl = enabled
known_servers = 5
server = false
runtime:
arch = amd64
cpu_count = 4
goroutines = 48
max_procs = 4
os = linux
version = go1.11.1
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 44
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 3537
members = 595
query_queue = 0
query_time = 67
Server info
agent:
check_monitors = 0
check_ttls = 0
checks = 0
services = 3
build:
prerelease =
revision = 0bddfa23
version = 1.4.0
consul:
acl = enabled
bootstrap = false
known_datacenters = 1
leader = false
leader_addr = 10.130.42.1:8300
server = true
raft:
applied_index = 7994308
commit_index = 7994308
fsm_pending = 0
last_contact = 15.405233ms
last_log_index = 7994308
last_log_term = 104
last_snapshot_index = 7993052
last_snapshot_term = 98
latest_configuration = [{Suffrage:Voter ID:b7d504c0-c8bd-6f6f-3879-a64584424560 Address:10.130.42.1:8300} {Suffrage:Voter ID:23f079be-a5c2-778c-1297-c6ef9632ba1f Address:10.130.42.2:8300} {Suffrage:Voter ID:8bff380e-8abc-4624-1d98-f35ca8e2a5ef Address:10.130.42.3:8300} {Suffrage:Voter ID:7c91a590-1986-cb35-f8ce-dbab3abe496e Address:10.130.42.4:8300} {Suffrage:Voter ID:7ff9aadc-2ec7-983f-76be-cc2f0e99df1c Address:10.130.42.0:8300}]
latest_configuration_index = 7993763
num_peers = 4
protocol_version = 3
protocol_version_max = 3
protocol_version_min = 0
snapshot_version_max = 1
snapshot_version_min = 0
state = Follower
term = 104
runtime:
arch = amd64
cpu_count = 4
goroutines = 1230
max_procs = 4
os = linux
version = go1.11.1
serf_lan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 44
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 3537
members = 595
query_queue = 0
query_time = 67
serf_wan:
coordinate_resets = 0
encrypted = false
event_queue = 0
event_time = 1
failed = 0
health_score = 0
intent_queue = 0
left = 0
member_time = 240
members = 5
query_queue = 0
query_time = 1
I'm not familiar enough with Prometheus to say confidently whether this is possible for us to change without affecting other metrics providers, but I will tag @pierresouchay to get his opinion on this.
@pearkes @kustodian unfortunately, with the current abstraction used for metrics, it sounds difficult to:
In our setup, we use a very large retention time and it is good enough for us (you might use, for instance, 1 month).
I don't have any easier solution than this (but that's why I let you configure the retention time).
I disagree on both these points.
Most metrics are things like HTTP endpoints (known+documented), SERF and RAFT metrics (known, documented). Only a very small number of metrics have ad-hoc labels applied to them (notably metrics dealing with cross-DC requests).
The metrics should be non-ephemeral. There are no monitoring systems out there that get into trouble for a metric constantly emitting zeroes. But many (RRD, Prometheus, Wavefront, Circonus, for example) do not play nice with metric values that are forgotten.
The proper way to handle metrics in prometheus is to declare them as stateful objects. They are then written to appropriately.
I understand you use go-metrics as an abstraction layer; you'll have to figure out how to operate it appropriately.
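For contrast, this is roughly the pattern the go-metrics abstraction encourages today (an editor-supplied sketch of how the armon/go-metrics Prometheus sink is typically wired up, not Consul's actual code, so treat the exact wiring as illustrative): nothing is declared up front, a metric only exists after it has been written at least once, and the sink forgets it again once the configured retention elapses.

```go
package main

import (
	"net/http"

	metrics "github.com/armon/go-metrics"
	metricsprom "github.com/armon/go-metrics/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// The Prometheus sink keeps a metric only for a limited time after it was
	// last written; this is the behaviour that prometheus_retention_time
	// configures in Consul.
	sink, err := metricsprom.NewPrometheusSink()
	if err != nil {
		panic(err)
	}
	metrics.NewGlobal(metrics.DefaultConfig("consul"), sink)

	// Nothing is pre-declared: consul_raft_state_leader appears in the scrape
	// output only after this call, i.e. only on the node that actually became
	// leader, and it vanishes again once the retention window passes without
	// a new write.
	metrics.SetGauge([]string{"raft", "state", "leader"}, 1)

	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9102", nil)
}
```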
I completely agree with @kustodian that the current way of exposing prometheus metrics is incorrect/improper for prometheus, and that it creates a lot of headaches with respect to monitoring and operating consul. :/
A comment that I do not understand is "save metrics at shutdown or something similar".
Modern monitoring systems are explicitly designed to keep as little state as possible in the monitored binary (consul in this instance) and instead solve all of this in the monitoring system. There is no saving required (and in fact, most systems may behave in unexpected ways when metrics are saved && resumed).
@nahratzah you hit the nail on the head with the go-metrics thing. Pierre was not, as I understand it, defending the current situation as ideal; he was sharing his current workaround, which is sufficient for him.
We all agree that declaring them up-front is the "correct" thing to do for prometheus and would be ideal, but changing go-metrics's abstraction sufficiently to allow for it is a lot of work in an upstream lib, followed by a lot of refactoring in Consul to use the new abstraction. I hope it will be done eventually, but it's hard to know how to prioritize it!
Contributions or thoughts on how to allow that in go-metrics are very welcome though!
Does anyone know whether using the Consul Exporter (https://github.com/prometheus/consul_exporter) would overcome this issue? That is, does it access some other API or endpoint such that it can at least offer all the metrics it advertises? Just curious, because I think defining alerts within Prometheus and developing/testing Grafana dashboards will be difficult if we can't really see some of the metrics we're interested in. Thanks!
@ntgdi: The way I see consul_exporter, I would say it monitors (consul's idea of) the services that are registered.
So it would track the load-balancer service, web-server service, database service, memcache service, etc., for things like how many instances are available and their health check state.
It does expose some consul information (notably Raft), but I would not say that's sufficient to monitor a consul cluster itself.
@nahratzah Appreciate the response. Yeah, as to whether it will meet our monitoring requirements for Consul, I'll leave that to the PM. ;) However, from a functional perspective, I did notice that the consul_exporter hits different API endpoints (than the Telemetry endpoint), so I still have hope that all the metrics the consul_exporter advertises will be available on every scrape.
Yes, prometheus_exporter does the right thing. <3
@nahratzah Does prometheus_exporter cover all the metrics you need?
I think the telemetry endpoint (metrics) provided by the consul system is not very comprehensive.
No, we never said prometheus_exporter gives us all metrics we require. In fact, it can't do that, as that would require reaching deep into the internals of consul. :)
But what it does regarding metrics, it does right. It does not lose track of metrics or reset them.
Many of the metrics we require, we require so we can set up SLAs and set up expectations (that we can be beholden to and fulfill) for clients. Consul's metric system is inadequate for this use case. :(
@nahratzah What are the metrics that you absolutely need to be initialized?
On our side, as I explained, we are using a very large retention time and it is not a very big deal (you can use 365 days if you want to). Since our clusters are usually quite loaded with recurrent patterns (I mean, the calls are most of the time the same), the data comes back very quickly.
Some metrics are quite ephemeral by nature (e.g. consul_health_service_query{service="my_database"}); for these we probably cannot do anything easily, but if you have specific ones you really need, I might try to find a solution to initialize those at startup (or when data is cleared after the retention time has elapsed).
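For what it's worth, a rough editor's sketch of what "initialize those at startup" could look like in go-metrics terms. The helper and key list are hypothetical, and, as noted above, this only helps for metrics whose names (and labels) are known up front:

```go
package main

import (
	metrics "github.com/armon/go-metrics"
)

// initGauges writes an explicit zero to each gauge once, so the metric shows
// up in the scrape output immediately instead of only after the first real
// event. It would be called right after the telemetry sink is configured, and
// would need to run again whenever the retention window clears the data.
// The key list below is illustrative, not Consul's actual metric set, and
// labelled metrics such as consul_health_service_query{service=...} cannot
// be enumerated this way.
func initGauges(keys [][]string) {
	for _, k := range keys {
		metrics.SetGauge(k, 0)
	}
}

func main() {
	initGauges([][]string{
		{"raft", "state", "leader"},
		{"raft", "state", "candidate"},
	})
}
```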
That is not a small list...
So, this is from the epic we use:
And this is from what I suspect we'll need:
And we need these metrics to all be stable.
Like, if we use a prometheus query avg(some_metric), it should correctly compute the average, not "the average except for the metrics that haven't changed for so long that we forgot them". :)
Our current dashboard is full of charts where we write something like "empty is good" at the top. (Example: our election chart.)
The thing is, I cannot distinguish between an "empty, everything is good" metric and an "empty because the binary is hanging" metric. (And yeah, we've had complete consul outages due to deadlock, where metrics were completely useless.)
But fixing all this is not a single PR kind of task. I've planned multiple quarters for this in just my team.
And this is important stuff for using consul in production and being able to rely on it. :)
Which is why a response that it's not on the roadmap is exceptionally disappointing:
I hope it will be done eventually, but hard to know how to prioritize it!
This tripped me up today: a leadership change happened and I couldn't find the metric in prometheus; it only appeared as a data point once I zoomed in on the particular time it happened.
Since Hashicorp has joined the Cloud Native Computing Foundation, it would be great to treat Prometheus as a first-class citizen. Rewriting these docs (https://www.consul.io/docs/agent/telemetry.html) with the names of the prometheus metrics would also be awesome, or at least adding a section for them.
Thanks