Vault: Telemetry: add prometheus endpoint option

Created on 29 Jun 2017 · 12 comments · Source: hashicorp/vault

This is a wishlist request for an option within Vault's telemetry configuration to expose an endpoint from which Prometheus servers can scrape metrics.

core feature-request


All 12 comments

This has been discussed previously in #1230 and #1415.

Well, if exposing a port with some metrics text is a security concern, then use the push gateway:

The right course of action there would be to enhance go-metrics to support push-gateway.

The push gateway will probably always be awkward:

The Prometheus Pushgateway allows you to push time series from these components to an intermediary job which Prometheus can scrape.

Personally I regard that as an extra moving part which can break down. Prometheus actually makes some valid points regarding push vs. pull: https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push?

@jefferai In this https://github.com/hashicorp/vault/pull/1415#issuecomment-240392894 you state:

An authenticated /v1/sys/metrics that allows access to go-metrics data wouldn't be bad. The issue with Prometheus is that it requires running network-handling code that we have no control over, and from a security perspective that's not something we wanted to bake into Vault.

Would you be open to a pull request which adds an authenticated /v1/sys/metrics endpoint that uses Vault's own network-handling code but fetches the metrics internally from go-metrics?

I like the idea of a plain, token-authenticated, HTTP/S endpoint that provides JSON-formatted metrics, agnostic to Prometheus or any other particular solution (similar to Consul).

I'm going to be using vault in a production environment (five nodes per site in HA mode backed by etcd) and will need to trigger alerts if any of the nodes needs to be unsealed.
I already use Prometheus and AlertManager so I'd like to plumb Vault into that infrastructure.
Given the lack of support for Prometheus, what's the 'blessed' alternative to do this?

@andybrown668 it's not ideal, but you can use a statsd exporter.

https://github.com/prometheus/statsd_exporter

So you have Vault push its metrics to the exporter and then have Prometheus scrape the metrics from the exporter. It's pretty ugly and makes metric collection significantly more complicated, but it does work. It requires sidecaring the exporter on the same host as the Vault instance, otherwise the host label won't be set properly.
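For reference, the push side of that setup is a one-line telemetry stanza in the Vault server config; the address below is an assumption for a statsd_exporter sidecar listening on its default UDP port:

```hcl
# Sketch: point Vault's statsd telemetry at a local statsd_exporter
# sidecar (default statsd_exporter statsd port; adjust as needed).
telemetry {
  statsd_address = "127.0.0.1:9125"
}
```

Prometheus then scrapes the exporter's own HTTP port rather than Vault itself.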

I found that using Consul service discovery made this less annoying.

Word of caution: I would not use dogstatsd exporter. If vault cannot connect to the exporter, then vault crashes which means that an exporter becomes a SPOF for vault. I opened a bug against vault and it was closed because from hashicorp's point of view this is working as expected. This problem does not occur with statsd since metrics are exported over UDP.

If you're using influxdata/telegraf, it has a statsd input plugin (acting as a statsd server); this way you get system metrics and Vault metrics in one component (vs. Prometheus node_exporter + statsd_exporter).
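A sketch of that Telegraf approach (the plugin names are real Telegraf plugins; the ports are assumptions): receive statsd from Vault on one side and expose everything on a Prometheus scrape endpoint on the other.

```toml
# Receive Vault's statsd metrics over UDP.
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"

# Expose all collected metrics for Prometheus to scrape.
[[outputs.prometheus_client]]
  listen = ":9273"
```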

You can use blackbox for that. So for example in the blackbox.yml you can have:

```yaml
vault_unseal:
  prober: http
  timeout: 5s
  http:
    valid_status_codes: [200, 429]
    method: GET
    no_follow_redirects: true
    fail_if_ssl: false
    fail_if_not_ssl: false
    fail_if_matches_regexp:
      - 'sealed":true'
```

The valid status codes are 200 and 429 because the standby node replies with a 429 (which is expected) and the active node with a 200.
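For completeness, a matching Prometheus scrape job for that blackbox module might look like the following (the target URL and exporter address are assumptions; the relabelling is the standard blackbox_exporter pattern):

```yaml
scrape_configs:
  - job_name: vault_sealed
    metrics_path: /probe
    params:
      module: [vault_unseal]
    static_configs:
      - targets:
          - https://vault.example.com:8200/v1/sys/health
    relabel_configs:
      # Pass the probed URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```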

The rule to trigger the alerts in Alertmanager:

```yaml
- alert: Vault_node_sealed
  expr: probe_success{job="vault_sealed"} != 1
  for: 1m
  labels:
    severity: xxx
  annotations: xxx
```

You can also use statsd-exporter to gather more specific stats and better alerts with expressions like:

```yaml
expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
```
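Wrapped into a full Prometheus alerting-rule file, that expression might look like this (the group name, alert name, duration, and annotation text are assumptions):

```yaml
groups:
  - name: vault
    rules:
      - alert: VaultLeadershipLost
        expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vault lost leadership more than 5 times in the last hour"
```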

Hope it helps.

Folks, I see that go-metrics library has some support for Prometheus https://github.com/armon/go-metrics/tree/master/prometheus . Can this be used to expose Prometheus metrics as @jefferai mentioned?
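A minimal sketch (not Vault's actual code) of how that go-metrics Prometheus sink can be wired up: the sink registers itself with the default Prometheus registry, so a standard promhttp handler exposes everything go-metrics records. The metric name and port here are assumptions.

```go
package main

import (
	"net/http"

	metrics "github.com/armon/go-metrics"
	promsink "github.com/armon/go-metrics/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// NewPrometheusSink registers the sink with the default
	// Prometheus registry under the hood.
	sink, err := promsink.NewPrometheusSink()
	if err != nil {
		panic(err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), sink); err != nil {
		panic(err)
	}

	// Emit a sample counter through the go-metrics API.
	metrics.IncrCounter([]string{"example", "requests"}, 1)

	// Expose the collected metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9102", nil)
}
```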

As per https://coreos.com/tectonic/docs/latest/vault-operator/user/monitoring.html#alerting-rules, these metrics do not seem to exist in Vault 1.1.0. Does anyone have any recommendations for alerts other than these?

Closing this, since it has apparently been implemented in https://github.com/hashicorp/vault/pull/5308.
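For readers landing here later: with that change merged (Vault 1.1+), Prometheus-format metrics can be scraped from Vault directly once retention is enabled in the server's telemetry stanza; the snippet below is a sketch of that usage.

```shell
# Server config needs Prometheus retention enabled, e.g.:
#   telemetry {
#     prometheus_retention_time = "30s"
#     disable_hostname          = true
#   }
# Then fetch metrics with a valid token:
curl --header "X-Vault-Token: $VAULT_TOKEN" \
     "$VAULT_ADDR/v1/sys/metrics?format=prometheus"
```

Without `format=prometheus`, the same endpoint returns the metrics as JSON.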

