Vault: Telemetry: add prometheus endpoint option

Created on 29 Jun 2017 · 12 comments · Source: hashicorp/vault

This is a wishlist request for an option within Vault's telemetry configuration to expose an endpoint from which Prometheus servers can scrape metrics.

core feature-request


All 12 comments

This has been discussed previously in #1230 and #1415.

Well, if exposing a port with some metrics text is a security concern, then use the push gateway:

The right course of action there would be to enhance go-metrics to support push-gateway.

The push gateway will probably always be awkward:

The Prometheus Pushgateway allows you to push time series from these components to an intermediary job which Prometheus can scrape.

Personally I regard that as an extra moving part which can break down. Prometheus actually makes some valid points regarding push vs. pull: https://prometheus.io/docs/introduction/faq/#why-do-you-pull-rather-than-push?

@jefferai In this https://github.com/hashicorp/vault/pull/1415#issuecomment-240392894 you state:

An authenticated /v1/sys/metrics that allows access to go-metrics data wouldn't be bad. The issue with Prometheus is that it requires running network-handling code that we have no control over, and from a security perspective that's not something we wanted to bake into Vault.

Would you be open to a pull request which adds an authenticated /v1/sys/metrics endpoint that uses Vault's own network-handling code but fetches the metrics internally from go-metrics?

I like the idea of a plain, token-authenticated, HTTP/S endpoint that provides JSON-formatted metrics, agnostic to Prometheus or any other particular solution (similar to Consul).

I'm going to be using vault in a production environment (five nodes per site in HA mode backed by etcd) and will need to trigger alerts if any of the nodes needs to be unsealed.
I already use Prometheus and AlertManager so I'd like to plumb Vault into that infrastructure.
Given the lack of support for Prometheus, what's the 'blessed' alternative to do this?

@andybrown668 it's not ideal, but you can use a statsd exporter.

https://github.com/prometheus/statsd_exporter

So you have Vault push its metrics to the exporter and then have Prometheus scrape the metrics from the exporter. It's pretty ugly and makes metric collection significantly more complicated, but it does work. It requires sidecaring the exporter on the same host as the Vault instance, otherwise the host label won't be set properly.
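For reference, the push side of that setup is a one-line telemetry stanza in the Vault server config; the address below is an assumption for a statsd_exporter sidecar listening on its default UDP port:

```hcl
# Sketch: point Vault's statsd telemetry at a local statsd_exporter
# sidecar (default statsd_exporter statsd port; adjust as needed).
telemetry {
  statsd_address = "127.0.0.1:9125"
}
```

Prometheus then scrapes the exporter's own HTTP port rather than Vault itself.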

I found that using Consul service discovery made this less annoying.

Word of caution: I would not use dogstatsd exporter. If vault cannot connect to the exporter, then vault crashes which means that an exporter becomes a SPOF for vault. I opened a bug against vault and it was closed because from hashicorp's point of view this is working as expected. This problem does not occur with statsd since metrics are exported over UDP.

If you're using influxdata/telegraf, it has a statsd input plugin (acting as a statsd server); this way you get system metrics and Vault metrics in one component (vs. Prometheus node_exporter + statsd_exporter).
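A sketch of that Telegraf approach (the plugin names are real Telegraf plugins; the ports are assumptions): receive statsd from Vault on one side and expose everything on a Prometheus scrape endpoint on the other.

```toml
# Receive Vault's statsd metrics over UDP.
[[inputs.statsd]]
  protocol = "udp"
  service_address = ":8125"

# Expose all collected metrics for Prometheus to scrape.
[[outputs.prometheus_client]]
  listen = ":9273"
```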

You can use blackbox for that. So for example in the blackbox.yml you can have:

```yaml
vault_unseal:
  prober: http
  timeout: 5s
  http:
    valid_status_codes: [200, 429]
    method: GET
    no_follow_redirects: true
    fail_if_ssl: false
    fail_if_not_ssl: false
    fail_if_matches_regexp:
      - 'sealed":true'
```

The valid status codes are 200 and 429 because the standby node replies with a 429 (which is expected) and the active node with a 200.
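For completeness, a matching Prometheus scrape job for that blackbox module might look like the following (the target URL and exporter address are assumptions; the relabelling is the standard blackbox_exporter pattern):

```yaml
scrape_configs:
  - job_name: vault_sealed
    metrics_path: /probe
    params:
      module: [vault_unseal]
    static_configs:
      - targets:
          - https://vault.example.com:8200/v1/sys/health
    relabel_configs:
      # Pass the probed URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter itself.
      - target_label: __address__
        replacement: 127.0.0.1:9115
```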

The rule to trigger the alerts in Alertmanager:

```yaml
- alert: Vault_node_sealed
  expr: probe_success{job="vault_sealed"} != 1
  for: 1m
  labels:
    severity: xxx
  annotations: xxx
```

You can also use statsd-exporter to gather more specific stats and better alerts with expressions like:

```yaml
expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
```
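Wrapped into a full Prometheus alerting-rule file, that expression might look like this (the group name, alert name, duration, and annotation text are assumptions):

```yaml
groups:
  - name: vault
    rules:
      - alert: VaultLeadershipLost
        expr: sum(increase(vault_core_leadership_lost_count{job="example"}[1h])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Vault lost leadership more than 5 times in the last hour"
```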

Hope it helps.

Folks, I see that go-metrics library has some support for Prometheus https://github.com/armon/go-metrics/tree/master/prometheus . Can this be used to expose Prometheus metrics as @jefferai mentioned?
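A minimal sketch (not Vault's actual code) of how that go-metrics Prometheus sink can be wired up: the sink registers itself with the default Prometheus registry, so a standard promhttp handler exposes everything go-metrics records. The metric name and port here are assumptions.

```go
package main

import (
	"net/http"

	metrics "github.com/armon/go-metrics"
	promsink "github.com/armon/go-metrics/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// NewPrometheusSink registers the sink with the default
	// Prometheus registry under the hood.
	sink, err := promsink.NewPrometheusSink()
	if err != nil {
		panic(err)
	}
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), sink); err != nil {
		panic(err)
	}

	// Emit a sample counter through the go-metrics API.
	metrics.IncrCounter([]string{"example", "requests"}, 1)

	// Expose the collected metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9102", nil)
}
```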

As per https://coreos.com/tectonic/docs/latest/vault-operator/user/monitoring.html#alerting-rules, these metrics do not seem to exist in Vault 1.1.0. Does anyone have any recommendations for alerts other than these?

Closing this, since it has apparently been implemented in https://github.com/hashicorp/vault/pull/5308.
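For readers landing here later: with that change merged (Vault 1.1+), Prometheus-format metrics can be scraped from Vault directly once retention is enabled in the server's telemetry stanza; the snippet below is a sketch of that usage.

```shell
# Server config needs Prometheus retention enabled, e.g.:
#   telemetry {
#     prometheus_retention_time = "30s"
#     disable_hostname          = true
#   }
# Then fetch metrics with a valid token:
curl --header "X-Vault-Token: $VAULT_TOKEN" \
     "$VAULT_ADDR/v1/sys/metrics?format=prometheus"
```

Without `format=prometheus`, the same endpoint returns the metrics as JSON.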

