Thanos, Prometheus and Golang version used:
$ thanos --version
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
build user: root@ee9c796b3048
build date: 20200622-09:49:32
go version: go1.14.2
What happened:
There is no metric available in the Thanos querier to determine whether any particular Thanos sidecar node is unreachable or down.
thanos_store_nodes_grpc_connections is available, but it only shows the connection count for reachable/live nodes; it does not report a value of 0 when a connection could not be established.
What you expected to happen:
There should be a metric available to determine that a Thanos sidecar is not reachable.
Full logs to relevant components:
Logs showing that a Thanos sidecar is not reachable due to firewall issues:
Aug 11 08:49:22 thanos1 thanos: level=warn ts=2020-08-11T08:49:22.813797202Z caller=storeset.go:429 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from prom1a:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prom1a:10901
Aug 11 08:49:22 thanos1 thanos: level=warn ts=2020-08-11T08:49:22.814251401Z caller=storeset.go:429 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from prom1b:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prom1b:10901
$ curl localhost:19192/metrics | grep -e thanos_store_nodes_grpc_connections -e thanos_status -e thanos_querier_store_apis_dns_provider_results
# HELP thanos_querier_store_apis_dns_provider_results The number of resolved endpoints for each configured address
# TYPE thanos_querier_store_apis_dns_provider_results gauge
thanos_querier_store_apis_dns_provider_results{addr="AAprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="AAprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="BBprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="BBprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="CCprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="CCprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="DDprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="DDprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="prom1a:10901"} 1 # <- these two servers do not
thanos_querier_store_apis_dns_provider_results{addr="prom1b:10901"} 1 # <- appear in thanos_store_nodes_grpc_connections
thanos_querier_store_apis_dns_provider_results{addr="EEprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="EEprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="FFprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="FFprom2:10901"} 1
# HELP thanos_status Represents status (0 indicates failure, 1 indicates success) of the component.
# TYPE thanos_status gauge
thanos_status{check="healthy",component="query"} 1
thanos_status{check="ready",component="query"} 1
# HELP thanos_store_nodes_grpc_connections Number of gRPC connection to Store APIs. Opened connection means healthy store APIs available for Querier.
# TYPE thanos_store_nodes_grpc_connections gauge
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"AAprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"AAprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"BBprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"BBprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"CCprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"CCprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"DDprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"DDprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"EEprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"EEprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"FFprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"FFprom\", replica=\"2\"}",store_type="sidecar"} 1
This feature is even more relevant for statically configured stores (provided via the --store or --store.sd-files CLI options): failed connections do not show up in the metrics at all.
This feature would greatly help with monitoring and alerting on failing stores.
A workaround for the above would be to look for thanos_querier_store_apis_dns_provider_results series that have no matching thanos_store_nodes_grpc_connections series, but that would not work for statically configured stores.
What about asserting a certain amount of thanos_store_nodes_grpc_connections for certain external labels? Those are static, right? (: So you can produce an alert for those, no?
Sure, it is possible to write some weird alert rule like:
count(thanos_store_nodes_grpc_connections) by (instance) < count(thanos_querier_store_apis_dns_provider_results) by (instance)
I would guess I could even extract and relabel additional labels (like monitor or addr, which are named differently and also contain slightly different values) if I invested more time in it, but I think store availability is important enough for the querier service to warrant a dedicated metric. It would be even better if there were more metrics than just 0 or 1; here is an example of Prometheus Alertmanager metrics:
prometheus_notifications_errors_total{alertmanager="http://alert1:9093/api/v1/alerts"} 301
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.5"} 0.00158349
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.9"} 0.008920719
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.99"} 0.022303124
prometheus_notifications_latency_seconds_sum{alertmanager="http://alert1:9093/api/v1/alerts"} 435.2289226969988
prometheus_notifications_latency_seconds_count{alertmanager="http://alert1:9093/api/v1/alerts"} 3804
prometheus_notifications_sent_total{alertmanager="http://alert1:9093/api/v1/alerts"} 104598
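For illustration only, per-store metrics of that shape could be exposed by the querier with client_golang roughly as follows; the metric names, the address label and the recordInfoCall helper are hypothetical examples, not existing Thanos code:

package storemetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-address metrics, mirroring the Alertmanager-style metrics above.
var (
	storeInfoErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "thanos_querier_store_info_fetch_errors_total",
		Help: "Total number of failed Info calls to a store node.",
	}, []string{"address"})

	storeInfoLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_querier_store_info_fetch_duration_seconds",
		Help:    "Duration of Info calls to a store node.",
		Buckets: prometheus.DefBuckets,
	}, []string{"address"})
)

func init() {
	prometheus.MustRegister(storeInfoErrors, storeInfoLatency)
}

// recordInfoCall would be invoked after every metadata refresh of a store node,
// so that both latency and failures are visible per address.
func recordInfoCall(address string, took time.Duration, err error) {
	storeInfoLatency.WithLabelValues(address).Observe(took.Seconds())
	if err != nil {
		storeInfoErrors.WithLabelValues(address).Inc()
	}
}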
We are happy to discuss improvements here. Back then we were opposed to having an address label in connections, but I feel like this is really the only way to do this. I think we can consider such a change now.
What I would see as your alert for the current code is literally:
sum(thanos_store_nodes_grpc_connections) < 12
(:
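A minimal sketch of that address-label idea, assuming the storeset knows the full list of configured addresses and which of them are currently healthy; this only illustrates the proposal, it is not the actual Thanos implementation:

package storemetrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical address-labelled variant of the connections gauge, so that an
// unreachable store is reported explicitly as 0 instead of disappearing.
var storeNodeConnections = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_nodes_grpc_connections",
	Help: "Number of gRPC connections per configured Store API address.",
}, []string{"address", "store_type"})

func init() {
	prometheus.MustRegister(storeNodeConnections)
}

// updateConnections would run on every storeset refresh: every configured
// address gets a sample, and addresses whose dial/Info check failed report 0.
// The store type is hard-coded to "sidecar" here purely to keep the sketch short.
func updateConnections(configured []string, healthy map[string]bool) {
	storeNodeConnections.Reset()
	for _, addr := range configured {
		v := 0.0
		if healthy[addr] {
			v = 1.0
		}
		storeNodeConnections.WithLabelValues(addr, "sidecar").Set(v)
	}
}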
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
cc @s-urbaniak as we talked about this recently :hugs:
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Closing for now as promised; let us know if you need this to be reopened! 🤗