Thanos, Prometheus and Golang version used:
$ thanos --version
thanos, version 0.13.0 (branch: HEAD, revision: adf6facb8d6bf44097aae084ec091ac3febd9eb8)
build user: root@ee9c796b3048
build date: 20200622-09:49:32
go version: go1.14.2
What happened:
There is no metric available in the Thanos querier to determine whether any particular Thanos sidecar node is unreachable or down.
thanos_store_nodes_grpc_connections is available, but it only shows the connection count for reachable/live nodes; it does not report a value of 0 when a connection could not be established.
What you expected to happen:
There should be a metric available to determine that a Thanos sidecar is not reachable.
Full logs to relevant components:
Logs showing that a Thanos sidecar is not reachable due to firewall issues:
Aug 11 08:49:22 thanos1 thanos: level=warn ts=2020-08-11T08:49:22.813797202Z caller=storeset.go:429 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from prom1a:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prom1a:10901
Aug 11 08:49:22 thanos1 thanos: level=warn ts=2020-08-11T08:49:22.814251401Z caller=storeset.go:429 component=storeset msg="update of store node failed" err="getting metadata: fetching store info from prom1b:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=prom1b:10901
$ curl localhost:19192/metrics | grep -e thanos_store_nodes_grpc_connections -e thanos_status -e thanos_querier_store_apis_dns_provider_results
# HELP thanos_querier_store_apis_dns_provider_results The number of resolved endpoints for each configured address
# TYPE thanos_querier_store_apis_dns_provider_results gauge
thanos_querier_store_apis_dns_provider_results{addr="AAprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="AAprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="BBprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="BBprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="CCprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="CCprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="DDprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="DDprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="prom1a:10901"} 1 # <- these two servers do not
thanos_querier_store_apis_dns_provider_results{addr="prom1b:10901"} 1 # <- appear in thanos_store_nodes_grpc_connections
thanos_querier_store_apis_dns_provider_results{addr="EEprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="EEprom2:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="FFprom1:10901"} 1
thanos_querier_store_apis_dns_provider_results{addr="FFprom2:10901"} 1
# HELP thanos_status Represents status (0 indicates failure, 1 indicates success) of the component.
# TYPE thanos_status gauge
thanos_status{check="healthy",component="query"} 1
thanos_status{check="ready",component="query"} 1
# HELP thanos_store_nodes_grpc_connections Number of gRPC connection to Store APIs. Opened connection means healthy store APIs available for Querier.
# TYPE thanos_store_nodes_grpc_connections gauge
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"AAprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"AAprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"BBprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"BBprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"CCprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"CCprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"DDprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"DDprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"EEprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"EEprom\", replica=\"2\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"FFprom\", replica=\"1\"}",store_type="sidecar"} 1
thanos_store_nodes_grpc_connections{external_labels="{monitor=\"FFprom\", replica=\"2\"}",store_type="sidecar"} 1
This feature is even more relevant for statically configured stores (provided via the --store or --store.sd-files CLI options): failed connections do not show up in the metrics at all.
This feature would greatly help with monitoring and alerting on failing stores.
A workaround for the above would be to look for thanos_querier_store_apis_dns_provider_results series that have no matching thanos_store_nodes_grpc_connections series, but that would not work for statically configured stores.
What about asserting a certain amount of thanos_store_nodes_grpc_connections for certain external labels? Those are static, right? (: So you can produce an alert for those, no?
Sure, it is possible to write some weird alert rule like:
count(thanos_store_nodes_grpc_connections) by (instance) < count(thanos_querier_store_apis_dns_provider_results) by (instance)
I would guess I could even extract and relabel additional labels (like monitor or addr, which are named differently and also contain slightly different values) if I invested more time in it, but I think store availability is important enough for the querier service to warrant a dedicated metric. It would be even better if there were more metrics than just 0 or 1; here is an example of Prometheus Alertmanager metrics:
prometheus_notifications_errors_total{alertmanager="http://alert1:9093/api/v1/alerts"} 301
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.5"} 0.00158349
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.9"} 0.008920719
prometheus_notifications_latency_seconds{alertmanager="http://alert1:9093/api/v1/alerts",quantile="0.99"} 0.022303124
prometheus_notifications_latency_seconds_sum{alertmanager="http://alert1:9093/api/v1/alerts"} 435.2289226969988
prometheus_notifications_latency_seconds_count{alertmanager="http://alert1:9093/api/v1/alerts"} 3804
prometheus_notifications_sent_total{alertmanager="http://alert1:9093/api/v1/alerts"} 104598
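For illustration only, per-store metrics of that shape could be exposed by the querier with client_golang roughly as follows; the metric names, the address label and the recordInfoCall helper are hypothetical examples, not existing Thanos code:

package storemetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical per-address metrics, mirroring the Alertmanager-style metrics above.
var (
	storeInfoErrors = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "thanos_querier_store_info_fetch_errors_total",
		Help: "Total number of failed Info calls to a store node.",
	}, []string{"address"})

	storeInfoLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "thanos_querier_store_info_fetch_duration_seconds",
		Help:    "Duration of Info calls to a store node.",
		Buckets: prometheus.DefBuckets,
	}, []string{"address"})
)

func init() {
	prometheus.MustRegister(storeInfoErrors, storeInfoLatency)
}

// recordInfoCall would be invoked after every metadata refresh of a store node,
// so that both latency and failures are visible per address.
func recordInfoCall(address string, took time.Duration, err error) {
	storeInfoLatency.WithLabelValues(address).Observe(took.Seconds())
	if err != nil {
		storeInfoErrors.WithLabelValues(address).Inc()
	}
}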
We are happy to discuss improvements here. Back then we were opposed to having an address label in connections, but I feel like this is really the only way to do this. I think we can consider such a change now.
What I would see as your alert for the current code is literally:
sum(thanos_store_nodes_grpc_connections) < 12
(:
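A minimal sketch of that address-label idea, assuming the storeset knows the full list of configured addresses and which of them are currently healthy; this only illustrates the proposal, it is not the actual Thanos implementation:

package storemetrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical address-labelled variant of the connections gauge, so that an
// unreachable store is reported explicitly as 0 instead of disappearing.
var storeNodeConnections = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "thanos_store_nodes_grpc_connections",
	Help: "Number of gRPC connections per configured Store API address.",
}, []string{"address", "store_type"})

func init() {
	prometheus.MustRegister(storeNodeConnections)
}

// updateConnections would run on every storeset refresh: every configured
// address gets a sample, and addresses whose dial/Info check failed report 0.
// The store type is hard-coded to "sidecar" here purely to keep the sketch short.
func updateConnections(configured []string, healthy map[string]bool) {
	storeNodeConnections.Reset()
	for _, addr := range configured {
		v := 0.0
		if healthy[addr] {
			v = 1.0
		}
		storeNodeConnections.WithLabelValues(addr, "sidecar").Set(v)
	}
}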
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
cc @s-urbaniak as we talked about this recently :hugs:
Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Closing for now as promised; let us know if you need this to be reopened! 🤗