Thanos: Query does not log undiscoverable store

Created on 9 Apr 2020 · 14 comments · Source: thanos-io/thanos

Thanos, Prometheus and Golang version used: v0.11.0
Object Storage Provider: GCS

What happened:
Specified the Store and Sidecar storage endpoints on the command line using the DNS resolver. The Sidecar endpoint was incorrect. Query logged the addition of the Store endpoint, but nothing about the Sidecar endpoint. The Query /stores page shows only the Store endpoint.

        --store.sd-dns-resolver=miekgdns
        --store=dnssrv+_grpc._tcp.thanos-store-grpc.default.svc.cluster.local
        --store=dnssrv+_grpc._tcp.thanos-sidecar-grpc.default.svc.cluster.local

What you expected to happen:
Query logs an error regarding DNS resolution of the bad endpoint. The /stores page provides information about the unresolved endpoint.

bug query easy help wanted stale

All 14 comments

Agree, good point. :+1: Marking as bug, help wanted to fix it :hugs:

Shall I go about solving this issue?

yes please!

On Mon, 20 Apr 2020 at 10:54, Yash Sharma notifications@github.com wrote:

Shall I go about solving this issue?


Great! I will hop on! :stuck_out_tongue:

Hey @bwplotka! I was going through the issue, and after reproducing it, I observed the following -

  • The DNS resolution raises an error (here), where it seems to log the error and move on with the other DNS resolutions.
    I was thinking of passing the error up to this function, where, if it detects an error, it logs it from the Query component.
    What do you think of the approach?

Nice, so if it already logs an error… what is the problem we are trying to solve then? (:

So the error is raised from the resolver.go file, where the DNS resolution happens, but somehow that error is not propagated to the Query component. So I think we need to propagate the error, as I didn't see any errors raised in the logs of the Query component.

What do you mean, not propagated? There is literally a level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err) log line :thinking:

Yeah, when I ran the Query component on my local machine, it did raise the error, but the same does not happen when I check the logs of the Query component in the Kubernetes deployment.

Let me attach the log.

Nice! Maybe logger is not passed properly?
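
A logger that drops everything would produce exactly this symptom. As a hypothetical illustration (these types are a sketch, not the actual Thanos wiring), a provider constructed with a nop logger swallows the resolution error silently, even though the level.Error(...).Log(...) call is executed:

```go
package main

import "fmt"

// Logger mirrors the shape of a go-kit style logger, as used by Thanos.
type Logger interface {
	Log(keyvals ...interface{}) error
}

// nopLogger drops every log line, like go-kit's log.NewNopLogger().
type nopLogger struct{}

func (nopLogger) Log(...interface{}) error { return nil }

// memLogger records log lines so we can inspect them.
type memLogger struct{ lines *[]string }

func (m memLogger) Log(keyvals ...interface{}) error {
	*m.lines = append(*m.lines, fmt.Sprintln(keyvals...))
	return nil
}

// provider stands in (hypothetically) for the DNS provider that owns the logger.
type provider struct{ logger Logger }

func (p *provider) resolve(addr string) {
	// Pretend the resolution failed; the only trace is this log line.
	p.logger.Log("msg", "dns resolution failed", "addr", addr)
}

func main() {
	// Wired with a nop logger, the failure leaves no trace at all:
	(&provider{logger: nopLogger{}}).resolve("bad.endpoint")

	// Wired with a working logger, the failure is visible:
	var lines []string
	(&provider{logger: memLogger{&lines}}).resolve("bad.endpoint")
	fmt.Print(lines[0])
}
```

If the Kubernetes deployment's Query binary somehow ends up with the first kind of logger in that code path, the log line would never appear, which matches what the attached logs show.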

I am attaching some info about the investigation that I did :stuck_out_tongue:

Config details passed to thanos query

thanos query \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:9090 \
--query.replica-label=prometheus_replica \
--query.replica-label=rule_replica \
--store.sd-dns-resolver=miekgdns \
--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local \
--store=dnssrv+_grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local

Here is the log -


level=info ts=2020-04-25T09:31:55.41307044Z caller=main.go:152 msg="Tracing will be disabled"
level=info ts=2020-04-25T09:31:55.457862986Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-04-25T09:31:55.458943584Z caller=query.go:401 msg="starting query node"
level=info ts=2020-04-25T09:31:55.45948314Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-04-25T09:31:55.460033534Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2020-04-25T09:31:55.460226236Z caller=http.go:56 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2020-04-25T09:31:55.460227986Z caller=grpc.go:106 service=gRPC/server component=query msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2020-04-25T09:33:25.58191222Z caller=storeset.go:384 component=storeset msg="adding new storeAPI to query storeset" address=172.17.0.13:10901 extLset=

And here are the pods that I have deployed in minikube -

yash@kmaster kube-prome/prome-thanos $ sudo kubectl get po --all-namespaces
NAMESPACE              NAME                                         READY   STATUS    RESTARTS   AGE
kube-system            coredns-66bff467f8-8zw8l                     1/1     Running   14         18h
kube-system            coredns-66bff467f8-sh9v9                     1/1     Running   10         18h
kube-system            etcd-kmaster                                 1/1     Running   5          18h
kube-system            kube-apiserver-kmaster                       1/1     Running   6          18h
kube-system            kube-controller-manager-kmaster              1/1     Running   3          7h26m
kube-system            kube-proxy-8zcsp                             1/1     Running   2          18h
kube-system            kube-scheduler-kmaster                       1/1     Running   4          7h26m
kube-system            storage-provisioner                          1/1     Running   3          18h
kubernetes-dashboard   dashboard-metrics-scraper-84bfdf55ff-8268b   1/1     Running   2          18h
kubernetes-dashboard   kubernetes-dashboard-bc446cc64-cqdsn         1/1     Running   7          18h
monitoring             alertmanager-5f7f948969-jvgbb                1/1     Running   1          7h12m
monitoring             minio-2-7d5765f59c-56299                     1/1     Running   1          7h13m
monitoring             prometheus-0                                 2/2     Running   3          7h11m
monitoring             prometheus-1                                 2/2     Running   3          7h8m
monitoring             prometheus-2                                 2/2     Running   3          7h6m
thanos                 minio-85fd55b9fd-6t2wp                       1/1     Running   0          50s
thanos                 thanos-query-77d797f89d-vj4v8                1/1     Running   0          33s
thanos                 thanos-store-0                               0/1     Running   1          30s

As we can see, prometheus-3-service is not present, yet Query somehow skips the error.

Maybe logger is not passed properly?

I think that might be the reason. I am reading through the codebase now and will comment with my understanding of the possible issue :sweat_smile:

Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use the remind command if you wish to be reminded at some point in future.

Closing for now as promised, let us know if you need this to be reopened! 🤗
