Thanos, Prometheus and Golang version used: v0.11.0
Object Storage Provider: GCS
What happened:
Specified Store and Sidecar storage endpoints on the command line using the DNS resolver. The Sidecar endpoint was incorrect. Query logged the addition of the Store endpoint, but nothing about the Sidecar endpoint. The Query /stores page shows only the Store endpoint.
--store.sd-dns-resolver=miekgdns
--store=dnssrv+_grpc._tcp.thanos-store-grpc.default.svc.cluster.local
--store=dnssrv+_grpc._tcp.thanos-sidecar-grpc.default.svc.cluster.local
What you expected to happen:
Query logs an error regarding DNS resolution of the bad endpoint. The /stores page provides information about the unresolved endpoint.
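For context, a `dnssrv+` store flag tells Query to periodically resolve the SRV record and treat the returned targets as StoreAPI endpoints. Below is a minimal sketch of what that lookup amounts to using the library behind the `miekgdns` resolver (github.com/miekg/dns); the record name is taken from the flags above, and the nameserver address (10.96.0.10, a common kube-dns ClusterIP) is an illustrative assumption:

```go
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// Record name from the flags above; adjust for your cluster.
	name := "_grpc._tcp.thanos-sidecar-grpc.default.svc.cluster.local"

	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeSRV)

	// 10.96.0.10:53 is an assumed kube-dns ClusterIP, not a given.
	resp, _, err := new(dns.Client).Exchange(m, "10.96.0.10:53")
	if err != nil {
		log.Fatalf("SRV lookup failed: %v", err)
	}
	// A bogus service name typically comes back as NXDOMAIN rather
	// than a transport error, so the rcode matters too.
	if resp.Rcode != dns.RcodeSuccess {
		log.Fatalf("SRV lookup for %q returned rcode %s", name, dns.RcodeToString[resp.Rcode])
	}
	for _, rr := range resp.Answer {
		if srv, ok := rr.(*dns.SRV); ok {
			fmt.Printf("target=%s port=%d\n", srv.Target, srv.Port)
		}
	}
}
```

Either of those failure paths is the condition Query should surface for the bad Sidecar endpoint.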
Agree, good point. :+1: Marking as a bug; help wanted to fix it :hugs:
Shall I go about solving this issue?
yes please!
Great! I will hop on! :stuck_out_tongue:
Hey @bwplotka! I was going through the issue, and after reproducing it, I observed the following -
Nice, so if it already logs an error, what is the problem we are trying to solve then? (:
The error is raised in resolver.go, where the DNS resolution happens, but somehow it is not propagated to the Query component. I think we need to propagate the error, as I didn't see any errors raised in the Query component's logs.
What do you mean, not propagated? There is literally a `level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err)` log line :thinking:
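To make the shape of the problem concrete, here is a simplified, hypothetical sketch of that resolution loop (names approximate, not the exact Thanos source): the error goes to the injected logger and is then dropped, so nothing reaches the caller:

```go
package dns

import (
	"context"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

// Resolver is a stand-in for the interface implemented in resolver.go.
type Resolver interface {
	Resolve(ctx context.Context, addr string) ([]string, error)
}

// Provider keeps the last successfully resolved addresses.
type Provider struct {
	logger   log.Logger
	resolver Resolver
	resolved map[string][]string
}

// Resolve mirrors the shape of the real loop: on failure the error is
// written to the injected logger and then dropped, so the caller only
// ever sees an empty result for that address.
func (p *Provider) Resolve(ctx context.Context, addrs []string) {
	resolved := map[string][]string{}
	for _, addr := range addrs {
		hosts, err := p.resolver.Resolve(ctx, addr)
		if err != nil {
			level.Error(p.logger).Log("msg", "dns resolution failed", "addr", addr, "err", err)
			continue // nothing is propagated upward
		}
		resolved[addr] = hosts
	}
	p.resolved = resolved
}
```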
Yeah, when I ran the Query component on my local machine it did raise the error, but the same does not happen when I check the logs of the Query component in the Kubernetes deployment.
Let me attach the log.
Nice! Maybe logger is not passed properly?
I am attaching some info about the investigation that I did :stuck_out_tongue:
Config details passed to thanos query
thanos query \
--grpc-address=0.0.0.0:10901 \
--http-address=0.0.0.0:9090 \
--query.replica-label=prometheus_replica \
--query.replica-label=rule_replica \
--store.sd-dns-resolver=miekgdns \
--store=dnssrv+_grpc._tcp.thanos-store.thanos.svc.cluster.local \
--store=dnssrv+_grpc._tcp.prometheus-3-service.monitoring.svc.cluster.local \
Here is the log -
level=info ts=2020-04-25T09:31:55.41307044Z caller=main.go:152 msg="Tracing will be disabled"
level=info ts=2020-04-25T09:31:55.457862986Z caller=options.go:23 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2020-04-25T09:31:55.458943584Z caller=query.go:401 msg="starting query node"
level=info ts=2020-04-25T09:31:55.45948314Z caller=intrumentation.go:48 msg="changing probe status" status=ready
level=info ts=2020-04-25T09:31:55.460033534Z caller=intrumentation.go:60 msg="changing probe status" status=healthy
level=info ts=2020-04-25T09:31:55.460226236Z caller=http.go:56 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2020-04-25T09:31:55.460227986Z caller=grpc.go:106 service=gRPC/server component=query msg="listening for StoreAPI gRPC" address=0.0.0.0:10901
level=info ts=2020-04-25T09:33:25.58191222Z caller=storeset.go:384 component=storeset msg="adding new storeAPI to query storeset" address=172.17.0.13:10901 extLset=
And here are the pods that I have deployed in minikube -
yash@kmaster kube-prome/prome-thanos $ sudo kubectl get po --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-66bff467f8-8zw8l 1/1 Running 14 18h
kube-system coredns-66bff467f8-sh9v9 1/1 Running 10 18h
kube-system etcd-kmaster 1/1 Running 5 18h
kube-system kube-apiserver-kmaster 1/1 Running 6 18h
kube-system kube-controller-manager-kmaster 1/1 Running 3 7h26m
kube-system kube-proxy-8zcsp 1/1 Running 2 18h
kube-system kube-scheduler-kmaster 1/1 Running 4 7h26m
kube-system storage-provisioner 1/1 Running 3 18h
kubernetes-dashboard dashboard-metrics-scraper-84bfdf55ff-8268b 1/1 Running 2 18h
kubernetes-dashboard kubernetes-dashboard-bc446cc64-cqdsn 1/1 Running 7 18h
monitoring alertmanager-5f7f948969-jvgbb 1/1 Running 1 7h12m
monitoring minio-2-7d5765f59c-56299 1/1 Running 1 7h13m
monitoring prometheus-0 2/2 Running 3 7h11m
monitoring prometheus-1 2/2 Running 3 7h8m
monitoring prometheus-2 2/2 Running 3 7h6m
thanos minio-85fd55b9fd-6t2wp 1/1 Running 0 50s
thanos thanos-query-77d797f89d-vj4v8 1/1 Running 0 33s
thanos thanos-store-0 0/1 Running 1 30s
As we can see, prometheus-3-service is not present, yet the Query somehow skips the error.
> Maybe logger is not passed properly?
I think that might be the reason. I am reading through the codebase now and will comment with my understanding of the possible issue :sweat_smile:
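One way that theory would explain the symptom: go-kit's `log.NewNopLogger()` discards everything written to it, so if the provider in the deployed binary ended up with a nop (or otherwise misconfigured) logger, the error line would vanish. A tiny illustration, not Thanos code:

```go
package main

import (
	"os"

	"github.com/go-kit/kit/log"
	"github.com/go-kit/kit/log/level"
)

func main() {
	// A real logger: the error line shows up on stderr.
	real := log.NewLogfmtLogger(os.Stderr)
	level.Error(real).Log("msg", "dns resolution failed", "addr", "example", "err", "no such host")

	// A nop logger: the identical call produces no output at all, which
	// would match errors appearing locally but not in the pod logs.
	nop := log.NewNopLogger()
	level.Error(nop).Log("msg", "dns resolution failed", "addr", "example", "err", "no such host")
}
```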
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity for the next week, this issue will be closed (we can always reopen an issue if we need to!). Alternatively, use the remind command if you wish to be reminded at some point in the future.
Closing for now as promised, let us know if you need this to be reopened! 🤗