To follow up on the thread started on Slack: https://linkerd.slack.com/archives/C89RTCWJF/p1576060416021400
linkerd check starts failing randomly (I suspect this happens after cluster scaling activities):
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
➜ k -n linkerd get pod | grep prom
linkerd-prometheus-58cf694fcf-dzckq 2/2 Running 2 110m
➜ linkerd dashboard
Cannot connect to Linkerd: Error calling Prometheus from the control plane: server_error: server error: 503
Validate the install with: linkerd check
➜ linkerd logs --control-plane-component prometheus
+ linkerd linkerd-prometheus-58cf694fcf-dzckq › linkerd-proxy
+ linkerd linkerd-prometheus-58cf694fcf-dzckq › prometheus
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy time="2019-12-11T08:49:50Z" level=info msg="running version stable-2.6.0"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002786s] linkerd2_proxy::app::main using destination service at ControlAddr { addr: Name(NameAddr { name: "linkerd-dst.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local") }
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002819s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 })
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002825s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:331 host_details="(Linux 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 linkerd-prometheus-58cf694fcf-dzckq (none))"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002833s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002841s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002844s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 587, 3306}
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.934Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002851s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 587, 3306}
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.933Z caller=main.go:652 msg="Starting TSDB ..."
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.017908s] linkerd2_proxy::app::main Certified identity: linkerd-prometheus.linkerd.serviceaccount.identity.linkerd.cluster.local
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:667 fs_type=EXT4_SUPER_MAGIC
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:668 msg="TSDB started"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:738 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.797Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.798Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.799Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.800Z caller=main.go:766 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.800Z caller=main.go:621 msg="Server is ready to receive web requests."
➜ linkerd logs --control-plane-component controller --container public-api
+ linkerd linkerd-controller-786d64fc95-8g8q9 › public-api
+ linkerd linkerd-controller-786d64fc95-j5889 › public-api
+ linkerd linkerd-controller-786d64fc95-q44jt › public-api
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:29:37Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:30:36Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:30:59Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="starting HTTP server on :8085"
➜ k -n linkerd describe deploy/linkerd-prometheus
Containers:
prometheus:
Image: prom/prometheus:v2.11.1
Port: 9090/TCP
Host Port: 0/TCP
Args:
--storage.tsdb.path=/data
--storage.tsdb.retention.time=6h
--config.file=/etc/prometheus/prometheus.yml
--log.level=info
Limits:
cpu: 4
memory: 8Gi
Requests:
cpu: 300m
memory: 300Mi
linkerd-proxy:
Image: gcr.io/linkerd-io/proxy:stable-2.6.0
Ports: 4143/TCP, 4191/TCP
Host Ports: 0/TCP, 0/TCP
Limits:
cpu: 1
memory: 250Mi
Requests:
cpu: 100m
memory: 20Mi
Jumping onto the Grafana pod (the 405 below shows Prometheus itself is reachable; it just rejects HEAD requests):
$ curl http://linkerd-prometheus:9090 -I
HTTP/1.1 405 Method Not Allowed
Allow: GET, OPTIONS
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 12 Dec 2019 08:43:03 GMT
Content-Length: 19
Port-forwarding to svc/linkerd-prometheus also works, and I can execute queries.
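In case it helps with reproduction, a minimal sketch of that check, assuming the default linkerd namespace; the PromQL query is the same one the public-api logs above show failing:
$ kubectl -n linkerd port-forward svc/linkerd-prometheus 9090:9090 &
$ # run the same query the public-api issues against Prometheus
$ curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=max(process_start_time_seconds{}) by (pod, namespace)'
A healthy Prometheus answers with "status":"success"; while the check is failing, the same query routed through the control plane gets the 503 shown above.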
Eventually a lucky pod restart brings Prometheus back (sometimes it takes 20+ attempts).
linkerd install flags:
install: |
{"uuid":"926ff43d-fae4-443b-943c-2c8051d2aff4","cliVersion":"stable-2.6.0","flags":[{"name":"ha","value":"true"},{"name":"image-pull-policy","value":"Always"},{"name":"linkerd-cni-enabled","value":"true"}]}
Kubernetes 1.14.9, Calico 3.7.5, AWS, cluster spun up across 3 AZs, cluster-autoscaler in action (10-25 workers), CoreOS stable
@grampelberg Any updates on this one?
I am experiencing an issue with exactly the same symptoms. Is there any workaround available?
We're still unable to replicate. If you have a replication process we can follow step by step, that would be super helpful.
It is actually the controller pod which needs a restart, not Prometheus.
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11942.078662s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11942.282051s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11944.026353s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline
linkerd-controller-7f7cf47fd6-8nt6k public-api time="2020-01-07T14:32:11Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11945.525575s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11947.840922s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
After killing that pod, everything works.
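For anyone who needs that workaround as a command, a sketch of the restart (assuming the standard linkerd.io/control-plane-component pod label):
$ # delete the controller pods; the deployment recreates them
$ kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=controller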
Bump, still happens with Kubernetes v1.15.7 and Calico v3.9.4.
Same thing here; can't really tell what triggered it, though.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I bet we have this fixed now; there was a timing error between liveness checks and Prometheus probes. Give the latest edge a try and it should be okay. I'll close this out for now; feel free to reopen if you're still experiencing issues.
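If you want to try that, a sketch of moving to the latest edge, assuming the standard CLI install path:
$ # fetch the latest edge CLI
$ curl -sL https://run.linkerd.io/install-edge | sh
$ # regenerate the control plane manifests with the new CLI and apply them
$ linkerd upgrade | kubectl apply -f -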
This issue is still present. Killing the controller pods works.