To follow up on the thread started on Slack: https://linkerd.slack.com/archives/C89RTCWJF/p1576060416021400
linkerd check starts failing randomly (I suspect this happens after cluster scaling activities):
× [prometheus] control plane can talk to Prometheus
Error calling Prometheus from the control plane: server_error: server error: 503
see https://linkerd.io/checks/#l5d-api-control-api for hints
➜ k -n linkerd get pod | grep prom
linkerd-prometheus-58cf694fcf-dzckq 2/2 Running 2 110m
➜ linkerd dashboard
Cannot connect to Linkerd: Error calling Prometheus from the control plane: server_error: server error: 503
Validate the install with: linkerd check
➜ linkerd logs --control-plane-component prometheus
+ linkerd linkerd-prometheus-58cf694fcf-dzckq › linkerd-proxy
+ linkerd linkerd-prometheus-58cf694fcf-dzckq › prometheus
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy time="2019-12-11T08:49:50Z" level=info msg="running version stable-2.6.0"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:329 msg="Starting Prometheus" version="(version=2.11.1, branch=HEAD, revision=e5b22494857deca4b806f74f6e3a6ee30c251763)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002786s] linkerd2_proxy::app::main using destination service at ControlAddr { addr: Name(NameAddr { name: "linkerd-dst.linkerd.svc.cluster.local", port: 8086 }), identity: Some("linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local") }
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002819s] linkerd2_proxy::app::main using identity service at Name(NameAddr { name: "linkerd-identity.linkerd.svc.cluster.local", port: 8080 })
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:330 build_context="(go=go1.12.7, user=root@d94406f2bb6f, date=20190710-13:51:17)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002825s] linkerd2_proxy::app::main routing on V4(127.0.0.1:4140)
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:331 host_details="(Linux 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019 x86_64 linkerd-prometheus-58cf694fcf-dzckq (none))"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002833s] linkerd2_proxy::app::main proxying on V4(0.0.0.0:4143) to None
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:332 fd_limits="(soft=1048576, hard=1048576)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002841s] linkerd2_proxy::app::main serving admin endpoint metrics on V4(0.0.0.0:4191)
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.928Z caller=main.go:333 vm_limits="(soft=unlimited, hard=unlimited)"
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002844s] linkerd2_proxy::app::main protocol detection disabled for inbound ports {25, 587, 3306}
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.934Z caller=web.go:448 component=web msg="Start listening for connections" address=0.0.0.0:9090
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.002851s] linkerd2_proxy::app::main protocol detection disabled for outbound ports {25, 587, 3306}
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:34.933Z caller=main.go:652 msg="Starting TSDB ..."
linkerd-prometheus-58cf694fcf-dzckq linkerd-proxy INFO [ 0.017908s] linkerd2_proxy::app::main Certified identity: linkerd-prometheus.linkerd.serviceaccount.identity.linkerd.cluster.local
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:667 fs_type=EXT4_SUPER_MAGIC
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:668 msg="TSDB started"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.789Z caller=main.go:738 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.797Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.798Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.799Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.800Z caller=main.go:766 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
linkerd-prometheus-58cf694fcf-dzckq prometheus level=info ts=2019-12-11T08:49:37.800Z caller=main.go:621 msg="Server is ready to receive web requests."
➜ linkerd logs --control-plane-component controller --container public-api
+ linkerd linkerd-controller-786d64fc95-8g8q9 › public-api
+ linkerd linkerd-controller-786d64fc95-j5889 › public-api
+ linkerd linkerd-controller-786d64fc95-q44jt › public-api
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-j5889 public-api time="2019-12-11T08:50:07Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:06Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T08:40:07Z" level=info msg="starting HTTP server on :8085"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:29:37Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:30:36Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-8g8q9 public-api time="2019-12-11T10:30:59Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="running version stable-2.6.0"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="Using cluster domain: cluster.local"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="waiting for caches to sync"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="caches synced"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="starting admin server on :9995"
linkerd-controller-786d64fc95-q44jt public-api time="2019-12-11T08:50:50Z" level=info msg="starting HTTP server on :8085"
➜ k -n linkerd describe deploy/linkerd-prometheus
Containers:
prometheus:
Image: prom/prometheus:v2.11.1
Port: 9090/TCP
Host Port: 0/TCP
Args:
--storage.tsdb.path=/data
--storage.tsdb.retention.time=6h
--config.file=/etc/prometheus/prometheus.yml
--log.level=info
Limits:
cpu: 4
memory: 8Gi
Requests:
cpu: 300m
memory: 300Mi
linkerd-proxy:
Image: gcr.io/linkerd-io/proxy:stable-2.6.0
Ports: 4143/TCP, 4191/TCP
Host Ports: 0/TCP, 0/TCP
Limits:
cpu: 1
memory: 250Mi
Requests:
cpu: 100m
memory: 20Mi
Jumping onto the Grafana pod (the 405 below shows Prometheus itself is reachable; it just rejects HEAD requests):
$ curl http://linkerd-prometheus:9090 -I
HTTP/1.1 405 Method Not Allowed
Allow: GET, OPTIONS
Content-Type: text/plain; charset=utf-8
X-Content-Type-Options: nosniff
Date: Thu, 12 Dec 2019 08:43:03 GMT
Content-Length: 19
Port-forwarding to svc/linkerd-prometheus also works, and I can execute queries.
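In case it helps with reproduction, a minimal sketch of that check, assuming the default linkerd namespace; the PromQL query is the same one the public-api logs above show failing:
$ kubectl -n linkerd port-forward svc/linkerd-prometheus 9090:9090 &
$ # run the same query the public-api issues against Prometheus
$ curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=max(process_start_time_seconds{}) by (pod, namespace)'
A healthy Prometheus answers with "status":"success"; while the check is failing, the same query routed through the control plane gets the 503 shown above.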
Eventually a lucky pod restart brings Prometheus back (sometimes it takes 20+ attempts).
linkerd install flags:
install: |
{"uuid":"926ff43d-fae4-443b-943c-2c8051d2aff4","cliVersion":"stable-2.6.0","flags":[{"name":"ha","value":"true"},{"name":"image-pull-policy","value":"Always"},{"name":"linkerd-cni-enabled","value":"true"}]}
Kubernetes 1.14.9, Calico 3.7.5, AWS, cluster spun up across 3 AZs, cluster-autoscaler in action (10-25 workers), CoreOS stable
@grampelberg Any updates on this one?
I am experiencing an issue with exactly the same symptoms. Is there any workaround available?
We're still unable to replicate. If you have a replication process we can follow step by step, that would be super helpful.
It is actually the controller pod which needs a restart, not Prometheus.
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11942.078662s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11942.282051s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11944.026353s] linkerd2_app_core::errors request aborted because it reached the configured dispatch deadline
linkerd-controller-7f7cf47fd6-8nt6k public-api time="2020-01-07T14:32:11Z" level=error msg="Query(max(process_start_time_seconds{}) by (pod, namespace)) failed with: server_error: server error: 503"
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11945.525575s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
linkerd-controller-7f7cf47fd6-8nt6k linkerd-proxy WARN [ 11947.840922s] linkerd2_app_core::profiles error fetching profile: Status { code: Unknown, message: "the request could not be dispatched in a timely fashion" }
After killing that pod, everything works.
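For anyone who needs that workaround as a command, a sketch of the restart (assuming the standard linkerd.io/control-plane-component pod label):
$ # delete the controller pods; the deployment recreates them
$ kubectl -n linkerd delete pod -l linkerd.io/control-plane-component=controller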
Bump, still happens with Kubernetes v1.15.7 and Calico v3.9.4.
Same thing here; can't really tell what triggered it, though.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I bet we have this fixed now; there was a timing error between liveness checks and Prometheus probes. Give the latest edge a try and it should be okay. I'll close this out for now; feel free to reopen if you're still experiencing issues.
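If you want to try that, a sketch of moving to the latest edge, assuming the standard CLI install path:
$ # fetch the latest edge CLI
$ curl -sL https://run.linkerd.io/install-edge | sh
$ # regenerate the control plane manifests with the new CLI and apply them
$ linkerd upgrade | kubectl apply -f -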
This issue is still present. Killing the controller pods works.