Linkerd2: Grafana pod in CrashLoopBackOff periodically

Created on 9 Oct 2018 · 9 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

The Grafana pod periodically enters CrashLoopBackOff. If I delete the pod, it starts working again, but eventually it crashes again.

How can it be reproduced?

Running Linkerd stable-2.0.0.
After a period of time, Grafana fails, and I can only get it working again by deleting the pod (see the sketch below).
This has happened consistently across multiple clusters in my demo environment.
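For reference, recovering looks roughly like this (the pod name is a placeholder; the deployment recreates the pod after deletion):

# find the crashing Grafana pod in the control-plane namespace
kubectl -n linkerd get pods
# delete it; the grafana deployment recreates it and it comes back healthy
kubectl -n linkerd delete pod grafana-5b7d796646-xxxxx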

Logs, error output, etc

It looks like the Grafana liveness probe failed:

Events:
  Type     Reason     Age               From                               Message
  ----     ------     ----              ----                               -------
  Warning  Unhealthy  5m (x9 over 6m)   kubelet, aks-nodepool1-34340261-2  Liveness probe failed: Get http://10.244.3.4:3000/api/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I also see these errors from the linkerd-proxy container in the Grafana pod:

ERR! proxy={server=in listen=0.0.0.0:4143 remote=10.244.2.4:36412} linkerd2_proxy::proxy::http::router service error: an error occurred trying to connect: Connection refused (os error 111)
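For anyone trying to reproduce, the events and proxy logs above can be pulled with something like the following (pod name is a placeholder):

# pod events, including the liveness-probe failures
kubectl -n linkerd describe pod grafana-5b7d796646-xxxxx
# logs from the injected proxy sidecar
kubectl -n linkerd logs grafana-5b7d796646-xxxxx -c linkerd-proxy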

linkerd check output

kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[retry] -- The "grafana" pod's "grafana" container is not ready
...etc.

Environment

  • Kubernetes Version: 1.10.7
  • Cluster Environment: AKS
  • Host OS: Ubuntu
  • Linkerd version: stable-2.0.0

Possible solution

Additional context

Labels: area/controller, bug, wontfix

All 9 comments

@chzbrgr71 are you running with TLS turned on?

Yes, this cluster is set up with TLS on. I think the other clusters were as well, but I'm not sure.

Hmmm, I've not been able to replicate this yet with (roughly the commands sketched below):

  • AKS cluster @ 1.10.7, 1.11.3
  • linkerd install --tls=optional
  • emojivoto
  • Viewing deploy/emoji in grafana
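
For completeness, that repro is approximately the following (the emojivoto manifest URL is the standard demo one, included here as an assumption):

# install the control plane with optional TLS
linkerd install --tls=optional | kubectl apply -f -
# deploy the demo app with the proxy injected
curl -sL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -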

Any other tips?

I have a potential clue, as I am about to demo this in front of a large room. :-)

I just checked, and the Grafana pod was in CrashLoopBackOff. The difference this time is that I am running load tests against my application. It could be a coincidence, since Grafana itself is not being load tested, but I am seeing this consistently.

There appear to be two separate bugs here.

  • Grafana crashes (for reasons I've not been able to track down yet).
  • Linkerd holds the socket but appears to stop forwarding; liveness checks fail continuously and the pod never becomes healthy again (a quick check is sketched below). See #1762
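
One way to tell the two apart (an assumption on my part: kubectl port-forward traffic enters over loopback, so it should bypass the proxy's inbound iptables redirect) is to probe Grafana's health endpoint directly; if it answers while the kubelet's probe keeps failing, the proxy is the likely culprit. Pod name is a placeholder:

# forward local port 3000 straight to the Grafana pod
kubectl -n linkerd port-forward grafana-5b7d796646-xxxxx 3000:3000 &
# probe the same endpoint the liveness check uses
curl -v http://localhost:3000/api/health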

Giving more resources to the Grafana pod seems to work for me. I'm running with the following resources for the Grafana pod:

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 0
    memory: 512Mi
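
If it helps, a kubectl patch along these lines should apply that without re-rendering the install manifests (assuming the deployment is named grafana in the linkerd namespace, as in stable-2.0.0):

# strategic merge patch; containers are merged by name, so only grafana's resources change
kubectl -n linkerd patch deploy grafana --patch '
spec:
  template:
    spec:
      containers:
      - name: grafana
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 0
            memory: 512Mi
'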


Setting the memory limit to 1024Mi solved the problem for me.

I see this too on a cluster running Linkerd edge-18.11.2 with TLS enabled, but there are no resource limits/requests specified for Grafana there, so it doesn't seem to be resource related in my case.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
