Linkerd2: Grafana pod in CrashLoopBackOff periodically

Created on 9 Oct 2018 · 9 comments · Source: linkerd/linkerd2

Bug Report

What is the issue?

The Grafana pod periodically enters CrashLoopBackOff. If I delete the pod, it starts working again, but eventually it crashes again.

How can it be reproduced?

Running Linkerd stable-2.0.0.
After a period of time, Grafana fails, and I can only get it working again by deleting the pod (see the sketch below).
This has happened consistently across multiple clusters in my demo environment.
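For reference, recovering looks roughly like this (the pod name is a placeholder; the deployment recreates the pod after deletion):

# find the crashing Grafana pod in the control-plane namespace
kubectl -n linkerd get pods
# delete it; the grafana deployment recreates it and it comes back healthy
kubectl -n linkerd delete pod grafana-5b7d796646-xxxxx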

Logs, error output, etc

It looks like the Grafana liveness probe failed:

Events:
  Type     Reason     Age               From                               Message
  ----     ------     ----              ----                               -------
  Warning  Unhealthy  5m (x9 over 6m)   kubelet, aks-nodepool1-34340261-2  Liveness probe failed: Get http://10.244.3.4:3000/api/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I also see these errors from the linkerd-proxy container in the Grafana pod:

ERR! proxy={server=in listen=0.0.0.0:4143 remote=10.244.2.4:36412} linkerd2_proxy::proxy::http::router service error: an error occurred trying to connect: Connection refused (os error 111)
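For anyone trying to reproduce, the events and proxy logs above can be pulled with something like the following (pod name is a placeholder):

# pod events, including the liveness-probe failures
kubectl -n linkerd describe pod grafana-5b7d796646-xxxxx
# logs from the injected proxy sidecar
kubectl -n linkerd logs grafana-5b7d796646-xxxxx -c linkerd-proxy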

linkerd check output

kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[retry] -- The "grafana" pod's "grafana" container is not ready
...etc.

Environment

  • Kubernetes Version: 1.10.7
  • Cluster Environment: AKS
  • Host OS: Ubuntu
  • Linkerd version: stable-2.0.0

Possible solution

Additional context

Labels: area/controller, bug, wontfix

All 9 comments

@chzbrgr71 are you running with TLS turned on?

Yes, this cluster is set up with TLS on. I think the other clusters were as well, but I'm not sure.

Hmmm, I've not been able to replicate this yet with (roughly the commands sketched below):

  • AKS cluster @ 1.10.7, 1.11.3
  • linkerd install --tls=optional
  • emojivoto
  • Viewing deploy/emoji in grafana
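
For completeness, that repro is approximately the following (the emojivoto manifest URL is the standard demo one, included here as an assumption):

# install the control plane with optional TLS
linkerd install --tls=optional | kubectl apply -f -
# deploy the demo app with the proxy injected
curl -sL https://run.linkerd.io/emojivoto.yml | linkerd inject - | kubectl apply -f -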

Any other tips?

I have a potential clue, as I am about to demo this in front of a large room. :-)

I just checked, and the Grafana pod was in CrashLoopBackOff. The difference this time is that I am running load tests against my application. It could be a coincidence, since Grafana itself is not being load tested, but I am seeing this consistently.

There appear to be two separate bugs here.

  • Grafana crashes (for reasons I've not been able to track down yet).
  • Linkerd holds the socket but appears to stop forwarding; liveness checks fail continuously and the pod never becomes healthy again (a quick check is sketched below). See #1762
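
One way to tell the two apart (an assumption on my part: kubectl port-forward traffic enters over loopback, so it should bypass the proxy's inbound iptables redirect) is to probe Grafana's health endpoint directly; if it answers while the kubelet's probe keeps failing, the proxy is the likely culprit. Pod name is a placeholder:

# forward local port 3000 straight to the Grafana pod
kubectl -n linkerd port-forward grafana-5b7d796646-xxxxx 3000:3000 &
# probe the same endpoint the liveness check uses
curl -v http://localhost:3000/api/health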

Giving more resources to the Grafana pod seems to work for me. I'm running with the following resources for the Grafana pod:

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 0
    memory: 512Mi
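
If it helps, a kubectl patch along these lines should apply that without re-rendering the install manifests (assuming the deployment is named grafana in the linkerd namespace, as in stable-2.0.0):

# strategic merge patch; containers are merged by name, so only grafana's resources change
kubectl -n linkerd patch deploy grafana --patch '
spec:
  template:
    spec:
      containers:
      - name: grafana
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 0
            memory: 512Mi
'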


Setting the memory limit to 1024Mi solved the problem for me.

I see this too on a cluster running Linkerd edge-18.11.2 with TLS enabled, but there are no resource limits/requests specified for Grafana there, so it doesn't seem to be resource related in my case.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
