The Grafana pod goes into CrashLoopBackOff periodically. If I delete the pod, it starts working again, but it eventually crashes again.
Running Linkerd stable-2.0.0
After a period of time, Grafana fails and I can only get it working again by deleting the pod.
This has happened consistently with multiple clusters in my demo environment.
Looks like the Grafana liveness probe failed:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 5m (x9 over 6m) kubelet, aks-nodepool1-34340261-2 Liveness probe failed: Get http://10.244.3.4:3000/api/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
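For reference, one way to give Grafana more headroom before the kubelet restarts it is to relax the liveness probe. The following is only a sketch: the endpoint matches the failing probe above, but the timeout and threshold values are assumptions, not the stock Linkerd stable-2.0.0 settings.

```yaml
# Hypothetical liveness probe tuning for the grafana container.
# The /api/health endpoint and port 3000 match the probe failure above;
# timeoutSeconds and failureThreshold here are assumed values, not defaults.
livenessProbe:
  httpGet:
    path: /api/health
    port: 3000
  timeoutSeconds: 5       # the kubelet default is 1s, which a starved pod can miss
  failureThreshold: 5     # tolerate a few slow responses before restarting
```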
I see these errors on the linkerd-proxy container in the grafana pod:
ERR! proxy={server=in listen=0.0.0.0:4143 remote=10.244.2.4:36412} linkerd2_proxy::proxy::http::router service error: an error occurred trying to connect: Connection refused (os error 111)
linkerd check output:
kubernetes-api: can initialize the client..................................[ok]
kubernetes-api: can query the Kubernetes API...............................[ok]
kubernetes-api: is running the minimum Kubernetes API version..............[ok]
linkerd-api: control plane namespace exists................................[ok]
linkerd-api: control plane pods are ready..................................[retry] -- The "grafana" pod's "grafana" container is not ready
...etc.
@chzbrgr71 are you running with TLS turned on?
Yes. This cluster is set up with TLS on. I think the other clusters were as well, but I'm not sure.
Hmmm, I've not been able to replicate this yet with:
linkerd install --tls=optional
Any other tips?
I have a potential clue as I am about to demo this in front of a large room. :-)
I just checked and the Grafana pod was in CrashLoopBackOff. The difference this time is that I am running load tests against my application. It could be a coincidence, since Grafana itself is not being load tested, but I am seeing this consistently.
There appear to be two separate bugs here.
Giving more resources to the Grafana pod seems to work for me. I'm running with the following resources for the Grafana pod:
resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 0
    memory: 512Mi
Setting memory to 1024Mi solved the problem for me.
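As a concrete form of that workaround, a strategic-merge patch like the one below could be merged into the Grafana deployment. This is a sketch: the deployment/container name "grafana" and the namespace are assumptions based on the stable-2.0.0 control plane, so adjust them to match your install.

```yaml
# Hypothetical strategic-merge patch raising Grafana's memory to 1024Mi.
# The container name "grafana" is an assumption from the stable-2.0.0 manifests.
spec:
  template:
    spec:
      containers:
      - name: grafana
        resources:
          limits:
            memory: 1024Mi
          requests:
            memory: 1024Mi
```

Saved as a file, this can be applied with `kubectl patch` against the Grafana deployment in the control plane namespace.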
I see this too on a cluster running Linkerd version edge-18.11.2 with TLS enabled, but there are already no resource limits/requests specified for Grafana, so it doesn't seem to be resource related in my case.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.