**Describe the bug**

When deploying the latest version of this chart, the following error occurs in the cluster-agent deployment:

```
Readiness probe failed: HTTP probe failed with statuscode: 500
```
Probe definition:

```yaml
readinessProbe:
  httpGet:
    path: /ready
    port: 5555
    scheme: HTTP
```
Looks like the /ready endpoint returns 500 instead of 200.
Please change values.yaml to use a different path: '/live' (as the liveness probe does) or '/'.
Both the root '/' and the '/live' endpoint return a 200 status code.
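Until the chart's default is fixed, the probe can be pointed at one of the working endpoints by patching the deployment in place. This is only a sketch: the deployment and container names below are assumptions based on a default `datadog` release, so adjust them to your cluster.

```yaml
# patch-readiness.yaml -- strategic merge patch that switches the
# readiness probe to the /live endpoint, which does return 200.
# Apply with (deployment name is an assumption for a default release):
#   kubectl patch deploy datadog-cluster-agent -p "$(cat patch-readiness.yaml)"
spec:
  template:
    spec:
      containers:
        - name: cluster-agent   # merge key; must match the container name
          readinessProbe:
            httpGet:
              path: /live
              port: 5555
              scheme: HTTP
```

Note that a subsequent `helm upgrade` will overwrite this patch, so it is a stopgap, not a fix.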
**Version of Helm and Kubernetes:**

```console
$ helm version
version.BuildInfo{Version:"v3.0.1", GitCommit:"7c22ef9ce89e0ebeb7125ba2ebf7d421f3e82ffa", GitTreeState:"clean", GoVersion:"go1.13.4"}
$ kubectl version --short
Client Version: v1.18.4
Server Version: v1.16.8-eks-e16311
```
**Which chart:**
stable/datadog
**What happened:**
Deployment of cluster-agent is broken.
**What you expected to happen:**
I expect the cluster-agent deployment probes to pass.
**How to reproduce it (as minimally and precisely as possible):**

```shell
helm3 install datadog stable/datadog
kubectl get deploy
```
**Anything else we need to know:**
I think that's it. Let me know if you have any questions.
I think I'm having this issue as well, I accidentally upgraded the helm chart. Have you been able to recover or work around?
For me, I was able to get the old version back with the `--version` flag on the helm installation:

```shell
helm upgrade --install datadog stable/datadog \
  --namespace logging \
  --version 2.3.2 \
```
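If you don't remember which chart version was installed before the accidental upgrade, `helm history` and `helm rollback` offer an alternative way back to the last working release. A sketch, reusing the release name and namespace from the command above:

```shell
# List prior revisions of the release to find the last healthy one.
helm history datadog --namespace logging

# Roll the release back to, e.g., revision 1.
helm rollback datadog 1 --namespace logging
```

`helm rollback` restores the previous chart version together with the values that were used at that revision.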
Great, 2.3.2 works for me as well.
I can confirm that the chart works up to 2.3.11.
The readiness probe is using the /metrics endpoint instead of /ready:

```yaml
readinessProbe:
  httpGet:
    path: /metrics
    port: 5000
    scheme: HTTP
```
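The status codes the probes actually see can be checked directly by port-forwarding to the cluster-agent pod. A sketch; the pod name is a placeholder, and the port should match whatever the probe definition in your chart version uses:

```shell
# Forward the cluster agent's health port locally
# (substitute the actual pod name from `kubectl get pods`).
kubectl port-forward pod/<cluster-agent-pod> 5555:5555 &

# Compare the endpoints the probes hit; on affected chart versions
# /ready answers 500 while /live and / answer 200.
curl -si http://localhost:5555/ready | head -n 1
curl -si http://localhost:5555/live  | head -n 1
```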
Thanks for the tip! We had the same error in chart 2.3.14, downgrading to 2.3.11 did the trick!
Same issue with 2.3.18 chart :(
Seems like this is fixed by now. Just upgraded to 2.3.35. Image tag 1.7.0 must have carried the fix.
I can confirm the issue is fixed in chart version 2.3.35.
I am getting the same readiness and liveness probe failures with 2.3.35 and 2.3.38 :confused: I downgraded to 2.3.11 again and it's working fine. Here are the cluster agent pod's events with 2.3.35:
```
Events:
  Type     Reason     Age  From                                                       Message
  ----     ------     ---  ----                                                       -------
  Normal   Scheduled  47s  default-scheduler                                          Successfully assigned default/datadog-monitoring-cluster-agent-746db97d4f-f2hvw to gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb
  Normal   Pulled     46s  kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Container image "datadog/cluster-agent:1.7.0" already present on machine
  Normal   Created    46s  kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Created container cluster-agent
  Normal   Started    45s  kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Started container cluster-agent
  Warning  Unhealthy  8s   kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  2s   kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Liveness probe failed: HTTP probe failed with statuscode: 500
```
@jeopard Today I deployed several instances of charts 2.3.35 and 2.3.40 without problems.
Make sure the deployed cluster agent is using the image tag 1.7.0. That version has the fix.
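The image tag a deployment is actually running can be read straight from its pod template. A sketch; the deployment name is an assumption for a default `datadog` release, so adjust it to your release name:

```shell
kubectl get deploy datadog-cluster-agent \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# e.g. datadog/cluster-agent:1.7.0
```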
@jgzurano as you can see from the events I posted, it is using 1.7.0:

```
Container image "datadog/cluster-agent:1.7.0" already present on machine
```
I contacted Datadog support, and they said to just ignore the liveness and readiness probe failures. I'll try that and verify we still have our metrics in Datadog.
https://github.com/DataDog/datadog-agent/pull/5852