Charts: [stable/datadog] cluster-agent readiness probe failed

Created on 27 Jun 2020  路  10Comments  路  Source: helm/charts

Describe the bug
When deploying the latest version of this chart, the following error occurs in the cluster-agent deployment.

Readiness probe failed: HTTP probe failed with statuscode: 500

Probe definition:

  readinessProbe:
    httpGet:
      path: /ready
      port: 5555
      scheme: HTTP

Looks like the /ready endpoint returns 500 instead of 200.

Please change values.yaml to use a different path: '/live' (as the liveness probe does) or '/'.
Root '/' or '/live' endpoint returns 200 status code.

Version of Helm and Kubernetes:

$ helm version
version.BuildInfo{Version:"v3.0.1", GitCommit:"7c22ef9ce89e0ebeb7125ba2ebf7d421f3e82ffa", GitTreeState:"clean", GoVersion:"go1.13.4"}

$ kubectl version --short
Client Version: v1.18.4
Server Version: v1.16.8-eks-e16311

Which chart:
stable/datadog

What happened:
Deployment of cluster-agent is broken.

What you expected to happen:
I expect the cluster-agent deployment probes to pass.

How to reproduce it (as minimally and precisely as possible):

run helm3 install datadog stable/datadog
run kubectl get deploy

Anything else we need to know:
I think that's it. Let me know if you have any questions.

Most helpful comment

All 10 comments

I think I'm having this issue as well, I accidentally upgraded the helm chart. Have you been able to recover or work around?

Edit
For me, I was able to get the old version back with the --version flag on the helm installation

  6 helm upgrade --install datadog stable/datadog  \
  7   --namespace logging      \
  8   --version 2.3.2 \

I think I'm having this issue as well, I accidentally upgraded the helm chart. Have you been able to recover or work around?

Edit
For me, I was able to get the old version back with the --version flag on the helm installation

  6 helm upgrade --install datadog stable/datadog  \
  7   --namespace logging      \
  8   --version 2.3.2 \

Great, 2.3.2 works for me as well.
I can confirm that the chart works up to the 2.3.11.
The readiness probe It's using the /metrics endpoint instead of /ready

  readinessProbe:
    httpGet:
      path: /metrics
      port: 5000
      scheme: HTTP

Thanks for the tip! We had the same error in chart 2.3.14, downgrading to 2.3.11 did the trick!

Same issue with 2.3.18 chart :(

Seems like this is fixed by now. Just upgraded to 2.3.35. Image tag 1.7.0 must have carried the fix.

I can confirm the issue it's fixed in chart version 2.3.35.

I am getting the same probe and readiness check failures with 2.3.35 and 2.3.38 :confused: I downgraded to 2.3.11 again and it's working fine. Here are the cluster agent pod's events with 2.3.35:

Events:
  Type     Reason     Age   From                                                       Message
  ----     ------     ----  ----                                                       -------
  Normal   Scheduled  47s   default-scheduler                                          Successfully assigned default/datadog-monitoring-cluster-agent-746db97d4f-f2hvw to gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb
  Normal   Pulled     46s   kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Container image "datadog/cluster-agent:1.7.0" already present on machine
  Normal   Created    46s   kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Created container cluster-agent
  Normal   Started    45s   kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Started container cluster-agent
  Warning  Unhealthy  8s    kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  Unhealthy  2s    kubelet, gke-main-n1-highcpu-64-preemptible-c1a79477-s5sb  Liveness probe failed: HTTP probe failed with statuscode: 500

@jeopard Today I deployed several instances of chart 2.3.35 and 2.3.40 without problems.
Make sure the deployed cluster agent is using in the image tag 1.7.0. That version has the fix.

@jgzurano as you can see from the logs I sent, it is using 1.7.0

Container image "datadog/cluster-agent:1.7.0" already present on machine

I contacted Datadog support, and they said to just ignore the probe and readiness failures. I'll try that and verify we still have our metrics in the Datadog

Was this page helpful?
0 / 5 - 0 ratings