Output of the info page (if this is a bug)
agent status
$ kubectl exec -it -n datadog datadog-cluster-agent-6d878bb576-pxlhs agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.6.0)
==============================
Status date: 2020-07-07 12:33:38.910956 UTC
Agent start: <no value>
Pid: 1
Go Version: <no value>
Build arch: <no value>
Check Runners: 4
Log Level: WARN
Paths
=====
Config File: /etc/datadog-agent/datadog-cluster.yaml
conf.d: /etc/datadog-agent/conf.d
Clocks
======
System UTC time: 2020-07-07 12:33:38.910956 UTC
Hostnames
=========
ec2-hostname: ip-10-100-186-91.ec2.internal
hostname: i-0062d6365d3cedb39
instance-id: i-0062d6365d3cedb39
socket-fqdn: datadog-cluster-agent-6d878bb576-pxlhs
socket-hostname: datadog-cluster-agent-6d878bb576-pxlhs
hostname provider: aws
unused hostname providers:
configuration/environment: hostname is empty
gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname
Metadata
========
Leader Election
===============
Leader Election Status: Running
Leader Name is: datadog-cluster-agent-6d878bb576-pxlhs
Last Acquisition of the lease: Tue, 07 Jul 2020 12:31:26 UTC
Renewed leadership: Tue, 07 Jul 2020 12:33:26 UTC
Number of leader transitions: 42 transitions
Custom Metrics Server
=====================
ConfigMap name: datadog/datadog-custom-metrics
External Metrics
----------------
Total: 10
Valid: 10
...
Cluster Checks Dispatching
==========================
Status: Leader, serving requests
Active nodes: 16
Check Configurations: 4
- Dispatched: 4
- Unassigned: 0
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
Total Runs: 13
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 61, Total: 1,053
Service Checks: Last Run: 3, Total: 27
Average Execution Time : 1.716s
Last Execution Date : 2020-07-07 12:33:29.000000 UTC
Last Successful Execution Date : 2020-07-07 12:33:29.000000 UTC
=========
Forwarder
=========
Transactions
============
CheckRunsV1: 12
Connections: 0
Containers: 0
Dropped: 0
DroppedOnInput: 0
Events: 0
HostMetadata: 0
IntakeV1: 8
Metadata: 0
Pods: 0
Processes: 0
RTContainers: 0
RTProcesses: 0
Requeued: 0
Retried: 0
RetryQueueSize: 0
Series: 0
ServiceChecks: 0
SketchSeries: 0
Success: 32
TimeseriesV1: 12
==========
Endpoints
==========
https://app.datadoghq.com - API Key ending with:
- 2751d
Describe what happened:
After upgrading the cluster agent image tag from 1.5.2 to 1.6.0, newly started agents begin responding to liveness probes with HTTP 500 about 2 minutes after starting.
Nothing suspicious in agent status or logs.
Additional environment details (Operating System, Cloud provider, etc):
I face the same issue. Healthcheck is failing as per cluster-agent logs:
2020-07-10 08:25:19 UTC | CLUSTER | DEBUG | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [healthcheck]
The status of the pod is not ready:
$ kubectl -n datadog get pod/datadog-datadog-cluster-agent-5bf4686554-vncxn
NAME READY STATUS RESTARTS AGE
datadog-datadog-cluster-agent-5bf4686554-vncxn 0/1 Running 0 20m
This is what I see in the pod events:
Warning Unhealthy 10m (x2 over 11m) kubelet, ip-10-0-15-223.ap-southeast-2.compute.internal Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 87s (x39 over 10m) kubelet, ip-10-0-15-223.ap-southeast-2.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 500
I had the same issue. Updating my probes to use the values in https://github.com/l0k0ms/charts/blob/master/stable/datadog/values.yaml#L472 helped.
I set the readiness probe to localhost:5000/metrics and the liveness probe to localhost:5555/live.
Not sure if it was exactly the right thing to do, but it worked...
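For reference, the probe change described above would look roughly like this in the cluster agent container spec. This is a minimal sketch, assuming plain HTTP probes: only the ports and paths come from the comment, and where the block lives in a Helm values file (for example under a `clusterAgent` key) is an assumption that depends on the chart version. An `httpGet` probe targets the pod IP by default, so no host is set here.

```yaml
# Sketch only: container-level probe settings for the cluster agent.
# Ports and paths are the ones quoted in the comment above; timing fields
# are omitted and left at their Kubernetes defaults.
readinessProbe:
  httpGet:
    path: /metrics   # readiness probe path from the comment
    port: 5000
livenessProbe:
  httpGet:
    path: /live      # liveness probe path from the comment
    port: 5555
```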
I'm facing this issue as well.
+1
I'm facing a similar issue, which I described in #6046.
@mper0003 I saw an issue similar to yours in #5852. I'm not sure if it's applicable to you, but I thought it was worth mentioning.
Upgrading to 1.7.0 fixed the issue for me.
@imranismail Seems to be fixed on my end too. But the complete lack of communication from the @datadog team is concerning.