Datadog-agent: Upgrading from 1.5.2 to 1.6.0 results in HTTP 500s in k8s liveness probe

Created on 7 Jul 2020 · 7 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)


agent status

$ kubectl exec -it -n datadog datadog-cluster-agent-6d878bb576-pxlhs  agent status
Getting the status from the agent.
==============================
Datadog Cluster Agent (v1.6.0)
==============================

  Status date: 2020-07-07 12:33:38.910956 UTC
  Agent start: <no value>
  Pid: 1
  Go Version: <no value>
  Build arch: <no value>
  Check Runners: 4
  Log Level: WARN

  Paths
  =====
    Config File: /etc/datadog-agent/datadog-cluster.yaml
    conf.d: /etc/datadog-agent/conf.d

  Clocks
  ======
    System UTC time: 2020-07-07 12:33:38.910956 UTC

  Hostnames
  =========
    ec2-hostname: ip-10-100-186-91.ec2.internal
    hostname: i-0062d6365d3cedb39
    instance-id: i-0062d6365d3cedb39
    socket-fqdn: datadog-cluster-agent-6d878bb576-pxlhs
    socket-hostname: datadog-cluster-agent-6d878bb576-pxlhs
    hostname provider: aws
    unused hostname providers:
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: status code 404 trying to GET http://169.254.169.254/computeMetadata/v1/instance/hostname

  Metadata
  ========

Leader Election
===============
  Leader Election Status:  Running
  Leader Name is: datadog-cluster-agent-6d878bb576-pxlhs
  Last Acquisition of the lease: Tue, 07 Jul 2020 12:31:26 UTC
  Renewed leadership: Tue, 07 Jul 2020 12:33:26 UTC
  Number of leader transitions: 42 transitions
 Custom Metrics Server
 =====================
   ConfigMap name: datadog/datadog-custom-metrics

   External Metrics
   ----------------
     Total: 10
     Valid: 10

   ...

Cluster Checks Dispatching
==========================
  Status: Leader, serving requests
  Active nodes: 16
  Check Configurations: 4
    - Dispatched: 4
    - Unassigned: 0

=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubernetes_apiserver.d/conf.yaml.default
      Total Runs: 13
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 61, Total: 1,053
      Service Checks: Last Run: 3, Total: 27
      Average Execution Time : 1.716s
      Last Execution Date : 2020-07-07 12:33:29.000000 UTC
      Last Successful Execution Date : 2020-07-07 12:33:29.000000 UTC

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 12
    Connections: 0
    Containers: 0
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 8
    Metadata: 0
    Pods: 0
    Processes: 0
    RTContainers: 0
    RTProcesses: 0
    Requeued: 0
    Retried: 0
    RetryQueueSize: 0
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 32
    TimeseriesV1: 12

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - 2751d

Describe what happened:
After upgrading the cluster agent tag from 1.5.2 to 1.6.0, the newly started agents begin responding to liveness probes with HTTP 500 about 2 minutes after starting.
Nothing suspicious in agent status or logs.

Additional environment details (Operating System, Cloud provider, etc):

  • datadog k8s chart 2.3.26
  • AWS EKS 1.16

All 7 comments

I'm facing the same issue. The health check is failing, according to the cluster-agent logs:

2020-07-10 08:25:19 UTC | CLUSTER | DEBUG | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [healthcheck]

The status of the pod is not ready:

$ kubectl -n datadog get pod/datadog-datadog-cluster-agent-5bf4686554-vncxn
NAME                                                 READY   STATUS      RESTARTS   AGE
datadog-datadog-cluster-agent-5bf4686554-vncxn        0/1     Running        0       20m

This is what I see in the pod events

 Warning  Unhealthy  10m (x2 over 11m)   kubelet, ip-10-0-15-223.ap-southeast-2.compute.internal  Liveness probe failed: HTTP probe failed with statuscode: 500
 Warning  Unhealthy  87s (x39 over 10m)  kubelet, ip-10-0-15-223.ap-southeast-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 500

I had the same issue. Updating to use the values in https://github.com/l0k0ms/charts/blob/master/stable/datadog/values.yaml#L472 helped.

I set the readiness probe to localhost:5000/metrics, and the liveness probe to localhost:5555/live

Not sure if it was exactly the right thing to do, but it worked...
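
For anyone trying the same workaround, here is a minimal sketch of what those two probes might look like on the cluster agent container. Only the ports and paths come from the comment above. Where the fragment goes in your chart values or Deployment manifest is an assumption to check against your chart version, and no timing fields are shown because none were given.

```yaml
# Sketch only: probe settings as described in the comment above.
# Ports and paths are taken from that comment; placement in your
# values.yaml or Deployment spec depends on the chart version (assumption).
livenessProbe:
  httpGet:
    path: /live      # liveness endpoint mentioned above
    port: 5555
readinessProbe:
  httpGet:
    path: /metrics   # readiness endpoint mentioned above
    port: 5000
```

Note that a Kubernetes `httpGet` probe targets the pod IP unless `host` is set, and any response outside the 200-399 range counts as a failure, which is why the HTTP 500 responses mark the pod unhealthy.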

I'm facing this issue as well

+1

I'm facing a similar issue, which I described in #6046

@mper0003 I saw an issue similar to yours in #5852. I'm not sure whether it applies to you, but I thought it was worth mentioning.

Upgrading to 1.7.0 fixed the issue for me

@imranismail Seems to be fixed on my end too. But the complete lack of communication from @datadog team is concerning.
