Datadog-agent: Datadog Agent Forwarder fails liveness probe when new spot instance joins cluster, causing multiple restarts

Created on 15 May 2018 · 14 Comments · Source: DataDog/datadog-agent

Output of the info page (if this is a bug)

==============
Agent (v6.0.0)
==============

  Status date: 2018-05-15 17:28:44.007680 UTC
  Pid: 326
  Python Version: 2.7.14
  Logs:
  Check Runners: 1
  Log Level: WARN

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    System UTC time: 2018-05-15 17:28:44.007680 UTC

  Host Info
  =========
    bootTime: 2018-05-15 17:21:48.000000 UTC
    kernelVersion: 4.4.115-k8s
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 9.3
    procs: 72
    uptime: 414
    virtualizationRole: guest
    virtualizationSystem: xen

  Hostnames
  =========
    ec2-hostname: ip-172-31-99-8.us-west-2.compute.internal
    hostname: i-0820f63f3cb2ffdc0
    instance-id: i-0820f63f3cb2ffdc0
    socket-fqdn: cluster-datadog-w2pc6
    socket-hostname: cluster-datadog-w2pc6

=========
Collector
=========

  Running Checks
  ==============
    No checks have run yet

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  IntakeV1: 1
  RetryQueueSize: 0
  Success: 1

  API Keys status
  ===============
    https://6-0-0-app.agent.datadoghq.com,*************************4d886: API Key valid

==========
Logs-agent
==========

  logs-agent is not running

=========
DogStatsD
=========

  Event: 1

kubectl describe pod output

  Normal   Pulled                 3m               kubelet, ip-172-31-99-8.us-west-2.compute.internal  Successfully pulled image "datadog/agent:6.0.0"
  Warning  Unhealthy              2m               kubelet, ip-172-31-99-8.us-west-2.compute.internal  Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 13 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, forwarder, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Error: found 13 unhealthy components
  Normal   Created    2m (x3 over 3m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Created container
  Warning  Unhealthy  2m (x4 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Liveness probe failed: Agent health: FAIL
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
forwarder
Error: found 1 unhealthy components
  Normal  Killing  2m (x2 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Killing container with id docker://datadog:Container failed liveness probe.. Container will be killed and recreated.
  Normal  Pulled   2m (x2 over 2m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Container image "datadog/agent:6.0.0" already present on machine
  Normal  Started  2m (x3 over 3m)  kubelet, ip-172-31-99-8.us-west-2.compute.internal  Started container

Describe what happened:
We are running our cluster in AWS with master nodes as on-demand instances and worker nodes as spot instances. When we lose a spot node and a new one joins, the Datadog agent restarts repeatedly on the new node, about 6 times, until it runs normally. The restarts happen because the liveness probe fails while the forwarder is reported unhealthy.

Describe what you expected:
The forwarder to be healthy when deployed on the fresh node.

Steps to reproduce the issue:
On a kubernetes cluster running 1.8.7, with worker nodes on spot, and an autoscale group for the worker nodes, delete a worker node, and wait for a new node to join the cluster. Once the new node has joined the cluster, monitor the datadog pod scheduled on the node.

Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: AWS
Kubernetes Version: 1.8.7
Datadog Version: 6.0.0

Source

ihoegen

Most helpful comment

@willyyang @gtrembathcafex I simply updated the timeouts of the liveness probe in the daemon set deployment yaml and I'm not experiencing any issues anymore. I have some very high log-volume applications, so I thought that might influence it. In my case I increased initialDelaySeconds from 15 to 30 and timeoutSeconds from 1 to 5.

tommydejong-zz on 6 Feb 2019
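
For readers who want to try the same change, here is a minimal sketch of what the adjusted probe could look like in the agent container spec of the DaemonSet. It assumes the exec-based probe that calls probe.sh (mentioned later in this thread); the `./probe.sh` command path is an assumption, so check it against your own manifest. Only the two timing values come from the comment above.

```yaml
# Sketch only: fragment of the agent container spec in the DaemonSet.
livenessProbe:
  exec:
    command:
      - ./probe.sh            # assumption: probe script path used by your manifest
  initialDelaySeconds: 30     # raised from 15, per the comment above
  timeoutSeconds: 5           # raised from 1, per the comment above
```

A longer initial delay gives the forwarder more time to become healthy before the first probe, and a longer timeout tolerates a slow health command on a busy node.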

👍8

All 14 comments

Here's the output of kubectl get pods:

NAME                                          READY     STATUS    RESTARTS   AGE
cluster-datadog-7t2qj                         1/1       Running   45         6d
cluster-datadog-kw9hk                         1/1       Running   49         6d
cluster-datadog-pqzrc                         1/1       Running   45         6d
cluster-datadog-vsm55                         1/1       Running   6          7m
cluster-datadog-vxd2z                         1/1       Running   0          14m
cluster-datadog-w84vs                         1/1       Running   6          7m

When I delete an established pod, the replacement pod does not seem to restart once it's created.

The pods that have been up for 6 days are on master nodes, and they seem to restart randomly.

ihoegen on 17 May 2018

Actually it looks like it even does this when a new master node comes up

ihoegen on 18 May 2018

An init container with sleep 300 seems to help mitigate it

ihoegen on 18 May 2018
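
The init-container workaround could be sketched roughly as follows. This is an illustration rather than the exact manifest used in the report: the container name `startup-delay` and the `busybox` image are assumptions, and only the `sleep 300` comes from the comment above.

```yaml
# Sketch only: fragment of the DaemonSet pod template.
spec:
  template:
    spec:
      initContainers:
        - name: startup-delay          # hypothetical name
          image: busybox               # assumption: any image that provides `sleep`
          command: ["sleep", "300"]    # delay agent startup by 5 minutes
```

Because init containers must finish before the main containers start, this postpones the agent (and its liveness probe) until the node has had time to settle. It also delays metric collection on every new node by the same amount.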

Hi @ihoegen ,

Thanks for reaching out!
We are currently working on this issue. In some network setups the forwarder is slow to start up and is still unhealthy at the end of the expected startup time.
We'll keep you updated on our progress. Thank you for your patience!

Best regards

antoinepouille on 29 May 2018

Awesome, thank you. I'll update the team

ihoegen on 31 May 2018

Hey @ihoegen

A quick question: could you please confirm whether you are using a proxy on these agents? If it is the case, we should be shipping a fix for this right now.

Thanks

antoinepouille on 11 Jun 2018

@antoinepouille how would I be able to check?

ihoegen on 11 Jun 2018

@ihoegen If you didn't set it in the config of the agent, you're most likely not using a proxy…
Would you mind opening a support case so that one of our support engineers can help you with more details?

antoinepouille on 14 Jun 2018

Any outcome on this issue?

gtrembathcafex on 18 Dec 2018

hey I'm hitting this issue as well.. any update on this issue?

willyyang on 30 Jan 2019

Hi all,

This issue is old and applies to agent 6.0.0. Recent occurrences of it are likely unrelated to the original report. We changed the status reporting system to solve the forwarder flakiness a while ago, and moved from an exec based mechanism to an http-based one 5 months ago.

If you have seen flaky health checks less than 5 months ago, please open a separate issue.

If you experience healthchecks that are slow to succeed at init, this is likely due to resource limits being too low. The default 256 Mi memory / 200m cpu are good defaults with only the metrics agent enabled, but if you enable traces and/or logs collection these limits likely slow the agent down, making it start late.
If increasing limits doesn't help, please open a separate issue or support ticket as well.

hkaj on 18 Apr 2019
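
As a rough illustration of the advice above, resource limits on the agent container could be raised along these lines. The container name `datadog` is taken from the kubectl events earlier in this issue; the raised values are arbitrary examples, not recommendations from the maintainers.

```yaml
# Sketch only: fragment of the agent container spec in the DaemonSet.
containers:
  - name: datadog
    resources:
      limits:
        memory: 512Mi   # example value; the default cited above is 256Mi
        cpu: 500m       # example value; the default cited above is 200m
```

Suitable values depend on which collectors (traces, logs) are enabled and on the workload of the node.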

> moved from an exec based mechanism to an http-based one 5 months ago

@hkaj Does this mean the daemonset should not be using probe.sh for the liveness check? If so what should the probe be?

Jonnymcc on 11 Dec 2019

probe.sh should still work, but we have experienced some issues with containerd and exec-based probes (not specific to the agent container), so we have moved to an http-based liveness probe as a precaution: https://github.com/DataDog/datadog-agent/blob/1c117b0c6f256ffdf6512c18ccd8eac6e809b295/Dockerfiles/manifests/agent.yaml#L44-L52

hkaj on 12 Dec 2019
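
For orientation, an HTTP-based liveness probe has roughly the following shape. The `/health` path, port `5555`, and the `DD_HEALTH_PORT` variable are not stated in this thread; they are assumptions based on the Datadog manifests of that period, and the timing values are illustrative. Treat the linked agent.yaml as the authoritative version.

```yaml
# Sketch only: fragment of the agent container spec in the DaemonSet.
env:
  - name: DD_HEALTH_PORT        # assumption: enables the agent's HTTP health endpoint
    value: "5555"
livenessProbe:
  httpGet:
    path: /health               # assumption: check against the linked manifest
    port: 5555                  # assumption: must match DD_HEALTH_PORT
  initialDelaySeconds: 15       # illustrative values
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3
```

With `httpGet`, the kubelet queries the agent directly instead of spawning a process inside the container, which avoids the exec-probe issues described above.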
