Output of the info page (if this is a bug)
==============
Agent (v6.0.0)
==============
Status date: 2018-05-15 17:28:44.007680 UTC
Pid: 326
Python Version: 2.7.14
Logs:
Check Runners: 1
Log Level: WARN
Paths
=====
Config File: /etc/datadog-agent/datadog.yaml
conf.d: /etc/datadog-agent/conf.d
checks.d: /etc/datadog-agent/checks.d
Clocks
======
System UTC time: 2018-05-15 17:28:44.007680 UTC
Host Info
=========
bootTime: 2018-05-15 17:21:48.000000 UTC
kernelVersion: 4.4.115-k8s
os: linux
platform: debian
platformFamily: debian
platformVersion: 9.3
procs: 72
uptime: 414
virtualizationRole: guest
virtualizationSystem: xen
Hostnames
=========
ec2-hostname: ip-172-31-99-8.us-west-2.compute.internal
hostname: i-0820f63f3cb2ffdc0
instance-id: i-0820f63f3cb2ffdc0
socket-fqdn: cluster-datadog-w2pc6
socket-hostname: cluster-datadog-w2pc6
=========
Collector
=========
Running Checks
==============
No checks have run yet
========
JMXFetch
========
Initialized checks
==================
no checks
Failed checks
=============
no checks
=========
Forwarder
=========
IntakeV1: 1
RetryQueueSize: 0
Success: 1
API Keys status
===============
https://6-0-0-app.agent.datadoghq.com,*************************4d886: API Key valid
==========
Logs-agent
==========
logs-agent is not running
=========
DogStatsD
=========
Event: 1
kubectl describe pod output
Normal Pulled 3m kubelet, ip-172-31-99-8.us-west-2.compute.internal Successfully pulled image "datadog/agent:6.0.0"
Warning Unhealthy 2m kubelet, ip-172-31-99-8.us-west-2.compute.internal Liveness probe failed: Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 13 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, forwarder, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Error: found 13 unhealthy components
Normal Created 2m (x3 over 3m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Created container
Warning Unhealthy 2m (x4 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Liveness probe failed: Agent health: FAIL
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dockerutil-event-dispatch, dogstatsd-main, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
forwarder
Error: found 1 unhealthy components
Normal Killing 2m (x2 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Killing container with id docker://datadog:Container failed liveness probe.. Container will be killed and recreated.
Normal Pulled 2m (x2 over 2m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Container image "datadog/agent:6.0.0" already present on machine
Normal Started 2m (x3 over 3m) kubelet, ip-172-31-99-8.us-west-2.compute.internal Started container
Describe what happened:
We are running our cluster in AWS, with master nodes as on-demand instances and worker nodes as spot instances. When we lose a spot node and a new one joins, the Datadog agent on that node restarts repeatedly, about 6 times, before it runs normally. The restarts happen because the liveness probe fails, due to the forwarder being unhealthy.
Describe what you expected:
The forwarder to be healthy when deployed on the fresh node.
Steps to reproduce the issue:
On a Kubernetes cluster running 1.8.7, with worker nodes on spot instances in an autoscaling group, delete a worker node and wait for a new node to join the cluster. Once the new node has joined, monitor the Datadog pod scheduled on it.
Additional environment details (Operating System, Cloud provider, etc):
Cloud Provider: AWS
Kubernetes Version: 1.8.7
Datadog Version: 6.0.0
Here's the output of kubectl get pods:
NAME                    READY   STATUS    RESTARTS   AGE
cluster-datadog-7t2qj   1/1     Running   45         6d
cluster-datadog-kw9hk   1/1     Running   49         6d
cluster-datadog-pqzrc   1/1     Running   45         6d
cluster-datadog-vsm55   1/1     Running   6          7m
cluster-datadog-vxd2z   1/1     Running   0          14m
cluster-datadog-w84vs   1/1     Running   6          7m
When I delete an established pod, the replacement pod does not seem to restart once it's created.
The pods that have been up for 6 days are on master nodes, and they seem to restart randomly.
Actually, it looks like this even happens when a new master node comes up.
An init container with sleep 300 seems to help mitigate it.
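A minimal sketch of what such an init container could look like in the DaemonSet pod template. Only the `sleep 300` comes from the comment above; the container name and the `busybox` image are assumptions chosen for illustration.

```yaml
# Fragment of the DaemonSet pod template (spec.template.spec).
# Hypothetical init container: the name and image are illustrative assumptions.
initContainers:
  - name: startup-delay
    image: busybox
    # Delay the start of the agent container by 300 seconds.
    command: ["sleep", "300"]
```

Init containers run to completion before the main containers start, so this postpones the agent (and its liveness probe) by five minutes on every pod start.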
Hi @ihoegen ,
Thanks for reaching out!
We are currently working on this issue. There are different network setups where the forwarder is slow to start up and is still unhealthy at the end of the expected startup time.
We'll keep you updated on our progress. Thank you for your patience!
Best regards
Awesome, thank you. I'll update the team
Hey @ihoegen
A quick question: could you please confirm whether you are using a proxy on these agents? If so, we should be shipping a fix for this right now.
Thanks
@antoinepouille how would I be able to check?
@ihoegen If you didn't set it in the config of the agent, you're most likely not using a proxy…
Would you mind opening a support case so that one of our support engineers can help you with more details?
Any outcome on this issue?
Hey, I'm hitting this issue as well. Any update?
@willyyang @gtrembathcafex I simply updated the timeouts of the liveness probe in the daemon set deployment yaml and I'm not experiencing any issues anymore. I have some very high log-volume applications, so I thought that might influence it. In my case I increased initialDelaySeconds from 15 to 30 and timeoutSeconds from 1 to 5.
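As a sketch, the change described above would look roughly like this in the agent container spec. The two timing values are the ones from the comment; the `./probe.sh` exec command is an assumption based on the probe script mentioned elsewhere in this thread, so keep whatever command your manifest already uses.

```yaml
# Fragment of the agent container spec in the DaemonSet.
livenessProbe:
  exec:
    command:
      - ./probe.sh          # assumption: exec-based probe script; keep your existing command
  initialDelaySeconds: 30   # raised from 15: wait longer before the first probe
  timeoutSeconds: 5         # raised from 1: give each probe more time to answer
```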
Hi all,
This issue is old and applies to agent 6.0.0. Recent occurrences of it are likely unrelated to the original report. We changed the status reporting system to solve the forwarder flakiness a while ago, and moved from an exec based mechanism to an http-based one 5 months ago.
If you have seen flaky health checks less than 5 months ago, please open a separate issue.
If you experience health checks that are slow to succeed at init, this is likely due to resource limits being too low. The default 256Mi memory / 200m CPU are good defaults with only the metrics agent enabled, but if you enable trace and/or log collection, these limits likely slow the agent down, making it start late.
If increasing limits doesn't help, please open a separate issue or support ticket as well.
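For illustration, raising the limits on the agent container could look like the fragment below. The 256Mi / 200m figures are the defaults quoted above; the raised values are hypothetical placeholders, not a recommendation, and should be sized to your own trace and log volume.

```yaml
# Fragment of the agent container spec in the DaemonSet.
resources:
  requests:
    memory: "256Mi"   # default quoted above
    cpu: "200m"       # default quoted above
  limits:
    memory: "512Mi"   # hypothetical raised value
    cpu: "500m"       # hypothetical raised value
```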
> moved from an exec based mechanism to an http-based one 5 months ago
@hkaj Does this mean the daemonset should not be using probe.sh for the liveness check? If so, what should the probe be?
probe.sh should still work, but we have experienced some issues with containerd and exec-based probes (not specific to the agent container), so we have moved to an http-based liveness probe as a precaution: https://github.com/DataDog/datadog-agent/blob/1c117b0c6f256ffdf6512c18ccd8eac6e809b295/Dockerfiles/manifests/agent.yaml#L44-L52
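A sketch of what an http-based liveness probe along those lines could look like. The port, path, environment variable, and timing values below are assumptions written from memory of the agent manifests, not copied from the linked file, so treat the linked manifest as the authoritative version.

```yaml
# Fragment of the agent container spec in the DaemonSet.
# Assumption: the agent serves its health endpoint on the port set by DD_HEALTH_PORT.
env:
  - name: DD_HEALTH_PORT
    value: "5555"
livenessProbe:
  httpGet:
    path: /health         # assumed health endpoint path
    port: 5555            # must match DD_HEALTH_PORT
  initialDelaySeconds: 15
  periodSeconds: 15
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 3
```

With `httpGet`, the kubelet queries the endpoint directly instead of executing a script inside the container, which sidesteps the exec-probe issues with containerd mentioned above.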