Describe what happened:
When an integration check fails, the execution of probe.sh by the docker healthchecks or kubernetes livenessProbes returns a non-zero exit code.
The datadog agent container is then terminated and restarted, potentially resulting in a crash loop.
Describe what you expected:
The datadog agent container should not restart when integration checks fail.
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc):
Workarounds
- Do not use probe.sh in docker healthchecks or kubernetes livenessProbes.
- Use /opt/datadog-agent/bin/agent/agent status instead of probe.sh: this does not resolve the crashloop when a check is failing.
- Increase check_runners/DD_CHECK_RUNNERS (#1805): this does not resolve the crashloop even for checks timing out (tested the 6.3.0-rc3 docker image).
Here is the result of the execution of probe.sh when a check is in error and the datadog agent is in CrashLoopBackOff:
# while true ; do kubectl exec -ti -n monitoring dd-agent-6g9hm ./probe.sh ; sleep 1 ; done
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
[...]
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
Could not reach agent: Get https://localhost:5001/agent/status/health: dial tcp [::1]:5001: getsockopt: connection refused
Make sure the agent is running before requesting the status and contact support if you continue having issues.
Error: Get https://localhost:5001/agent/status/health: dial tcp [::1]:5001: getsockopt: connection refused
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 4 unhealthy components ===
aggregator, dogstatsd-main, forwarder, tagger
Error: found 4 unhealthy components
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 8 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 8 unhealthy components
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 8 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 8 unhealthy components
command terminated with exit code 255
[...]
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Agent health: PASS
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Agent health: FAIL
=== 12 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
collector-queue
Error: found 1 unhealthy components
command terminated with exit code 255
[...]
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
[...]
Probably related to https://github.com/DataDog/datadog-agent/issues/1487 and https://github.com/DataDog/datadog-agent/pull/1805 when the check failure is caused by timeouts.
Looks like the issue is related to falling back on IPv6 [::1] when the datadog agent does not respond fast enough on IPv4 127.0.0.1.
root@dd-agent-d4kwx:/# grep localhost /etc/hosts
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
root@dd-agent-d4kwx:/# curl -ivk https://localhost:5001/agent/status/health
* Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 5001 failed: Connection refused
* Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 5001 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* error setting certificate verify locations, continuing anyway:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
* subject: O=Datadoc, Inc.
* start date: Jun 15 06:47:23 2018 GMT
* expire date: Jun 12 06:47:23 2028 GMT
* issuer: O=Datadoc, Inc.
* SSL certificate verify result: self signed certificate (18), continuing anyway.
> GET /agent/status/health HTTP/1.1
> Host: localhost:5001
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
HTTP/1.1 401 Unauthorized
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
< Www-Authenticate: Bearer realm="Datadog Agent"
Www-Authenticate: Bearer realm="Datadog Agent"
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< Date: Fri, 15 Jun 2018 11:17:24 GMT
Date: Fri, 15 Jun 2018 11:17:24 GMT
< Content-Length: 26
Content-Length: 26
<
no session token provided
* Curl_http_done: called premature == 0
* Connection #0 to host localhost left intact
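A minimal sketch, not from the original report, of how to see which addresses a client may try for `localhost` and in what order. It only relies on Python's standard `socket.getaddrinfo`; the helper name `connect_candidates` is made up for illustration, and port 5001 is simply the one that appears in the output above.

```python
import socket


def connect_candidates(host, port):
    """Return (family, address) pairs in the order the resolver yields them."""
    names = {socket.AF_INET: "IPv4", socket.AF_INET6: "IPv6"}
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return [(names.get(family, str(family)), sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos]
```

Calling `connect_candidates("localhost", 5001)` inside the container should list both `::1` and `127.0.0.1` when /etc/hosts maps `localhost` to both, which would be consistent with the curl trace trying `::1` first. The order depends on the resolver configuration, so treat the result as a diagnostic hint rather than proof of the root cause.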
Hi @pdecat
Thanks for the detailed report, and sorry you're facing this problem.
We're aware of the issue where collector-queue becomes unhealthy in some cases and gets the container killed, but we would prefer solving its root cause rather than stop relying on the probe altogether.
Can you tell us more about this http check? It's not supposed to make the collector queue unhealthy and I didn't manage to reproduce with the instruction you sent and agent 6.2.1. Actually the easiest option to give us all the information would be to send us a flare: https://docs.datadoghq.com/agent/troubleshooting/#send-a-flare
Thanks again.
Hi @hkaj
I agree not using the probe.sh script is just a workaround.
For the time being, I've exposed the agent's trace port:
- containerPort: 8126
  name: traceport
  protocol: TCP
and switched to a tcp probe:
livenessProbe:
  tcpSocket:
    port: traceport
It's not only about http checks: the same goes for tcp checks as long as they time out, for example:
tcp_check.yaml: |
  init_config:
  instances:
    - name: somefailingcheck
      host: 10.0.0.42
      port: 2181
      timeout: 30
      collect_response_time: true
      skip_event: true
@hkaj I've checked internally and it seems related to one of my colleagues' cases (id #144544).
Can you confirm that?
If so, I'll send a flare to this one.
It could be related, yeah. Increasing the number of check runners should help in this specific case, but please send a flare so we can make sure of it.
Flare sent.
Flare received, thanks @pdecat. We'll get back to you as soon as possible.
Please note that with datadog/agent:6.3.0-rc.3 that has 2 check runners by default, more than one TCP check needs to time out to reproduce the issue.
I've added TCP checks that are known to time out for test purposes and sent a new flare, as our original issue had disappeared by the time I sent the first one.
root@dd-agent-24rlh:/# /probe.sh
Agent health: FAIL
=== 12 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
collector-queue
Error: found 1 unhealthy components
root@dd-agent-24rlh:/# echo $?
255
Actually we increased the number of runners to mitigate this issue and related ones, so that's good news 😄.
We're experimenting with adding more. It can have a significant impact on CPU usage (not on average iirc, but the profile is more spiky), but in containers, with proper limits/quotas this issue should not arise. We also have other improvements in the pipe for this problem.
@hkaj, increasing the number of workers is just a workaround, not a long-term solution, isn't it?
Otherwise, what happens when I've got dozens of checks (HTTP, TCP or others) that may all time out at the same time?
I presume the probe will still error unless there are more check runners than blocking checks.
Correct, it's a workaround, and if there are more blocking checks than runners it will still fail.
We plan on increasing the # of check runners to more than 10 (we're still experimenting to find the best compromise here) but we need some improvements around check scheduling before it happens. This part has started and will be available soon.
Another workaround is to reduce your check timeouts to < 15 sec - or whatever you pick for the check run period. This will allow checks to terminate on time, succeeding or otherwise. It's not possible for every check but for TCP/HTTP probes that should be fine.
The long term fix is to change the logic of the collector-queue health check (health check as in internal health check, not the docker health check instruction), but this is waiting for other preliminary work to happen.
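The relationship between check runners, blocking checks and check timeouts described in the comments above can be sketched as a toy model. This is a simplification for illustration only, not the agent's actual scheduler logic; the function names are made up, and the 15 second run period is the value mentioned above.

```python
def potentially_blocking(timeouts_s, run_period_s=15):
    """Timeouts (seconds) of checks that may hold a runner for a full run period or longer."""
    return [t for t in timeouts_s if t >= run_period_s]


def collector_queue_at_risk(check_runners, timeouts_s, run_period_s=15):
    """True when blocking checks could occupy every runner at once (simplified model)."""
    return len(potentially_blocking(timeouts_s, run_period_s)) >= check_runners
```

Under this model, two checks with a 30 second timeout are enough to put two runners at risk, which matches the observation that more than one TCP check needs to time out with the 2 default runners of 6.3.0-rc.3. Lowering every timeout below the run period takes the checks out of the blocking list regardless of the runner count.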
This issue has been killing our team as well. We _have_ to use the liveness probe as we are seeing a condition where after the tagger-docker component goes unhealthy it won't heal unless we cycle the container. But the agent containers are cycling too frequently now due to collector-queue going unhealthy, even after raising DD_CHECK_RUNNERS to 16. Really frustrating. (We also have this open as support ticket 150839)
Hi @wpalmeri, when tagger-docker goes unhealthy, what are the symptoms?
I guess we could implement a custom liveness probe to specifically check its status as a workaround until this issue is resolved.
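A rough sketch of what such a custom probe could look like, assuming the health report has the same text format as the probe.sh output quoted earlier in this thread. The function names and the `WATCHED` set are hypothetical, and how the report text gets fed to the script (and whether a Python interpreter is available in the container) is left open.

```python
import re

# Assumption for this sketch: only these components justify a restart.
WATCHED = frozenset({"tagger-docker"})

_UNHEALTHY_HEADER = re.compile(r"=== \d+ unhealthy components ===")


def unhealthy_components(report):
    """Return the component names listed under the 'unhealthy components' header."""
    lines = [line.strip() for line in report.splitlines()]
    for index, line in enumerate(lines[:-1]):
        if _UNHEALTHY_HEADER.fullmatch(line):
            # The names are on the line right after the header, comma separated.
            return {name.strip() for name in lines[index + 1].split(",") if name.strip()}
    return set()


def probe_exit_code(report, watched=WATCHED):
    """0 keeps the container running, 1 asks for a restart."""
    if "Agent health:" not in report:
        return 1  # no health report at all: the agent could not be reached
    return 1 if unhealthy_components(report) & set(watched) else 0
```

A wrapper could end with `sys.exit(probe_exit_code(sys.stdin.read()))` and receive the report on standard input. With this logic an unhealthy collector-queue alone no longer fails the probe, while a stuck tagger-docker or an unreachable agent still does.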
Once we saw tagger-docker get stuck all the kubernetes metrics stopped being reported including the kubelet_check which was the first thing to page us. Kicking the pod would solve the issue until it happened again. The liveness probe is a bandaid, but now we have frequent datadog agent container restarts, which will still result in short windows of missing metrics. Either way we are stuck. (All of this behavior is new in datadog 6 compared to 5.)
Going back over my logs, I've had hundreds, if not thousands, of these events in the last 2 weeks. This seems like a serious issue with the quality of the v6 agent in a Kubernetes environment. I've applied @pdecat's workaround and changed the liveness check to a tcpSocket check on the APM port, and the rate of liveness probe failures has dropped dramatically since deploying the change into our Kubernetes cluster. It remains to be seen how stable it stays, and it will be interesting to see how it handles the typically idle load in our cluster over the weekend.
Also seeing the same behavior as reported. We're using latest agent docker image, on Google Cloud, Kubernetes version 1.10.4-gke.2.
The workaround we applied was to set the liveness probe to:
livenessProbe:
  exec:
    command:
    - ./probe.sh
  initialDelaySeconds: 30
  periodSeconds: 15
And to increase the check runners to 16:
- name: DD_CHECK_RUNNERS
  value: "16"
I'll monitor it over the coming days to see if it remains stable.
Just wanted to note this issue remains present today. The suggested workarounds in the prior comment, while certainly not ideal, do seem to provide some relief.
Per the title, it does seem to be a mistake to tie the liveness of the datadog pod to the underlying checks configured. One monitored container should not be able to take down the entire monitoring solution should it begin to time out.
We are also seeing this in our agents. Any idea if this is planned to be fixed?
Not using probe.sh had a bad side effect of not detecting issues when the agent failed to consume kubernetes events. This happened for us when GKE master nodes were automatically upgraded.
I've upgraded our agent from 6.4.2 to 6.5.2 and reinstated usage of probe.sh in our liveness probes with DD_CHECK_RUNNERS set to 16 for now.
6.5.2 includes a revamp of the check scheduler, which makes healthcheck issues linked to the collector-queue component far less likely.
Can you try to upgrade and see if everything goes better?
(Increasing the number of check runners can still be a workaround if this continues to occur, but the default has already been moved to 4 in this version.)
Having the same issue with datadog/agent:6.8.3. Case opened with datadog support: Request #191616
Facing the same issues. I've set DD_CHECK_RUNNERS to 16 and increased the timeout and period of the liveness probe, and it is still not enough to prevent the agent from restarting constantly.
Has anyone been able to fix this?
I started an experiment last September to get rid of s6-supervise and run the datadog agent processes in separate containers, each with its own probes, but never got to the end of it.
Anyway, here's what it looked like:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: monitoring
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 33%
  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent
    spec:
      serviceAccountName: datadog-agent
      initContainers:
      - name: datadog-agent-init
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
        - bash
        - -x
        - -c
        - 'for script in /etc/cont-init.d/* ; do bash -x $script ; done ; cp -r /etc/datadog-agent/* /etc/datadog-agent-init/'
        envFrom:
        - configMapRef:
            name: datadog-agent-env
        env:
        - name: DD_KUBERNETES_KUBELET_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        volumeMounts:
        - name: datadog-agent-config
          mountPath: /conf.d
        - name: datadog-agent-etc
          mountPath: /etc/datadog-agent-init
      containers:
      - name: datadog-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
        - /opt/datadog-agent/bin/agent/agent
        - run
        livenessProbe:
          exec:
            command:
            - /opt/datadog-agent/bin/agent/agent
            - status
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
        ports:
        - containerPort: 8125
          name: dogstatsdport
          hostPort: 8125
          protocol: UDP
        envFrom:
        - configMapRef:
            name: datadog-agent-env
        env:
        - name: DD_KUBERNETES_KUBELET_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
        - name: dockersocket
          mountPath: /var/run/docker.sock
        - name: procdir
          mountPath: /host/proc
          readOnly: true
        - name: cgroups
          mountPath: /host/sys/fs/cgroup
          readOnly: true
        - name: datadog-agent-config
          mountPath: /conf.d
        - name: datadog-agent-etc
          mountPath: /etc/datadog-agent
      - name: datadog-trace-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
        - trace-agent
        - --config=/etc/datadog-agent/datadog.yaml
        ports:
        - containerPort: 8126
          name: traceport
          hostPort: 8126
          protocol: TCP
        envFrom:
        - configMapRef:
            name: datadog-trace-agent-env
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          exec:
            command:
            - trace-agent
            - -info
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
        volumeMounts:
        - name: datadog-agent-etc
          mountPath: /etc/datadog-agent
      - name: datadog-process-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
        - process-agent
        - -logtostderr
        - -config=/etc/datadog-agent/datadog.yaml
        envFrom:
        - configMapRef:
            name: datadog-process-agent-env
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          exec:
            command:
            - process-agent
            - -info
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
          failureThreshold: 5
          successThreshold: 1
        volumeMounts:
        - name: dockersocket
          mountPath: /var/run/docker.sock
        - name: procdir
          mountPath: /host/proc
          readOnly: true
        - name: cgroups
          mountPath: /host/sys/fs/cgroup
          readOnly: true
        - name: datadog-agent-etc
          mountPath: /etc/datadog-agent
      volumes:
      - name: dockersocket
        hostPath:
          path: /var/run/docker.sock
      - name: procdir
        hostPath:
          path: /proc
      - name: cgroups
        hostPath:
          path: /sys/fs/cgroup
      - name: datadog-agent-etc
        emptyDir: {}
      - name: datadog-agent-config
        configMap:
          name: datadog-agent-config
NOTE: this is not tested and not production ready
Still experiencing this issue, in my case on a fairly trivial and unloaded EKS cluster. Crazy churn, log spam and a ton of restarts.
@pdecat With the closure of this issue, is it safe to assume it's been resolved in a recent release?
Hi @devillexio,
as per https://github.com/DataDog/datadog-agent/issues/1830#issuecomment-428487318, I've been using the probe.sh for more than a year without issues. I then assumed this could be closed.
Excellent! Thank you for confirming!