Datadog-agent: Datadog agent probe.sh should not depend on integration checks status for healthchecks/livenessProbe

Created on 13 Jun 2018 · 29 Comments · Source: DataDog/datadog-agent

Describe what happened:

When an integration check fails, the execution of probe.sh by the Docker healthchecks or Kubernetes livenessProbes returns a non-zero exit code.
The datadog agent container is then terminated and restarted, potentially resulting in a crash loop.

Describe what you expected:

The datadog agent container should not restart when integration checks fail.

Steps to reproduce the issue:

  • Configure an HTTP check against a URL that does not exist or times out.
  • Witness the datadog agent container being restarted in a loop.

Additional environment details (Operating System, Cloud provider, etc):

  • Official docker image 6.2.1
  • Google Kubernetes Engine 1.8.12.gke0 with Container Optimized OS

Workarounds

  • Do not use probe.sh in docker healthchecks or kubernetes livenessProbes.
  • ~Use /opt/datadog-agent/bin/agent/agent status instead of probe.sh~ This does not resolve the crashloop when a check is failing
  • ~When the check failure is caused by timeouts, increase check_runners/DD_CHECK_RUNNERS (#1805)~ This does not resolve the crashloop even for checks timing out (tested the 6.3.0-rc3 docker image).

All 29 comments

Here is the result of the execution of probe.sh when a check is in error and the datadog agent is in CrashLoopBackOff:

# while true ; do kubectl exec -ti -n monitoring dd-agent-6g9hm ./probe.sh ; sleep 1 ; done
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
[...]
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
Could not reach agent: Get https://localhost:5001/agent/status/health: dial tcp [::1]:5001: getsockopt: connection refused
Make sure the agent is running before requesting the status and contact support if you continue having issues.
Error: Get https://localhost:5001/agent/status/health: dial tcp [::1]:5001: getsockopt: connection refused
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 4 unhealthy components ===
aggregator, dogstatsd-main, forwarder, tagger
Error: found 4 unhealthy components
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 8 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 8 unhealthy components
command terminated with exit code 255
Agent health: FAIL
=== 1 healthy components ===
healthcheck
=== 8 unhealthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, tagger, tagger-docker
Error: found 8 unhealthy components
command terminated with exit code 255
[...]
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Agent health: PASS
=== 13 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, collector-queue, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
Agent health: FAIL
=== 12 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
collector-queue
Error: found 1 unhealthy components
command terminated with exit code 255
[...]
error: unable to upgrade connection: container not found ("dd-agent")
error: unable to upgrade connection: container not found ("dd-agent")
[...]

Probably related to https://github.com/DataDog/datadog-agent/issues/1487 and https://github.com/DataDog/datadog-agent/pull/1805 when the check failure is caused by timeouts.

Looks like the issue is related to falling back on IPv6 [::1] when the datadog agent does not respond fast enough on IPv4 127.0.0.1.

root@dd-agent-d4kwx:/# grep localhost /etc/hosts
127.0.0.1       localhost
::1     localhost ip6-localhost ip6-loopback
root@dd-agent-d4kwx:/# curl -ivk https://localhost:5001/agent/status/health
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 5001 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 5001 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* error setting certificate verify locations, continuing anyway:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: none
* TLSv1.2 (OUT), TLS header, Certificate Status (22):
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Client hello (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: O=Datadoc, Inc.
*  start date: Jun 15 06:47:23 2018 GMT
*  expire date: Jun 12 06:47:23 2028 GMT
*  issuer: O=Datadoc, Inc.
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> GET /agent/status/health HTTP/1.1
> Host: localhost:5001
> User-Agent: curl/7.51.0
> Accept: */*
>
< HTTP/1.1 401 Unauthorized
HTTP/1.1 401 Unauthorized
< Content-Type: text/plain; charset=utf-8
Content-Type: text/plain; charset=utf-8
< Www-Authenticate: Bearer realm="Datadog Agent"
Www-Authenticate: Bearer realm="Datadog Agent"
< X-Content-Type-Options: nosniff
X-Content-Type-Options: nosniff
< Date: Fri, 15 Jun 2018 11:17:24 GMT
Date: Fri, 15 Jun 2018 11:17:24 GMT
< Content-Length: 26
Content-Length: 26

<
no session token provided
* Curl_http_done: called premature == 0
* Connection #0 to host localhost left intact

Hi @pdecat
Thanks for the detailed report, and sorry you're facing this problem.

We're aware of the issue where collector-queue becomes unhealthy in some cases and gets the container killed, but we would prefer to solve its root cause rather than stop relying on the probe altogether.

Can you tell us more about this http check? It's not supposed to make the collector queue unhealthy, and I didn't manage to reproduce the problem with the instructions you sent and agent 6.2.1. Actually, the easiest way to give us all the information would be to send us a flare: https://docs.datadoghq.com/agent/troubleshooting/#send-a-flare
Thanks again.

Hi @hkaj

I agree not using the probe.sh script is just a workaround.

For the time being, I've exposed the agent's trace port:

          - containerPort: 8126
            name: traceport
            protocol: TCP

and switched to a tcp probe:

        livenessProbe:
          tcpSocket:
            port: traceport

It's not only about http checks: the same goes for tcp checks as long as they time out, for example:

      tcp_check.yaml: |
        init_config:

        instances:
          - name: somefailingcheck
            host: 10.0.0.42
            port: 2181
            timeout: 30
            collect_response_time: true
            skip_event: true

@hkaj I've checked internally and it seems related to a case opened by one of my colleagues (id #144544).

Can you confirm that?

If so, I'll send a flare to this one.

It could be related, yeah. Increasing the number of check runners should help in this specific case, but please send a flare so we can make sure of it.

Flare sent.

Flare received, thanks @pdecat. We'll get back to you as soon as possible.

Please note that with datadog/agent:6.3.0-rc.3 that has 2 check runners by default, more than one TCP check needs to time out to reproduce the issue.

I've added TCP checks that are known to time out for test purposes and sent a new flare, as our original issue had disappeared by the time I sent the first one.

root@dd-agent-24rlh:/# /probe.sh 
Agent health: FAIL
=== 12 healthy components ===
ad-autoconfig, ad-configresolver, ad-kubeletlistener, aggregator, dogstatsd-main, forwarder, healthcheck, metadata-agent_checks, metadata-host, metadata-resources, tagger, tagger-docker
=== 1 unhealthy components ===
collector-queue
Error: found 1 unhealthy components

root@dd-agent-24rlh:/# echo $?
255

Actually we increased the amount of runners to mitigate this issue and related ones, so that's good news 😄.

We're experimenting with adding more. It can have a significant impact on CPU usage (not on average iirc, but the profile is more spiky), but in containers, with proper limits/quotas this issue should not arise. We also have other improvements in the pipe for this problem.

@hkaj, increasing the number of workers is just a workaround, not a long-term solution, isn't it?

Otherwise, what happens when I've got dozens of checks (HTTP, TCP or others) that may all time out at the same time?
I presume the probe will still error unless there are more check runners than blocking checks.

Correct, it's a workaround, and if there are more blocking checks than runners it will still fail.
We plan on increasing the # of check runners to more than 10 (we're still experimenting to find the best compromise here) but we need some improvements around check scheduling before it happens. This part has started and will be available soon.

Another workaround is to reduce your check timeouts to < 15 sec - or whatever you pick for the check run period. This will allow checks to terminate on time, succeeding or otherwise. It's not possible for every check but for TCP/HTTP probes that should be fine.
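To spot checks that fall foul of the timeout advice above, a naive line-based scan of a check config can help. This is only a sketch: `timeouts_at_or_above` is a hypothetical helper name, it assumes `timeout:` keys written one per line as in the tcp_check.yaml example earlier in this thread, and it is not a real YAML parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of the agent): reads a check config on stdin
# and reports every `timeout:` value that is >= the given limit in seconds.
# Naive line-based matching; nested or inline YAML forms are not handled.
timeouts_at_or_above() {
    awk -v limit="$1" '
        /^[ \t]*(- )?timeout:/ {
            value = $0
            # Strip indentation, optional list dash and the key itself.
            sub(/^[ \t]*(- )?timeout:[ \t]*/, "", value)
            if (value + 0 >= limit + 0)
                printf "line %d: timeout %s >= %s\n", NR, value, limit
        }
    '
}
```

For example, `timeouts_at_or_above 15 < tcp_check.yaml` would flag the `timeout: 30` instance shown earlier, where 15 stands in for whatever check run period you use.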

The long term fix is to change the logic of the collector-queue health check (health check as in internal health check, not the docker health check instruction), but this is waiting for other preliminary work to happen.

This issue has been killing our team as well. We _have_ to use the liveness probe as we are seeing a condition where after the tagger-docker component goes unhealthy it won't heal unless we cycle the container. But the agent containers are cycling too frequently now due to collector-queue going unhealthy, even after raising DD_CHECK_RUNNERS to 16. Really frustrating. (We also have this open as support ticket 150839)

Hi @wpalmeri, when tagger-docker goes unhealthy, what are the symptoms?

I guess we could implement a custom liveness probe to specifically check its status as a workaround until this issue is resolved.
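A rough sketch of what such a component-specific probe could look like, assuming the health output keeps the format pasted earlier in this thread (an `=== N unhealthy components ===` header followed by one comma-separated line). The function names `component_unhealthy` and `liveness_for` are made up for illustration and are not shipped with the agent.

```shell
#!/bin/sh
# Hypothetical helper: reads health output on stdin and exits 0 only if the
# named component appears on the line after the "unhealthy components" header.
component_unhealthy() {
    awk -v want="$1" '
        /^=== [0-9]+ unhealthy components ===/ { take = 1; next }
        take {
            take = 0
            n = split($0, names, /, */)
            for (i = 1; i <= n; i++) {
                # Trim stray whitespace around each component name.
                gsub(/^[ \t]+|[ \t]+$/, "", names[i])
                if (names[i] == want) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}

# Hypothetical probe wrapper: fails (returns 1) only when the given component
# is unhealthy, ignoring the state of every other component.
liveness_for() {
    if component_unhealthy "$1"; then
        echo "$1 is unhealthy"
        return 1
    fi
    return 0
}
```

Inside the container this could be wired up as `/probe.sh 2>&1 | liveness_for tagger-docker`, so that an unhealthy collector-queue alone would no longer fail the probe. Whether ignoring the other components is acceptable depends on your setup.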

Once we saw tagger-docker get stuck all the kubernetes metrics stopped being reported including the kubelet_check which was the first thing to page us. Kicking the pod would solve the issue until it happened again. The liveness probe is a bandaid, but now we have frequent datadog agent container restarts, which will still result in short windows of missing metrics. Either way we are stuck. (All of this behavior is new in datadog 6 compared to 5.)

Going back over my logs, I've had hundreds, if not thousands, of these events in the last 2 weeks. This seems like a serious issue with the quality of the v6 agent in a Kubernetes environment. I've applied @pdecat's workaround and changed the liveness check to a tcpSocket check on the APM port, and the rate of liveness probe failures has dropped dramatically since deploying the change into our Kubernetes cluster. It remains to be seen how stable it is, and it will be interesting to see how it handles the typically idle load in our cluster over the weekend.

Also seeing the same behavior as reported. We're using latest agent docker image, on Google Cloud, Kubernetes version 1.10.4-gke.2.

The workaround we applied was to set the liveness probe to:

        livenessProbe:
          exec:
            command:
            - ./probe.sh
          initialDelaySeconds: 30
          periodSeconds: 15

And to increase the check runners to 16:

          - name: DD_CHECK_RUNNERS
            value: "16"

I'll monitor it over the coming days to see if it remains stable.

Just wanted to note this issue remains present today. The suggested workarounds in the prior comment, while certainly not ideal, do seem to provide some relief.

Per the title, it does seem to be a mistake to tie the liveness of the datadog pod to the underlying checks configured. One monitored container should not be able to take down the entire monitoring solution should it begin to time out.

We are also seeing this in our agents. Any idea if this is planned to be fixed?

Not using probe.sh had a bad side effect of not detecting issues when the agent failed to consume kubernetes events. This happened for us when GKE master nodes were automatically upgraded.

I've upgraded our agent from 6.4.2 to 6.5.2 and reinstated usage of probe.sh in our liveness probes with DD_CHECK_RUNNERS set to 16 for now.

6.5.2 includes a revamp of the check scheduler, which makes healthcheck issues linked to the collector-queue component far less likely.
Can you try to upgrade and see if everything goes better?

(Increasing the number of check runners can still be a workaround if this continues to occur, but the default has already been raised to 4 in this version.)

having the same issue with datadog/agent:6.8.3. case with datadog support - Request #191616

Facing the same issues. I've increased DD_CHECK_RUNNERS to 16 and increased the timeout and period of the liveness probe, and it is still not enough to prevent the agent from restarting constantly.

Has anyone been able to fix this?

I started an experiment last September to get rid of s6-supervise and run the datadog agent processes in separate containers, each with its own probes, but never got to the end of it.

Anyway, here's what it looked like:

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: datadog-agent
  namespace: monitoring
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 33%

  template:
    metadata:
      labels:
        app: datadog-agent
      name: datadog-agent

    spec:
      serviceAccountName: datadog-agent
      initContainers:
      - name: datadog-agent-init
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
          - bash
          - -x
          - -c
          - 'for script in /etc/cont-init.d/* ; do bash -x $script ; done ; cp -r /etc/datadog-agent/* /etc/datadog-agent-init/'
        envFrom:
        - configMapRef:
            name: datadog-agent-env
        env:
          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
        volumeMounts:
          - name: datadog-agent-config
            mountPath: /conf.d
          - name: datadog-agent-etc
            mountPath: /etc/datadog-agent-init

      containers:
      - name: datadog-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
          - /opt/datadog-agent/bin/agent/agent
          - run
        livenessProbe:
          exec:
            command:
              - /opt/datadog-agent/bin/agent/agent
              - status
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
        ports:
          - containerPort: 8125
            name: dogstatsdport
            hostPort: 8125
            protocol: UDP
        envFrom:
        - configMapRef:
            name: datadog-agent-env
        env:
          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
          - name: datadog-agent-config
            mountPath: /conf.d
          - name: datadog-agent-etc
            mountPath: /etc/datadog-agent

      - name: datadog-trace-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
          - trace-agent
          - --config=/etc/datadog-agent/datadog.yaml
        ports:
          - containerPort: 8126
            name: traceport
            hostPort: 8126
            protocol: TCP
        envFrom:
        - configMapRef:
            name: datadog-trace-agent-env
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          exec:
            command:
              - trace-agent
              - -info
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
        volumeMounts: 
          - name: datadog-agent-etc
            mountPath: /etc/datadog-agent

      - name: datadog-process-agent
        image: datadog/agent:6.4.2
        imagePullPolicy: IfNotPresent
        command:
          - process-agent
          - -logtostderr
          - -config=/etc/datadog-agent/datadog.yaml
        envFrom:
        - configMapRef:
            name: datadog-process-agent-env
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "256Mi"
            cpu: "200m"
        livenessProbe:
          exec:
            command:
              - process-agent
              - -info
          initialDelaySeconds: 15
          periodSeconds: 5
          timeoutSeconds: 30
          failureThreshold: 5
          successThreshold: 1
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
          - name: datadog-agent-etc
            mountPath: /etc/datadog-agent

      volumes:
        - name: dockersocket
          hostPath:
            path: /var/run/docker.sock
        - name: procdir
          hostPath:
            path: /proc
        - name: cgroups
          hostPath:
            path: /sys/fs/cgroup
        - name: datadog-agent-etc
          emptyDir: {}
        - name: datadog-agent-config
          configMap:
            name: datadog-agent-config

NOTE: this is not tested and not production ready

Still experiencing this issue, in my case on a fairly trivial and unloaded EKS cluster. Crazy churn, log spam and a ton of restarts.

@pdecat With the closure of this issue, is it safe to assume it's been resolved in a recent release?

Hi @devillexio,

as per https://github.com/DataDog/datadog-agent/issues/1830#issuecomment-428487318, I've been using the probe.sh for more than a year without issues. I then assumed this could be closed.

Excellent! Thank you for confirming!
