Datadog-agent: Datadog kubernetes agent crashlooping in minikube

Created on 20 Jan 2020 · 6 Comments · Source: DataDog/datadog-agent

I'm running into an issue where the agent is crashlooping; see the agent info and Python stack trace below.

Note that I'm running this all in Tilt.

Output of the info page (if this is a bug)

$ kubectl exec -it datadog-kube-agent-5mrlz agent status
Getting the status from the agent.

===============
Agent (v7.16.1)
===============

  Status date: 2020-01-20 17:23:36.650863 UTC
  Agent start: 2020-01-20 17:22:30.318147 UTC
  Pid: 357
  Go Version: go1.12.9
  Python Version: 3.7.4
  Build arch: amd64
  Check Runners: 4
  Log Level: INFO

  Paths
  =====
    Config File: /etc/datadog-agent/datadog.yaml
    conf.d: /etc/datadog-agent/conf.d
    checks.d: /etc/datadog-agent/checks.d

  Clocks
  ======
    NTP offset: -67.617ms
    System UTC time: 2020-01-20 17:23:36.650863 UTC

  Host Info
  =========
    bootTime: 2020-01-18 01:04:45.000000 UTC
    kernelVersion: 4.19.81
    os: linux
    platform: debian
    platformFamily: debian
    platformVersion: 10.2
    procs: 70
    uptime: 64h17m47s

  Hostnames
  =========
    hostname: minikube
    socket-fqdn: datadog-kube-agent-5mrlz
    socket-hostname: datadog-kube-agent-5mrlz
    hostname provider: container
    unused hostname providers:
      aws: not retrieving hostname from AWS: the host is not an ECS instance, and other providers already retrieve non-default hostnames
      configuration/environment: hostname is empty
      gce: unable to retrieve hostname from GCE: Get http://169.254.169.254/computeMetadata/v1/instance/hostname: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

  Metadata
  ========
    hostname_source: container

=========
Collector
=========

  Running Checks
  ==============

    cpu
    ---
      Instance ID: cpu [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/cpu.d/conf.yaml.default
      Total Runs: 5
      Metric Samples: Last Run: 6, Total: 24
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    disk (2.5.3)
    ------------
      Instance ID: disk:e5dffb8bef24336f [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/disk.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 190, Total: 760
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 27ms


    docker
    ------
      Instance ID: docker [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/docker.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 792, Total: 3,168
      Events: Last Run: 0, Total: 1
      Service Checks: Last Run: 1, Total: 4
      Average Execution Time : 182ms


    file_handle
    -----------
      Instance ID: file_handle [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/file_handle.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 5, Total: 20
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    io
    --
      Instance ID: io [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/io.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 39, Total: 129
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    kubelet (3.4.0)
    ---------------
      Instance ID: kubelet:d884b5186b651429 [ERROR]
      Configuration Source: file:/etc/datadog-agent/conf.d/kubelet.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s
      Error: Unable to detect the kubelet URL automatically.
      Traceback (most recent call last):
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/base.py", line 678, in run
          self.check(instance)
        File "/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/kubelet/kubelet.py", line 184, in check
          raise CheckException("Unable to detect the kubelet URL automatically.")
      datadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.

    load
    ----
      Instance ID: load [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/load.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 6, Total: 24
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    memory
    ------
      Instance ID: memory [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/memory.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 17, Total: 68
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s


    network (1.12.2)
    ----------------
      Instance ID: network:e0204ad63d43c949 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/network.d/conf.yaml.default
      Total Runs: 4
      Metric Samples: Last Run: 31, Total: 124
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 1ms


    ntp
    ---
      Instance ID: ntp:d884b5186b651429 [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/ntp.d/conf.yaml.default
      Total Runs: 1
      Metric Samples: Last Run: 1, Total: 1
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 1
      Average Execution Time : 175ms


    uptime
    ------
      Instance ID: uptime [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/uptime.d/conf.yaml.default
      Total Runs: 5
      Metric Samples: Last Run: 1, Total: 5
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

========
JMXFetch
========

  Initialized checks
  ==================
    no checks

  Failed checks
  =============
    no checks

=========
Forwarder
=========

  Transactions
  ============
    CheckRunsV1: 4
    Dropped: 0
    DroppedOnInput: 0
    Events: 0
    HostMetadata: 0
    IntakeV1: 3
    Metadata: 0
    Requeued: 6
    Retried: 3
    RetryQueueSize: 1
    Series: 0
    ServiceChecks: 0
    SketchSeries: 0
    Success: 0
    TimeseriesV1: 4

  Transaction Errors
  ==================
    Total number: 3
    Errors By Type:

  API Keys status
  ===============
    API key ending with _KEY>: API Key invalid

==========
Endpoints
==========
  https://app.datadoghq.com - API Key ending with:
      - _KEY>

==========
Logs Agent
==========

  Logs Agent is not running

=========
Aggregator
=========
  Checks Metric Sample: 4,408
  Dogstatsd Metric Sample: 11
  Event: 2
  Events Flushed: 2
  Number Of Flushes: 4
  Series Flushed: 2,447
  Service Check: 48
  Service Checks Flushed: 48

=========
DogStatsD
=========
  Event Packets: 0
  Event Parse Errors: 0
  Metric Packets: 10
  Metric Parse Errors: 0
  Service Check Packets: 0
  Service Check Parse Errors: 0
  Udp Bytes: 554
  Udp Packet Reading Errors: 0
  Udp Packets: 11
  Uds Bytes: 0
  Uds Origin Detection Errors: 0
  Uds Packet Reading Errors: 0
  Uds Packets: 0

Describe what happened:

The agent crashes while attempting to get metrics from the API server, with this error:

2020-01-20 17:26:23 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kube_apiserver_metrics: [{"message": "HTTPSConnectionPool(host='192.168.64.3', port=6443): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f51fa1cb590>: Failed to establish a new connection: [Errno 111] Connection refused'))", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 157, in _new_conn\n    (self._dns_host, self.port), self.timeout, **extra_kw\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/connection.py\", line 84, in create_connection\n    raise err\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/connection.py\", line 74, in create_connection\n    sock.connect(sa)\nConnectionRefusedError: [Errno 111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 672, in urlopen\n    chunked=chunked,\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 376, in _make_request\n    self._validate_conn(conn)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 994, in _validate_conn\n    conn.connect()\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 334, in connect\n    conn = self._new_conn()\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connection.py\", line 169, in _new_conn\n    self, \"Failed to establish a new connection: %s\" % e\nurllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f51fa1cb590>: Failed to establish a new connection: [Errno 
111] Connection refused\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/adapters.py\", line 449, in send\n    timeout=timeout\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/connectionpool.py\", line 720, in urlopen\n    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/urllib3/util/retry.py\", line 436, in increment\n    raise MaxRetryError(_pool, url, error or ResponseError(cause))\nurllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='192.168.64.3', port=6443): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f51fa1cb590>: Failed to establish a new connection: [Errno 111] Connection refused'))\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/base.py\", line 678, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/kube_apiserver_metrics/kube_apiserver_metrics.py\", line 70, in check\n    self.process(self.kube_apiserver_config, metric_transformers=self.metric_transformers)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 378, in process\n    for metric in self.scrape_metrics(scraper_config):\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 336, in scrape_metrics\n    response = self.poll(scraper_config)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 536, in poll\n    response = self.send_request(endpoint, 
scraper_config, headers)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/datadog_checks/base/checks/openmetrics/mixins.py\", line 601, in send_request\n    auth=auth,\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/api.py\", line 75, in get\n    return request('get', url, params=params, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/api.py\", line 60, in request\n    return session.request(method=method, url=url, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/sessions.py\", line 533, in request\n    resp = self.send(prep, **send_kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/sessions.py\", line 646, in send\n    r = adapter.send(request, **kwargs)\n  File \"/opt/datadog-agent/embedded/lib/python3.7/site-packages/requests/adapters.py\", line 516, in send\n    raise ConnectionError(e, request=request)\nrequests.exceptions.ConnectionError: HTTPSConnectionPool(host='192.168.64.3', port=6443): Max retries exceeded with url: /metrics (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f51fa1cb590>: Failed to establish a new connection: [Errno 111] Connection refused'))\n"}]
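
The `[Errno 111] Connection refused` above means nothing accepted a TCP connection on 192.168.64.3:6443. A minimal Python sketch for probing which port is actually listening, assuming it is run from somewhere with the same network view as the agent pod; `port_is_open` is a hypothetical helper name, not part of the agent.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises ConnectionRefusedError (an OSError subclass)
        # when nothing is listening on the port, as in the log above.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts alike.
        return False
```

For example, `port_is_open("192.168.64.3", 6443)` and `port_is_open("192.168.64.3", 8443)` would show whether the API server is listening on the port the check uses or on another one; 8443 here is only a candidate taken from the comments below.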

Describe what you expected:

No crashes :)

Steps to reproduce the issue:

Follow the setup at https://app.datadoghq.com/account/settings#agent/kubernetes, add a reference to the YAML to the Tiltfile, and start Tilt.

Additional environment details (Operating System, Cloud provider, etc):

I'm running in minikube 1.6.2 on Darwin 10.5.2 with Kubernetes 1.17.0 on Docker 19.03.5.

All 6 comments

From some reading on this, it looks like port 6443 is the Kubernetes default for the API server, but minikube runs it on 8443. Is there a way to configure which port Datadog uses to reach the API server?

I've done some debugging on this. The issue is that the Datadog agent assumes port 6443 rather than pulling the full URI of the API server from k8s. This breaks minikube since minikube uses 8443 as the default port; I suspect this breaks other environments as well (for example, I think GKE's API server uses 443). I've not looked at the agent code to see if this is the case, but it aligns with the observed behavior in this issue.
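
To illustrate what "pulling the full URI" could look like, here is a small Python sketch that keeps whatever port the API server URL carries and only falls back to 6443 when none is given. This is not the agent's code; `apiserver_metrics_url` and its `fallback_port` parameter are hypothetical names, and the input is assumed to be something like the `server` field of a kubeconfig cluster entry.

```python
from urllib.parse import urlsplit


def apiserver_metrics_url(server_url: str, fallback_port: int = 6443) -> str:
    """Build the /metrics URL, keeping the port given in server_url."""
    parts = urlsplit(server_url)
    if not parts.hostname:
        raise ValueError(f"no host found in {server_url!r}")
    # Use the explicit port when present; only then fall back to the default.
    port = parts.port if parts.port is not None else fallback_port
    # IPv6 literals need brackets when recombined with a port.
    host = f"[{parts.hostname}]" if ":" in parts.hostname else parts.hostname
    return f"{parts.scheme}://{host}:{port}/metrics"
```

With a minikube-style URL such as `https://192.168.64.3:8443`, this yields a `/metrics` URL on 8443 rather than on the assumed 6443.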

I think I have made some progress.
First, you need to start minikube with --apiserver-port=6443. Then, you need your datadog-agent.yaml to not verify SSL:

$ git diff datadog-agent.yaml
diff --git a/services/datadog-agent/datadog-agent.yaml b/services/datadog-agent/datadog-agent.yaml
index 223878f..01af25a 100644
--- a/services/datadog-agent/datadog-agent.yaml
+++ b/services/datadog-agent/datadog-agent.yaml
@@ -48,6 +48,8 @@ spec:
                 fieldPath: status.hostIP
           - name: DD_APM_ENABLED
             value: "true"
+          - name: DD_KUBELET_TLS_VERIFY
+            value: "false"
         resources:
           requests:
             memory: "256Mi"
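
The env change in the diff can also be applied programmatically, for example when a script post-processes the manifest. A rough Python sketch, assuming the container spec has already been loaded into a dict (e.g. by a YAML parser); `ensure_env` is a hypothetical helper, not part of any Datadog tooling.

```python
def ensure_env(container: dict, name: str, value: str) -> dict:
    """Set an env var on a container spec dict, adding it if missing."""
    env = container.setdefault("env", [])
    for entry in env:
        if entry.get("name") == name:
            # Overwrite an existing entry instead of appending a duplicate.
            entry["value"] = value
            entry.pop("valueFrom", None)
            return container
    env.append({"name": name, "value": value})
    return container


# Example container spec mirroring the context lines of the diff.
agent = {"name": "datadog-agent", "env": [{"name": "DD_APM_ENABLED", "value": "true"}]}
ensure_env(agent, "DD_KUBELET_TLS_VERIFY", "false")
```

The value is the string `"false"`, as in the diff, because Kubernetes env values must be strings.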

Finally, you can't use minikube start/stop. You need to fully delete the minikube cluster, or the Datadog agent loses the ability to do name resolution. I don't like this solution since it doesn't identify the real problem: minikube start and stop should not break name resolution in the pods running in minikube, but they do. This may be a minikube issue, I dunno.

The Datadog bug here is that the agent does not parse the full URL (including the port) from the API server's configuration; instead, it assumes port 6443.

I am seeing this too. It would be nice if the agent could auto-discover the URL as mentioned above, or at the very least provide an environment variable override.

I am also running into this issue. Having to fully delete the minikube cluster and then reconfigure everything is a bit of a hassle.

The 6443 port can be changed by following https://docs.datadoghq.com/agent/kubernetes/integrations/?tab=configmap#configuration

Goal: set the minikube port (8443) in prometheus_url in the Autodiscovery template for the kube-apiserver identifier.

(I tried to replace 6443 with %%port%%, but it didn't work: 2021-03-22 18:06:11 UTC | CORE | WARN | (pkg/autodiscovery/autoconfig.go:537 in resolveTemplateForService) | error resolving template kube_apiserver_metrics for service docker://e4d6acef548dcb12d3a7990aa94f05bc7c149b7250ed425a105d5960a7cbf822: no port found for container docker://e4d6acef548dcb12d3a7990aa94f05bc7c149b7250ed425a105d5960a7cbf822 - ignoring it)

One way to do that:

$ cat autodiscovery-kube-apiserver.configmap.yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: kube-apiserver-config-map
  namespace: default
data:
  kube-apiserver-config: |-
    ad_identifiers:
      - kube-apiserver
    init_config:
    instances:

        ## @param prometheus_url - string - required
        ## The URL where your application metrics are exposed by Prometheus.
        #
      - prometheus_url: https://%%host%%:8443/metrics

        ## @param tags - list of strings - optional
        ## List of tags to attach to every metric, event and service check emitted by this integration.
        ##
        ## Learn more about tagging: https://docs.datadoghq.com/tagging/
        #
        tags:
          - apiserver:%%host%%
$ diff -u datadog-agent-logs.original.yaml datadog-agent-logs.yaml
--- datadog-agent-logs.original.yaml    2021-03-22 18:11:39.433275825 +0000
+++ datadog-agent-logs.yaml     2021-03-22 18:05:50.854362578 +0000
@@ -63,6 +53,8 @@
                   fieldPath: status.hostIP
             - name: KUBERNETES
               value: "yes"
+            - name: DD_KUBELET_TLS_VERIFY
+              value: "false"
             - name: DD_AC_EXCLUDE
               value: "name:datadog-agent"
             - name: DOCKER_HOST
@@ -120,6 +112,8 @@
               mountPath: /var/lib/docker/containers
               mountPropagation: None
               readOnly: true
+            - name: kube-apiserver-auto-config
+              mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/
           livenessProbe:
             failureThreshold: 6
             httpGet:
@@ -266,6 +260,12 @@
         - hostPath:
             path: /var/lib/docker/containers
           name: logdockercontainerpath
+        - name: kube-apiserver-auto-config
+          configMap:
+            name: kube-apiserver-config-map
+            items:
+            - key: kube-apiserver-config
+              path: auto_conf.yaml
       tolerations:
       affinity: {}
       serviceAccountName: "datadog-agent"
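
Since `%%port%%` did not resolve here, the port has to be written into the template by hand. If the same template is needed for clusters on different ports, it could be generated instead. A minimal Python sketch under that assumption; `render_auto_conf` is a hypothetical helper, and the template body is a trimmed copy of the ConfigMap data above.

```python
AUTO_CONF_TEMPLATE = """\
ad_identifiers:
  - kube-apiserver
init_config:
instances:
  - prometheus_url: https://%%host%%:{port}/metrics
    tags:
      - apiserver:%%host%%
"""


def render_auto_conf(port: int) -> str:
    """Return auto_conf.yaml content with the API server port filled in."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    # str.format only touches {port}; the %%host%% template variable is left
    # as-is for the agent's Autodiscovery to resolve.
    return AUTO_CONF_TEMPLATE.format(port=port)
```

`render_auto_conf(8443)` produces the minikube variant, and the result can be placed under the `kube-apiserver-config` key of the ConfigMap.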