Datadog-agent: Failure to connect to kubelet

Created on 17 Oct 2018 · 12 Comments · Source: DataDog/datadog-agent

Hi,

I am getting logs in Datadog; however, the agent logs on Kubernetes show the following errors:

[ TRACE ] trace-agent exited with code 0, disabling
[ AGENT ] 2018-10-17 08:18:24 UTC | WARN | (datadog_agent.go:149 in LogMessage) | (base.py:212) | DEPRECATION NOTICE: device_name is deprecated, please use a device: tag in the tags list instead
[ AGENT ] 2018-10-17 08:18:26 UTC | ERROR | (kubeutil.go:50 in GetKubeletConnectionInfo) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
[ AGENT ] 2018-10-17 08:18:26 UTC | ERROR | (runner.go:289 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/checks/base.py\", line 352, in run\n self.check(copy.deepcopy(self.instances[0]))\n File \"/opt/datadog-agent/embedded/lib/python2.7/site-packages/datadog_checks/kubelet/kubelet.py\", line 107, in check\n raise CheckException(\"Unable to detect the kubelet URL automatically.\")\nCheckException: Unable to detect the kubelet URL automatically.\n"}]
[ AGENT ] 2018-10-17 08:18:28 UTC | ERROR | (autoconfig.go:604 in collect) | Unable to collect configurations from provider Kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet

image:
  repository: datadog/agent
  tag: 6.4.2
  pullPolicy: IfNotPresent

All 12 comments

👍
Horrible version upgrade, how did that go untested? :(
Going back to 6.5.1.

@midN Will 6.5.1 fix this issue?

@Aiqbal1234 Yes, going back to 6.5.1 fixed it for us.

I've got the same issue, but reverting to 6.5.1 does not solve the problem.

@Aiqbal1234 I managed to work around this. When I upgraded my cluster from 1.10 to 1.11, the kubelet checks started failing with the same message (EKS platform V2, k8s 1.11), so I don't think the agent version is entirely the problem.

What I did to resolve it was:

  1. Deployed the requisite kube-state-metrics.
  2. Pulled all of my Datadog DaemonSet YAMLs. Even though previous deployments of the same YAMLs had worked, I suspect that using the latest version of the agent affected metric collection from the cluster after I upgraded. I grabbed the RBAC-related details from links on this page and then created a manifest using the example.
  3. Followed a similar process for the Cluster Agent.

@fontanese - Could you confirm whether you only had issues reaching the kubelet, or were you unable to reach the API server as well?
From the log above, I'd need to confirm that the right env var was set in the DaemonSet.
It should be this:

          - name: DD_KUBERNETES_KUBELET_HOST
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP

This gets the node IP from the downward API.
Now, unless the error log is the same, it could also be a permission issue (tied to the RBAC and the service account token used), which _might_ have been solved during the upgrade.

Thank you for sharing here, and apologies for the headache.
Best,
.C

@CharlyF FYI - That won't work for anyone running in AWS EKS. The AWS CNI plugin assigns every pod its own private IP.

With Kubernetes deployed in AWS, you are going to have to reach Datadog through the service.

👋 Thanks for weighing in!
I have been using EKS for some time now and using the downward API to reach the kubelet from the pod of the Agent has been working well.
Looking at the systemd unit of the kubelet I see:

# systemctl cat kubelet.service --no-pager
# /etc/systemd/system/kubelet.service
[Unit]
Description=Kubernetes Kubelet
Documentation=https://github.com/kubernetes/kubernetes
After=docker.service
Requires=docker.service

[Service]
ExecStart=/usr/bin/kubelet \
  --address=0.0.0.0 \
[...]
# /etc/systemd/system/kubelet.service.d/10-kubelet-args.conf
[Service]
Environment='KUBELET_ARGS=--node-ip=172.31.40.137 --cluster-dns=10.100.0.10 --pod-infra-container-image=602401143452.dkr.ecr.us-east-2.amazonaws.com/eks/pause-amd64:3.1'

So it is listening on all interfaces, and it is advertising the node IP 172.31.40.137 to the API server.
Now, in the agent's pod I have:

root@datadog-agent-b6b6v:/# env | grep DD_KUBERNETES_KUBELET_HOST
DD_KUBERNETES_KUBELET_HOST=172.31.40.137

Which matches.

Then, we have no trouble reaching it:

root@datadog-agent-b6b6v:/# curl $DD_KUBERNETES_KUBELET_HOST:10255/healthz
ok
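
The same check can be scripted when curl is not available in the container. This is only a minimal sketch: the function name kubelet_healthz is hypothetical, and it assumes the kubelet read-only port (10255) serves plain HTTP as in the curl call above.

```python
import os
import urllib.error
import urllib.request


def kubelet_healthz(host, port=10255, timeout=2.0):
    """Return (ok, detail) for an HTTP GET on the kubelet /healthz endpoint."""
    url = f"http://{host}:{port}/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace").strip()
            return resp.status == 200, body
    except (urllib.error.URLError, OSError) as exc:
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False, str(exc)


if __name__ == "__main__":
    # Same variable the agent pod receives from the downward API.
    host = os.environ.get("DD_KUBERNETES_KUBELET_HOST")
    if not host:
        print("DD_KUBERNETES_KUBELET_HOST is not set")
    else:
        print(kubelet_healthz(host))
```

A False result with a "connection refused" detail would point at the port being closed rather than at the env var being wrong.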

Now, it seems like you are referring to communication with the agent, which I think is a different topic from the one in this issue.
Using a service to communicate with the agent is indeed a solution. For local communication, though, we recommend using the downward API in your app and sending traces or DogStatsD payloads to the appropriate ports (8126 and 8125, respectively). You also need to open those ports on the agent (hostPort: 8126 or 8125 in the DaemonSet).
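
As a rough illustration of the app side, here is a sketch that sends one DogStatsD counter over UDP to the node-local agent. It assumes the app pod exposes the node IP (status.hostIP from the downward API) in an environment variable named DD_AGENT_HOST; that variable name, the function send_counter, and the metric name are assumptions for the example, not something taken from this thread.

```python
import os
import socket


def send_counter(name, value=1, host=None, port=8125):
    """Send one DogStatsD counter datagram over UDP and return the payload."""
    # Assumed: DD_AGENT_HOST holds status.hostIP injected via the downward API.
    host = host or os.environ.get("DD_AGENT_HOST", "127.0.0.1")
    payload = f"{name}:{value}|c".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    send_counter("example.requests")
```

UDP is fire-and-forget, so a successful send does not prove the agent received the metric; the hostPort still has to be open on the DaemonSet.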

Let me know if this makes sense.
Best,
.C

Hi,
root@datadog-agent-58r6p:/# curl $DD_KUBERNETES_KUBELET_HOST:10255/healthz
curl: (7) Failed to connect to 10.24.78.64 port 10255: Connection refused

What should I do?

2020-06-23 09:45:37 UTC | CORE | INFO | (pkg/logs/input/container/launcher.go:38 in NewLauncher) | Could not setup the kubernetes launcher: temporary failure in kubeutil, will retry later: cannot set a valid kubelet host: cannot connect to kubelet using any of the given hosts: [10.240.0.5] [aks-agentpool-11552782-vmss000001.internal.cloudapp.net. aks-agentpool-11552782-vmss000001], Errors: [Get https://10.240.0.5:10250/pods: x509: cannot validate certificate for 10.240.0.5 because it doesn't contain any IP SANs Get https://aks-agentpool-11552782-vmss000001.internal.cloudapp.net.:10250/pods: x509: certificate is valid for aks-agentpool-11552782-vmss000001, not aks-agentpool-11552782-vmss000001.internal.cloudapp.net. Get https://aks-agentpool-11552782-vmss000001:10250/pods: x509: certificate signed by unknown authority cannot connect: http: "Get http://10.240.0.5:10255/: dial tcp 10.240.0.5:10255: connect: connection refused" cannot connect: http: "Get http://aks-agentpool-11552782-vmss000001.internal.cloudapp.net.:10255/: dial tcp 10.240.0.5:10255: connect: connection refused" cannot connect: http: "Get http://aks-agentpool-11552782-vmss000001:10255/: dial tcp 10.240.0.5:10255: connect: connection refused"]

2020-06-23 09:47:14 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 820, in run\n self.check(instance)\n File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 291, in check\n raise CheckException(\"Unable to detect the kubelet URL automatically.\")\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.\n"}]
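
The long kubeutil message earlier in this comment mixes two kinds of failure, which is easy to miss in a single line. A small sketch like the following, with hypothetical names and regular expressions written only against the wording of that message, can bucket the failures by port and kind:

```python
import re
from collections import Counter

# Patterns follow the wording of the kubeutil message quoted in this thread.
_X509 = re.compile(r"Get https://[^\s/]+:(\d+)/\S*: x509:")
_REFUSED = re.compile(r"dial tcp [^\s:]+:(\d+): connect: connection refused")


def summarize_kubelet_errors(message):
    """Count (port, kind) pairs found in a 'cannot connect to kubelet' message."""
    counts = Counter()
    for port in _X509.findall(message):
        counts[(int(port), "x509")] += 1
    for port in _REFUSED.findall(message):
        counts[(int(port), "refused")] += 1
    return dict(counts)


if __name__ == "__main__":
    # Shortened excerpt of the message above.
    sample = (
        "Get https://10.240.0.5:10250/pods: x509: cannot validate certificate "
        "for 10.240.0.5 because it doesn't contain any IP SANs "
        'cannot connect: http: "Get http://10.240.0.5:10255/: '
        'dial tcp 10.240.0.5:10255: connect: connection refused"'
    )
    print(summarize_kubelet_errors(sample))
```

A split like this suggests that the HTTPS port (10250) answers but its certificate cannot be verified, while the read-only HTTP port (10255) is closed, so the two problems may need separate fixes.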

Also seeing this now. Weird thing is that it started about 6 days ago, on a new AKS cluster which had been running for about 2 weeks and has produced data ingested in Datadog (mainly logs).

2020-09-12T20:26:26.839013998Z 2020-09-12 20:26:26 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 827, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 297, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically.\")\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.\n"}]
2020-09-12T20:26:34.630932898Z 2020-09-12 20:26:34 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:26:41.839154350Z 2020-09-12 20:26:41 UTC | CORE | ERROR | (pkg/collector/python/kubeutil.go:38 in getConnections) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:26:41.839184152Z 2020-09-12 20:26:41 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 827, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 297, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically.\")\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.\n"}]
2020-09-12T20:26:44.630902490Z 2020-09-12 20:26:44 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:26:54.630977789Z 2020-09-12 20:26:54 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:26:56.838370937Z 2020-09-12 20:26:56 UTC | CORE | ERROR | (pkg/collector/python/kubeutil.go:38 in getConnections) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:26:56.838969278Z 2020-09-12 20:26:56 UTC | CORE | ERROR | (pkg/collector/runner/runner.go:292 in work) | Error running check kubelet: [{"message": "Unable to detect the kubelet URL automatically.", "traceback": "Traceback (most recent call last):\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/base/checks/base.py\", line 827, in run\n    self.check(instance)\n  File \"/opt/datadog-agent/embedded/lib/python3.8/site-packages/datadog_checks/kubelet/kubelet.py\", line 297, in check\n    raise CheckException(\"Unable to detect the kubelet URL automatically.\")\ndatadog_checks.base.errors.CheckException: Unable to detect the kubelet URL automatically.\n"}]
2020-09-12T20:27:04.632353674Z 2020-09-12 20:27:04 UTC | CORE | ERROR | (pkg/autodiscovery/config_poller.go:123 in collect) | Unable to collect configurations from provider kubernetes: temporary failure in kubeutil, will retry later: try delay not elapsed yet
2020-09-12T20:27:11.838674496Z 2020-09-12 20:27:11 UTC | CORE | ERROR | (pkg/collector/python/kubeutil.go:38 in getConnections) | connection to kubelet failed: temporary failure in kubeutil, will retry later: try delay not elapsed yet