Datadog-agent: Datadog cluster agent leader election failure causes metrics apis to be unavailable

Created on 28 Jun 2019 · 14 Comments · Source: DataDog/datadog-agent

Leader election failed, after which metrics became unavailable. The pod and the container inside it kept running fine. I had to restart the pods to recover.
Output of the info page (if this is a bug)

  Leader Election
  ===============
    Leader Election Status:  Failing
    Error: permanent failure in apiserver: retry number exceeded


  Custom Metrics Server
  =====================
    Error: permanent failure in apiserver: retry number exceeded


=========
Collector
=========

  Running Checks
  ==============

    kubernetes_apiserver
    --------------------
      Instance ID: kubernetes_apiserver [WARNING]
      Total Runs: 222
      Metric Samples: Last Run: 0, Total: 0
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 0, Total: 0
      Average Execution Time : 0s

      Warning: Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.

Describe what happened:
Deployed datadog-cluster-agent with 3 replicas

Describe what you expected:
It should recover automatically; if there is a permanent failure, it should bring down the pod.

Additional environment details (Operating System, Cloud provider, etc):
Running K8s 1.12

Source

ankilosaurus

Most helpful comment

👋 They are two different deployments indeed (or to be precise, the Datadog Agent is a Daemonset and the Datadog Cluster Agent is a Deployment).

The purpose of the Datadog Cluster Agent is to avoid having all the Datadog Agents hit the APIServer to collect cluster level metadata, for instance. Thus, if you have a large cluster (or if you want to use features that require the Datadog Cluster Agent), you need to disable the Leader Election on the Datadog Agents and only enable it on the Datadog Cluster Agent.
In large clusters (> ~500 nodes) we recommend having 3 replicas of the Datadog Cluster Agent, and this is when the Leader Election kicks in.

If this can shed more light, see the RBAC for the Datadog Agent that we recommend if you are using the Datadog Cluster Agent. You will notice that only the Kubelet's APIs are authorised. This is because the Datadog Agents should not be communicating with the APIServer anymore.

I hope this clarifies the set up, we will get to clarifying the doc soon!

Best,
.C

CharlyF on 22 Jul 2019

👍3

All 14 comments

Logs
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:01 UTC | ERROR | (cmd/cluster-agent/custommetrics/server.go:72 in makeProviderOrDie) | Could not build API Client: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:01 UTC | ERROR | (cmd/cluster-agent/app/app.go:209 in start) | Could not start the custom metrics API server: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | ERROR | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:126 in init) | Not Able to set up a client for the Leader Election: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | INFO | (pkg/collector/runner/runner.go:324 in work) | Done running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:16 UTC | INFO | (pkg/forwarder/transaction.go:193 in Process) | Successfully posted payload to "https://1-2-0-app.agent.datadoghq.com/intake/?api_key=******************************", the agent will only log transaction success every 20 transactions
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | INFO | (pkg/collector/runner/runner.go:324 in work) | Done running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:32 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:42 UTC | ERROR | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:126 in init) | Not Able to set up a client for the Leader Election: temporary failure in apiserver, will retry later: check resources failed: event collection: "Get https://10.231.0.1:443/api/v1/events?limit=1&timeout=10s&timeoutSeconds=10: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:42 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
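
For anyone triaging similar output, here is a minimal Python sketch that splits entries of the "timestamp | LEVEL | (file:line in func) | message" shape seen above and flags whether a message mentions a temporary or a permanent apiserver failure. The names LOG_RE, LogEntry, parse_line and classify are hypothetical helpers written for this thread, not part of the Datadog agent, and the pattern is inferred only from the lines shown here.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the log excerpt in this thread (an assumption, not an
# official format spec): "<timestamp> UTC | LEVEL | (source) | message".
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC) \| "
    r"(?P<level>[A-Z]+) \| "
    r"\((?P<source>[^)]*)\) \| "
    r"(?P<message>.*)"
)


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    source: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; any prefix before the timestamp (pod name) is skipped."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    return LogEntry(m["ts"], m["level"], m["source"], m["message"].strip())


def classify(entry: LogEntry) -> Optional[str]:
    """Return 'temporary', 'permanent', or None based on the message wording."""
    if "temporary failure in apiserver" in entry.message:
        return "temporary"
    if "permanent failure in apiserver" in entry.message:
        return "permanent"
    return None
```

Running parse_line over each line and counting the classify results would show whether the agent is still retrying ("temporary") or has given up ("permanent"), which is the state reported on the info page.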

ankilosaurus on 28 Jun 2019

Hey @aksgithub, thanks for opening up an issue. Could you open up a ticket with our support team and send a flare from one of your node agents and from your cluster agent? That should let us investigate the issue further. As a side note, I'd also recommend upgrading the cluster-agent to its latest version to make sure we're not troubleshooting any issues we've previously patched.

DylanLovesCoffee on 28 Jun 2019

Created a support ticket: https://help.datadoghq.com/hc/en-us/requests/235995

Note that we are running the cluster agent as an external metrics provider only, meaning we have not integrated the node agent with the cluster agent. So the node agent is still fetching API server metrics independently.

ankilosaurus on 2 Jul 2019

Updating to version 1.3.1. FYI, we are running k8s version 1.12. Let me know if I need to update datadog_cluster_agent to a specific version.

ankilosaurus on 3 Jul 2019

Hey @aksgithub - In order to serve external metrics the cluster agent needs to run the leader election by itself (meaning the option should not be activated in the daemonset of the node agent).
This is simply because we store the metrics fetched from Datadog in a configmap and only the leader can write to it. This allows running the cluster agent in HA for larger clusters.

Can you make sure this is the case and let us know how that goes?
Best,
.C

CharlyF on 3 Jul 2019

🎉1

@CharlyF yes, it's running in leader election mode and that's why there was a failure.

ankilosaurus on 3 Jul 2019

I was having an issue with my HPA backed by external metrics. Disabling leader election in the node agents fixed my issue.

Leader election was happening between the Cluster Agent and Node Agents.
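
One plausible reading of this, pieced together from the comments in this thread (only the leader writes the external metrics, and node agents were joining the same election), can be illustrated with a toy Python model. This is not the agent's implementation: Candidate, SharedLock and refresh_external_metrics are hypothetical names, and the "first to acquire wins" lock is a simplification of a real election.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    is_cluster_agent: bool


class SharedLock:
    """Toy stand-in for a single election lock: the first to acquire holds it."""

    def __init__(self) -> None:
        self.holder: Optional[Candidate] = None

    def try_acquire(self, candidate: Candidate) -> bool:
        if self.holder is None:
            self.holder = candidate
        return self.holder == candidate


def refresh_external_metrics(
    lock: SharedLock, candidates: List[Candidate], store: Dict[str, str]
) -> Dict[str, str]:
    """Every candidate tries the lock; only a cluster agent that leads writes."""
    for candidate in candidates:
        if lock.try_acquire(candidate) and candidate.is_cluster_agent:
            store["external_metrics"] = f"written by {candidate.name}"
    return store
```

In this model, if a node agent takes the lock first, no cluster agent ever becomes leader and the store stays empty, which would match an HPA seeing no external metrics. With node agents out of the election, a cluster agent always leads and the store gets written.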

lmansur on 9 Jul 2019

👍2

@CharlyF I've run into a similar problem today. I had leader election enabled on both cluster-agent and datadog-agent, but the HPA couldn't get external metrics that way. After setting leader election to false for the datadog-agent, everything started to work. Is there any special reason for this?

I was convinced that leader election meant that only one agent would send metrics to Datadog by aggregating them. If I set leader election to false for the Datadog agents, do I lose this functionality?

andreamaruccia on 19 Jul 2019

Hello @andreamaruccia, indeed when using the Cluster Agent, you should not be running the leader election on the node agents.
The leader election is used for the collection of kubernetes events, the scheduling of cluster level checks and the collection of metrics from Datadog for the External Metrics Provider.

Each agent sends the metrics it collects; we do not send data to one agent to aggregate them, so no worries on this front.

Thank you all for the feedback. I added a card in our backlog to improve the lifecycle and doc if the leader election is used on the node agent while the cluster agent is running.

CharlyF on 19 Jul 2019

👍1

@CharlyF, should leader election in the cluster agent be independent of leader election in the Datadog agent, as those two are separate deployments?

ankilosaurus on 22 Jul 2019

👋 They are two different deployments indeed (or to be precise, the Datadog Agent is a Daemonset and the Datadog Cluster Agent is a Deployment).

I hope this clarifies the set up, we will get to clarifying the doc soon!

Best,
.C

CharlyF on 22 Jul 2019

👍3

@CharlyF, thank you for responding quickly. I do understand the purpose behind the Datadog cluster agent. However, we are currently running the Datadog cluster agent (3 replicas with leader election enabled) only for HPA purposes.
Meaning, we have not integrated the cluster agent with the node agent (e.g. DD_CLUSTER_AGENT_ENABLED is not set in the node agent yaml). So node agents are still pulling metrics from the apiserver and reporting to Datadog.

So that brings up one question: it seems we have leader election running in both deployments. Does that imply that both are reporting API server metrics to Datadog? Will there be double reporting?

ankilosaurus on 23 Jul 2019

@CharlyF Can you please answer my question above?

On a side note, is there a health check endpoint I can rely on to define a liveness/readiness probe for the Datadog cluster agent?

ankilosaurus on 2 Aug 2019

👋 Apologies for the delay!
This is not supported - If you use the Cluster Agent you should not have the leader election running on the node agents. This will yield unexpected results.
We added a disclaimer in the doc for the time being but queued up some work to be more resilient to this configuration error.

As for the liveness/readiness probe, you can set one on the port 443 of the cluster agent to ensure that the external metrics api is serving (if enabled). You can also use a probe similar to the one we have for the agent but using the agent status for instance.
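
As a rough illustration of the status-based option, here is a minimal Python sketch that inspects status text of the shape shown in the info page at the top of this issue and turns a "Failing" leader election into a non-zero exit code. The function names are hypothetical, how the status text is obtained is left to the caller, and "Failing" is the only status value this thread shows, so any other value (or a missing section) is treated as healthy here by assumption.

```python
import re


def leader_election_failing(status_text: str) -> bool:
    """True if the text contains 'Leader Election Status:' followed by 'Failing'."""
    match = re.search(r"Leader Election Status:\s*(\S+)", status_text)
    return match is not None and match.group(1).lower() == "failing"


def probe_exit_code(status_text: str) -> int:
    """Exit code for an exec-style probe: 1 when the election is failing, else 0."""
    return 1 if leader_election_failing(status_text) else 0
```

A wrapper script could read the status output on stdin and call sys.exit(probe_exit_code(text)), so that a pod stuck in the permanent failure state gets restarted instead of requiring a manual restart.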

I am going to close this as the initial issue is solved. Feel free to reach out to our solutions team if you have any further questions!

Best,
.C

CharlyF on 2 Aug 2019

👍1
