Leader election failed, and metrics then became unavailable. The pod and the container inside it were still running fine, so I had to restart the pods to recover.
Output of the info page (if this is a bug)
Leader Election
===============
Leader Election Status: Failing
Error: permanent failure in apiserver: retry number exceeded
Custom Metrics Server
=====================
Error: permanent failure in apiserver: retry number exceeded
=========
Collector
=========
Running Checks
==============
kubernetes_apiserver
--------------------
Instance ID: kubernetes_apiserver [WARNING]
Total Runs: 222
Metric Samples: Last Run: 0, Total: 0
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 0s
Warning: Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
Describe what happened:
Deployed datadog-cluster-agent with 3 replicas
Describe what you expected:
It should recover automatically. If the failure is permanent, it should bring down the pod.
Additional environment details (Operating System, Cloud provider, etc):
Running K8s 1.12
Logs
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:01 UTC | ERROR | (cmd/cluster-agent/custommetrics/server.go:72 in makeProviderOrDie) | Could not build API Client: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:01 UTC | ERROR | (cmd/cluster-agent/app/app.go:209 in start) | Could not start the custom metrics API server: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | ERROR | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:126 in init) | Not Able to set up a client for the Leader Election: temporary failure in apiserver, will retry later: try delay not elapsed yet
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:02 UTC | INFO | (pkg/collector/runner/runner.go:324 in work) | Done running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:16 UTC | INFO | (pkg/forwarder/transaction.go:193 in Process) | Successfully posted payload to "https://1-2-0-app.agent.datadoghq.com/intake/?api_key=******************************", the agent will only log transaction success every 20 transactions
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:17 UTC | INFO | (pkg/collector/runner/runner.go:324 in work) | Done running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:32 UTC | INFO | (pkg/collector/runner/runner.go:258 in work) | Running check kubernetes_apiserver
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:42 UTC | ERROR | (pkg/util/kubernetes/apiserver/leaderelection/leaderelection.go:126 in init) | Not Able to set up a client for the Leader Election: temporary failure in apiserver, will retry later: check resources failed: event collection: "Get https://10.231.0.1:443/api/v1/events?limit=1&timeout=10s&timeoutSeconds=10: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
datadog-cluster-agent-64d9f6b66b-6lq5s datadog-cluster-agent 2019-06-28 17:32:42 UTC | WARN | (pkg/collector/corechecks/checkbase.go:106 in Warn) | Failed to instantiate the Leader Elector. Not running the Kubernetes API Server check or collecting Kubernetes Events.
Hey @aksgithub, thanks for opening up an issue. Could you open up a ticket with our support team and send a flare from one of your node agents and from your cluster agent? That should let us investigate the issue further. As a side note, I'd also recommend upgrading the cluster-agent to its latest version to make sure we're not troubleshooting any issues we've previously patched.
Created a support ticket: https://help.datadoghq.com/hc/en-us/requests/235995.
Note that we are running the cluster agent as an external metrics provider only, meaning we have not integrated the node agent with the cluster agent. So the node agent is still fetching API server metrics independently.
Updating to version 1.3.1. FYI, we are running k8s version 1.12. Let me know if I need to update datadog_cluster_agent to a specific version.
Hey @aksgithub - In order to serve external metrics, the cluster agent needs to run the leader election by itself (meaning the option should not be activated in the daemonset of the node agent).
This is simply because we store the metrics fetched from Datadog in a configmap, and only the leader can write to it. This makes it possible to run the cluster agent in HA for larger clusters.
Can you make sure this is the case and let us know how that goes?
Best,
.C
@CharlyF yes, it's running in leader election mode and that's why there was a failure.
I was having an issue with my HPA backed by external metrics. Disabling leader election in the node agents fixed my issue.
Leader election was happening between the Cluster Agent and Node Agents.
@CharlyF I've run into a similar problem today. I had leader election enabled on both cluster-agent and datadog-agent, but the HPA couldn't get external metrics that way. After setting leader election to false for the datadog-agent, everything started to work. Is there any special reason for this?
I was convinced that leader election meant that only one agent would send metrics to Datadog by aggregating them. If I set leader election to false for the datadog-agents, do I lose this functionality?
Hello @andreamaruccia, indeed when using the Cluster Agent, you should not be running the leader election on the node agents.
The leader election is used for the collection of kubernetes events, the scheduling of cluster level checks and the collection of metrics from Datadog for the External Metrics Provider.
Each agent sends the metrics it collects itself; we do not forward data to a single agent for aggregation, so no worries on this front.
Thank you all for the feedback. I added a card to our backlog to improve the lifecycle and the doc for the case where leader election is used on the node agent while the cluster agent is running.
@CharlyF, should leader election in the cluster agent be independent of leader election in the datadog agent, since those two are separate deployments?
👋 They are two different deployments indeed (or to be precise, the Datadog Agent is a DaemonSet and the Datadog Cluster Agent is a Deployment).
The purpose of the Datadog Cluster Agent is to avoid having all the Datadog Agents hit the APIServer to collect cluster-level metadata, for instance. Thus, if you have a large cluster (or if you want to use features that require the Datadog Cluster Agent), you need to disable the Leader Election on the Datadog Agents and only enable it on the Datadog Cluster Agent.
In large clusters (> ~500 nodes) we recommend having 3 replicas of the Datadog Cluster Agent, and this is when the Leader Election kicks in.
If this can shed more light, see the RBAC for the Datadog Agent that we recommend if you are using the Datadog Cluster Agent. You will notice that only the Kubelet's APIs are authorised. This is because the Datadog Agents should not be communicating with the APIServer anymore.
I hope this clarifies the setup; we will get to clarifying the doc soon!
Best,
.C
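As an illustration of the setup described in the comment above, the relevant environment variables might look like the sketch below. It assumes that the `DD_LEADER_ELECTION` environment variable controls leader election in both the node Agent and the Cluster Agent, and the container names `agent` and `cluster-agent` are placeholders; check both against your own manifests and the documentation for your Agent version.

```yaml
# Node Agent DaemonSet (fragment): leader election disabled.
# The container name is an assumption; adapt it to your manifest.
spec:
  template:
    spec:
      containers:
        - name: agent
          env:
            - name: DD_LEADER_ELECTION
              value: "false"
---
# Cluster Agent Deployment (fragment): leader election enabled,
# so that a single replica out of the three acts as leader.
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: cluster-agent
          env:
            - name: DD_LEADER_ELECTION
              value: "true"
```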
@CharlyF, thank you for responding quickly. I do understand the purpose of the datadog cluster agent. However, we are currently running the datadog cluster agent (3 replicas with leader election enabled) only for HPA purposes.
Meaning, we have not integrated the cluster agent with the node agent (e.g. DD_CLUSTER_AGENT_ENABLED is not set in the node agent yaml). So the node agents are still pulling metrics from the apiserver and reporting to Datadog.
That raises one question: it seems we have leader election running in both deployments. Does that imply that both are reporting API server metrics to Datadog? Will there be double reporting?
@CharlyF Can you please answer my question above?
On a side note, is there a health check endpoint I can rely on to define a liveness/readiness probe for the datadog cluster agent?
👋 Apologies for the delay!
This is not supported - if you use the Cluster Agent, you should not have the leader election running on the node agents. This will yield unexpected results.
We added a disclaimer in the doc for the time being, and queued up some work to be more resilient to this configuration error.
As for the liveness/readiness probe, you can set one on port 443 of the cluster agent to ensure that the external metrics API is serving (if enabled). You can also use a probe similar to the one we have for the agent, based on the agent status for instance.
I am going to close this as the initial issue is solved - feel free to reach out to our solutions team if you have any further questions!
Best,
.C
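The port-443 probe suggested in the comment above could be sketched as the fragment below. It uses a plain TCP check; the timing values and the container name `cluster-agent` are illustrative assumptions, and the port only applies if the external metrics API is enabled and served on 443 in your deployment.

```yaml
# Cluster Agent Deployment (fragment): TCP probes on the external metrics port.
# Timings are illustrative assumptions, not recommended values.
spec:
  template:
    spec:
      containers:
        - name: cluster-agent
          livenessProbe:
            tcpSocket:
              port: 443
            initialDelaySeconds: 15
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            tcpSocket:
              port: 443
            periodSeconds: 10
```

With a liveness probe like this, a replica whose external metrics server never starts listening would be restarted by the kubelet instead of needing a manual pod restart.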