Tell us about your request
By default, the CoreDNS deployment does not have an anti-affinity rule to schedule CoreDNS pods across AZs (fault domains). Please add this to the CoreDNS deployment.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Newly created EKS clusters may run both CoreDNS pods on the same node, which makes the service less available.
Are you currently working around this issue?
You can change the CoreDNS deployment after the EKS cluster is created, but it would be nice to have this as the default. A sketch of such a rule is shown below.
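For reference, a minimal sketch of the kind of rule being requested, which could be applied to the `coredns` deployment in `kube-system` with `kubectl patch`. The `k8s-app=kube-dns` label is an assumption based on the default EKS CoreDNS deployment, and using a preferred (soft) rule keeps it from blocking scheduling on single-zone or very small clusters:

```yaml
# Sketch only: soft (preferred) anti-affinity so CoreDNS replicas spread
# across zones when possible, without blocking scheduling when they cannot.
# Assumes the default EKS label k8s-app=kube-dns on the CoreDNS pods.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: k8s-app
                      operator: In
                      values:
                        - kube-dns
                # topology.kubernetes.io/zone on newer clusters;
                # failure-domain.beta.kubernetes.io/zone on older ones.
                # Use kubernetes.io/hostname to spread across nodes instead.
                topologyKey: topology.kubernetes.io/zone
```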
(Anti-)affinity is currently pretty flawed in Kubernetes, in that scheduling becomes highly inefficient. It becomes a scaling problem on large clusters. It also impacts cluster-autoscaler performance. That's why you often see base kube-system services like DNS not using it.
On small clusters it is still useful, and it would be a valuable option to add to eksctl. But it probably ought to be optional, because it is not desirable for all users given the performance problems it causes.
A more efficient approach is to use the descheduler to remove duplicates from a node:
https://github.com/kubernetes-incubator/descheduler#removeduplicates
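As a rough sketch, enabling that strategy uses the policy file format from the README linked above (v1alpha1 at the time of writing); the descheduler then evicts co-located duplicates so the scheduler can place them on other nodes:

```yaml
# Sketch: descheduler policy that evicts duplicate pods of the same
# ReplicaSet/Deployment found together on a single node.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
```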
Another simple approach is to run more replicas of services like CoreDNS, so you are almost guaranteed coverage of multiple failure zones. The scheduler has a kind of round-robin behavior that makes starting on the same node unlikely, and rescheduled pods (e.g. from the descheduler) almost always move to a different node (see the sketch after this comment).
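As a sketch, that is a one-line change to the Deployment (EKS ships two replicas by default; the exact count to pick is a judgment call, e.g. via `kubectl -n kube-system scale deployment coredns --replicas=4`):

```yaml
# Sketch: with more replicas than availability zones, having every replica
# land in a single zone becomes statistically unlikely (though not impossible
# without an anti-affinity rule or the descheduler).
spec:
  replicas: 4
```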
https://github.com/kubernetes/kubernetes/issues/72479
"Pod Affinity/AntiAffinity is causing us a bunch of pain from performance/scalability perspective. Although scheduling SIG has done tremendous effort to improve it, it's still the very significant (most?) factor for scheduling throughput."
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-service-level-objectives-for-cluster-autoscaler
"Please note that the above performance can be achieved only if NO pod affinity and anti-affinity is used on any of the pods. Unfortunately, the current implementation of the affinity predicate in scheduler is about 3 orders of magnitude slower than for all other predicates combined, and it makes CA hardly usable on big clusters."
I have a very small cluster, around a dozen nodes, and both instances of CoreDNS are currently running on the same EC2 node. Perhaps I am just unlucky today?!
I was following the documentation at https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html. A possible solution would be to update that documentation to describe adding anti-affinity (at your own risk). The stable Helm chart has optional (anti-)affinity.
[unrelated to this request] I think there should be many more replicas of CoreDNS, especially during upgrades. The current upstream version of CoreDNS is also much newer than the one EKS installs.
We have run into the same problem, and since we are running a very small cluster (3 nodes), increasing the number of CoreDNS replicas to ensure multiple-AZ coverage did not seem like a good solution. We have patched the CoreDNS deployment in a pipeline to include the anti-affinity rules.
We are also suffering from this issue. Currently we have to patch the CoreDNS deployment manually as a post-installation step, which is pretty dirty...