Tell us about your request
By default, the CoreDNS deployment does not have an anti-affinity rule to schedule CoreDNS pods across AZs (fault domains). Please add this to the CoreDNS deployment.
Which service(s) is this request for?
EKS
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
Newly created EKS clusters may run both CoreDNS pods on the same node, which makes the service less available.
Are you currently working around this issue?
You can change the CoreDNS deployment after the EKS cluster is created, but it would be nice to have this as the default. A sketch of such a rule is shown below.
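For reference, a minimal sketch of the kind of rule being requested, which could be applied to the `coredns` deployment in `kube-system` with `kubectl patch`. The `k8s-app=kube-dns` label is an assumption based on the default EKS CoreDNS deployment, and using a preferred (soft) rule keeps it from blocking scheduling on single-zone or very small clusters:

```yaml
# Sketch only: soft (preferred) anti-affinity so CoreDNS replicas spread
# across zones when possible, without blocking scheduling when they cannot.
# Assumes the default EKS label k8s-app=kube-dns on the CoreDNS pods.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: k8s-app
                      operator: In
                      values:
                        - kube-dns
                # topology.kubernetes.io/zone on newer clusters;
                # failure-domain.beta.kubernetes.io/zone on older ones.
                # Use kubernetes.io/hostname to spread across nodes instead.
                topologyKey: topology.kubernetes.io/zone
```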
(Anti-)affinity is currently pretty flawed in Kubernetes, in that scheduling becomes highly inefficient. It becomes a scaling problem on large clusters. It also impacts cluster-autoscaler performance. That's why you often see base kube-system services like DNS not using it.
On small clusters it is still useful, and it would be a valuable option to add to eksctl. But it probably ought to be optional, because it is not desirable for all users given the performance problems it causes.
A more efficient approach is to use the descheduler to remove duplicates from a node:
https://github.com/kubernetes-incubator/descheduler#removeduplicates
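As a rough sketch, enabling that strategy uses the policy file format from the README linked above (v1alpha1 at the time of writing); the descheduler then evicts co-located duplicates so the scheduler can place them on other nodes:

```yaml
# Sketch: descheduler policy that evicts duplicate pods of the same
# ReplicaSet/Deployment found together on a single node.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
```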
Another simple approach is to run more replicas of services like CoreDNS, so you are almost guaranteed coverage of multiple failure zones. The scheduler has a kind of round-robin behavior that makes starting on the same node unlikely, and rescheduled pods (e.g. from the descheduler) almost always move to a different node (see the sketch after this comment).
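As a sketch, that is a one-line change to the Deployment (EKS ships two replicas by default; the exact count to pick is a judgment call, e.g. via `kubectl -n kube-system scale deployment coredns --replicas=4`):

```yaml
# Sketch: with more replicas than availability zones, having every replica
# land in a single zone becomes statistically unlikely (though not impossible
# without an anti-affinity rule or the descheduler).
spec:
  replicas: 4
```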
https://github.com/kubernetes/kubernetes/issues/72479
"Pod Affinity/AntiAffinity is causing us a bunch of pain from performance/scalability perspective. Although scheduling SIG has done tremendous effort to improve it, it's still the very significant (most?) factor for scheduling throughput."
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-service-level-objectives-for-cluster-autoscaler
"Please note that the above performance can be achieved only if NO pod affinity and anti-affinity is used on any of the pods. Unfortunately, the current implementation of the affinity predicate in scheduler is about 3 orders of magnitude slower than for all other predicates combined, and it makes CA hardly usable on big clusters."
I have a very small cluster, around a dozen nodes, and both instances of CoreDNS are currently running on the same EC2 node. Perhaps I am just unlucky today?!
I was following the documentation at https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html. A possible solution would be to update that documentation to describe adding anti-affinity (at your own risk). The stable Helm chart has optional (anti-)affinity.
[unrelated to this request] I think there should be many more replicas of CoreDNS, especially during upgrades. The current upstream version of CoreDNS is also much newer than the one EKS installs.
We have run into the same problem, and since we are running a very small cluster (3 nodes), increasing the number of CoreDNS replicas to ensure multiple-AZ coverage did not seem like a good solution. We have patched the CoreDNS deployment in a pipeline to include the anti-affinity rules.
We are also suffering from this issue. Currently we have to patch the CoreDNS deployment manually as a post-installation step, which is pretty dirty...