Kubeadm: DNS service runs entirely on a single node

Created on 9 Jul 2019 · 25 comments · Source: kubernetes/kubeadm

BUG REPORT

Versions

kubeadm version (use kubeadm version):
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:37:41Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    Brightbox
  • OS (e.g. from /etc/os-release):
    Ubuntu 18.04.2
  • Kernel (e.g. uname -a):
    Linux srv-uq9br 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Others:

What happened?

The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.

What you expected to happen?

When there is more than one node in the cluster the pods should be spread across the nodes.

How to reproduce it (as minimally and precisely as possible)?

  • Install a kubeadm cluster with CoreDNS running and one or more additional worker or master nodes
  • Run kubectl -n kube-system get pods --selector=k8s-app=kube-dns -o wide

Anything else we need to know?

Labels: area/ecosystem, priority/awaiting-more-evidence, sig/cloud-provider, sig/scheduling

All 25 comments

I've looked at the other tickets on this and the problems with anti-affinity and the like. I'd suggest that perhaps something as simple as kubeadm looking at the pod list for critical services after a join, and kicking a pod out if it sees them all on the same node, would do the trick.

kubectl -n kube-system rollout restart deployment coredns seems to do the trick

The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.

i just tested this with a couple of scenarios using k8s 1.16.0-alpha (i don't think anything related changed between 1.15 and this version):

A) on a cluster with 3 CP / 3 worker nodes:

the coredns pods got deployed on different workers.
force removing one of the worker nodes resulted in a coredns pod being created on another worker immediately.

B) on a cluster with a 3 CP nodes without workers:

the coredns pods got deployed on the same primary control-plane.
removing the control plane node resulted in both coredns pods being scheduled immediately on the next control-plane node.

The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.

i cannot seem to reproduce the problem you are describing.
if there are workers in the cluster, coredns should tolerate them because of:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/addons/dns/manifests.go#L241-L245

are there missing details in the report? anything special about the cluster - how many CP / worker nodes?

pod eviction timeout

you can also technically modify this value from the default of 5m.
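
for reference, the flag lives on the kube-controller-manager; with kubeadm it can be set through the ClusterConfiguration extraArgs, e.g. (just a sketch for the v1.15 config API - the 30s value is only an example, not a recommendation):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    # maps to --pod-eviction-timeout on the kube-controller-manager (default 5m0s)
    pod-eviction-timeout: "30s"

and passed via kubeadm init --config <file>.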

I bring up the master first, then add in worker nodes, then add in master CP replicas. The CoreDNS replicas get scheduled on the initial master as it is being configured.

I'm pulling down the Brightbox cloud-controller and running that deployment prior to joining the worker nodes, so it's possible that --cloud-provider=external is interfering, given the delay in the worker nodes being marked as ready.

The CNI is calico and I'm applying that prior to joining the workers.

However from the Terraform traces it all appears to run in parallel.

The test cluster is 3 worker nodes and 3 CP nodes

@kubernetes/sig-cluster-lifecycle

technically if you just have the single primary CP for a while, both coredns pods will just land on it.
i'm not aware of a k8s mechanism that will allow, say, one of the coredns pods to auto-move to another node once more nodes are up. maybe some clever play with taints/tolerations can be done to only have one replica on the primary CP while the other replica waits for another node...

you can technically just remove those pods and the ReplicaSet behind the deployment will/should reconcile them onto the available workers.

but something odd is going on here, because even if the primary CP is removed after all nodes are up, the pods should still schedule elsewhere right away without waiting for --pod-eviction-timeout.

so it's possible that --cloud-provider=external is interfering

it could be cloud provider specific and we don't have test signal for such scenarios.

in any case i don't think that's a kubeadm bug, but possibly something for sig scheduling / sig cloud-provider to investigate.

on the kubeadm side, we can consider it an issue that the coredns replicas can land on the same node, yet i cannot reproduce reconciliation issues related to that - i.e. the service remains up.

The DNS service remains up, but it fails to resolve anything:

$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    172.30.0.10
Address 1: 172.30.0.10

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
  • are the coredns pods reporting errors?
  • is this a permanent breakage or does it happen for a period of time?

i don't see how this relates to:

When there is more than one node in the cluster the pods should be spread across the nodes.

When the control plane node with both coredns pods on it goes down, the coredns service is rendered unusable until the other two control plane nodes decide to terminate the pods and restart the deployment - seemingly after the pod eviction timeout expires.

If one of the pods is reallocated away from the control plane node (I'm using the rollout restart as above) prior to the node going down then, of course, the coredns service redundancy works as expected.

are you sure this command is working before the primary CP node is removed?

kubectl exec -ti busybox -- nslookup kubernetes.default

locally i cannot get the same command to work on a brand new cluster. see:
https://github.com/kubernetes/kubernetes/issues/66924#issuecomment-411804435

alpine is known to use the "wrong" DNS system library.
is this busybox image alpine-based?

If one of the pods is reallocated away from the control plane node (I'm using the rollout restart as above) prior to the node going down then, of course, the coredns service redundancy works as expected.

nope, cannot reproduce this.

  • created a single CP node cluster
  • coredns pods were scheduled on the primary CP node
  • added more CP and worker nodes
  • deleted the primary CP node
  • coredns pods landed on workers right away
  • dns resolution continued to work
kubectl exec -ti debian-stretch -- nslookup kubernetes.default
Server:     10.96.0.10
Address:    10.96.0.10#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

note: debian-stretch needs dnsutils

The command is working. Before and after are as follows:

$ kubectl apply -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    172.30.0.10
Address 1: 172.30.0.10 kube-dns.kube-system.svc.group.local

Name:      kubernetes.default
Address 1: 172.30.0.1 kubernetes.default.svc.group.local
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    172.30.0.10
Address 1: 172.30.0.10

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1

If your setup is rescheduling coredns instantly, then that's the difference.
After I kill the node (i.e. simulate a hard crash on the master node running the coredns pods) I get:

$ kubectl get nodes
NAME        STATUS     ROLES    AGE     VERSION
srv-1okfc   Ready      master   7m25s   v1.15.0
srv-24ipi   NotReady   master   9m22s   v1.15.0
srv-93pp4   Ready      <none>   8m28s   v1.15.0
srv-i068b   Ready      <none>   8m42s   v1.15.0
srv-t7si4   Ready      <none>   8m30s   v1.15.0
srv-y5168   Ready      master   7m40s   v1.15.0
$ kubectl -n kube-system get pods --selector=k8s-app=kube-dns -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP                NODE        NOMINATED NODE   READINESS GATES
coredns-5c98db65d4-45k57   1/1     Running   0          9m12s   192.168.155.194   srv-24ipi   <none>           <none>
coredns-5c98db65d4-qc4fh   1/1     Running   0          9m12s   192.168.155.195   srv-24ipi   <none>           <none>

for several minutes.

So this is some sort of re-scheduling issue.

Is the setup you are running zone aware from the cloud provider labels (beta.kubernetes.io/instance-type, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone)?
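
For reference, a quick way to check whether the nodes carry those labels:

$ kubectl get nodes -L failure-domain.beta.kubernetes.io/zone -L failure-domain.beta.kubernetes.io/region -L beta.kubernetes.io/instance-type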

The kube-controller-manager that takes over once the master node is killed logs the following:

I0710 03:49:59.381241       1 event.go:258] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"srv-cenne", UID:"5e6aa24c-05b6-4823-9bd0-76b43a95164e", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node srv-cenne status is now: NodeNotReady
I0710 03:55:04.733763       1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: kube-system/coredns-5c98db65d4-g4n9v
I0710 03:55:04.733853       1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: kube-system/coredns-5c98db65d4-fsq54

This fits with the tolerations on the core-dns pods

  tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300

So I'm now not sure why your core-dns pods are being evicted.

@NeilW - my guess is that pods are evicted due to tolerationSeconds: 300
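
for reference, if faster failover is wanted, one option (a sketch only - the 30s value and the coredns-tolerations-patch.yaml filename are illustrative) is to declare the not-ready/unreachable tolerations explicitly on the coredns Deployment with a shorter tolerationSeconds, so the 300s defaults added by the DefaultTolerationSeconds admission plugin no longer apply:

$ cat coredns-tolerations-patch.yaml
spec:
  template:
    spec:
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 30
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 30
$ kubectl patch deployment coredns -n kube-system --type merge --patch "$(cat coredns-tolerations-patch.yaml)"

note that this only shortens how long the pods stay bound to a NotReady node; it does not by itself spread the replicas across nodes.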

So this is some sort of re-scheduling issue.

indeed.

Is the setup you are running zone aware from the cloud provider labels (beta.kubernetes.io/instance-type, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone)?

it's a local cluster running the nodes as docker containers (see /kinder in this repo).

The command is working. Before and after is as follows

very odd because someone just confirmed that busybox does not work for them at all.
https://github.com/kubernetes/kubeadm/issues/1659

did you try a different image (not busybox) just as a sanity check for the before and after?

Running

$ kubectl run --generator=run-pod/v1 --image=ubuntu --rm -i --tty my-shell -- bash
If you don't see a command prompt, try pressing enter.
root@my-shell:/# apt-get update

before crashing the master node gives

Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]      
Get:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]    
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB] 

and after you get

Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease                  
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease        
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'

If you run the rollout restart then it goes back to normal.

Remember that I'm crashing the master node here, not deleting it. It remains in the node list and goes to NotReady.

what exact method are you using to crash the node?
i just "shut down" the node and then did a kubectl delete no, right after.

actually does kubectl delete no affect the situation in your case?

EDIT: i will attempt to retest again.
but i guess the question is: if a node crashes without an operator around, DNS might fail for a while.
still seems like something that sig-scheduling/sig-node should know about.

I don't delete the node. I'm testing resilience in the face of a failure zone crash and subsequent recovery.

If you redeploy the core-dns service after workers have been added to the cluster, there is no interruption when the node is crashed, as the standard zone-aware anti-affinity process moves them to two separate nodes.

The way we build clusters with kubeadm (by joining new nodes to an initial node) tends to leave the coredns pods on a single master node. I'm automating it with Terraform, and even that doesn't build the nodes fast enough to get a distribution on the first attempt the majority of the time.

@NeilW i found more time to investigate this problem.

what you are looking for is podAntiAffinity to prevent the coredns Pods from scheduling on the same node, leaving one of them Pending until more nodes are available:

NAMESPACE     NAME                                 READY   STATUS    RESTARTS   AGE
kube-system   coredns-d757fdb54-4pz92              0/1     Pending   0          5m52s
kube-system   coredns-d757fdb54-m5r69              1/1     Running   0          5m52s

the workaround to your problem is the following:

# call kubeadm init and let kubeadm deploy the default coredns
kubeadm init ...
# patch the coredns deployment
kubectl patch deployment coredns -n kube-system --type merge --patch "$(cat coredns-aa-patch.yaml)"
# apply pod network
...
$ cat coredns-aa-patch.yaml 
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: k8s-app
                    operator: In
                    values:
                    - kube-dns
              topologyKey: "kubernetes.io/hostname"

once more nodes are added, the second Pod will schedule on another node.

unfortunately we cannot add this as part of the default coredns deployment in kubeadm for an important reason:

  • if one of the Pods is Pending, this will break a lot of e2e tests, and possibly some logic in the kubernetes e2e testing framework too, which assumes that all Pods are running before starting tests!

documentation reference:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/

instead of requiredDuringSchedulingRequiredDuringExecution, which is not supported yet, we need something like ignoredDuringSchedulingRequiredDuringExecution. this can result in the Pods scheduling on the same node, but once more nodes are available they will automatically spread.
(but such an anti-affinity category may never be added.)
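
a softer variant that does exist today is preferredDuringSchedulingIgnoredDuringExecution: it is only a scheduling hint and will not move Pods already co-located on one node, but it also never leaves a Pod Pending on a single-node cluster. a sketch of the same patch using the preferred form (assuming the same kube-dns label):

spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: k8s-app
                      operator: In
                      values:
                      - kube-dns
                topologyKey: "kubernetes.io/hostname"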

there might be other tricks to manage that currently, but i think k8s may be lacking a feature here.
other links:
https://github.com/kubernetes/kubernetes/issues/12140
https://github.com/entelo/reschedule
https://github.com/kubernetes-incubator/descheduler
https://stackoverflow.com/a/41930830

i'm closing this ticket given the provided workaround, but feel free to continue the discussion.

about:
https://github.com/kubernetes/kubeadm/issues/1657#issuecomment-509888479

So this is some sort of re-scheduling issue.

seems so, see:
https://github.com/kubernetes/kubernetes/issues/55713

people are saying that --pod-eviction-timeout is not working as expected.

@NeilW
it seems the scheduling behavior that you are seeing is by design.

have a look at the discussion here:
https://github.com/kubernetes/kubernetes/issues/55713#issuecomment-517982549

@neolit123 Would there be any reason to use your antiAffinity workaround over this fix? https://github.com/kubernetes/kops/pull/7400/files

@bcoughlan there are multiple ways to solve the issue.
this is mostly the reason why we haven't decided on how to proceed in kubeadm.

but looking at:
https://github.com/kubernetes/kops/pull/7400/files

we don't have plans to deploy the autoscaler with kubeadm by default.

i think the best option for users right now is to just delete the core-dns pods once multiple nodes are up; this will cause them to re-schedule on different nodes.
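
for example, assuming the default k8s-app=kube-dns label on the coredns pods:

$ kubectl -n kube-system delete pods -l k8s-app=kube-dns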

@neolit123 Thanks. On closer look, I see that the kops file also has the antiAffinity workaround - otherwise, when a node is added, the replica count is bumped but the node being added still has taints, which causes the coredns pod to be scheduled on the same node again.

So it looks like your workaround is needed either way.
