BUG REPORT
kubeadm version (use kubeadm version):
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:37:41Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:

What happened:
The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.
What you expected to happen:
When there is more than one node in the cluster, the pods should be spread across the nodes.
$ kubectl -n kube-system get pods --selector=k8s-app=kube-dns -o wide

I've looked at the other tickets on this and the problems with anti-affinity and the like, and I'd suggest that perhaps something as simple as kubeadm looking at the pod list of critical services after a join, and kicking a pod out if it sees them all on the same node, would do the trick (a rough sketch of such a check follows below).
kubectl -n kube-system rollout restart deployment coredns seems to do the trick
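A minimal sketch of such a check, assuming bash, a working kubeconfig, and the k8s-app=kube-dns label that kubeadm puts on the coredns pods:

# count the distinct nodes hosting coredns pods vs. the number of pods
nodes=$(kubectl -n kube-system get pods -l k8s-app=kube-dns -o jsonpath='{.items[*].spec.nodeName}' | tr ' ' '\n' | sort -u | grep -c .)
pods=$(kubectl -n kube-system get pods -l k8s-app=kube-dns --no-headers | wc -l)
# if there is more than one pod and they all share a node, force a re-spread
if [ "$pods" -gt 1 ] && [ "$nodes" -eq 1 ]; then
  kubectl -n kube-system rollout restart deployment coredns
fi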
The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.
i just tested this with a couple of scenarios using k8s 1.16.0-alpha (i don't think anything related changed between 1.15 and this version):
A) on a cluster with 3 CP / 3 worker nodes:
the coredns pods got deployed on different workers.
force removing one of the worker nodes resulted in a coredns pod being created on another worker immediately.
B) on a cluster with a 3 CP nodes without workers:
the coredns pods got deployed on the same primary control-plane.
removing the control plane node resulted in both coredns pods being scheduled immediately on the next control-plane node.
The CoreDNS service pods are deployed on the same master node. When the node is lost, the DNS service is down on the cluster until the pod eviction timeout expires.
i cannot seem to reproduce the problem you are describing.
if there are workers in the cluster coredns should tolerate them because of:
https://github.com/kubernetes/kubernetes/blob/master/cmd/kubeadm/app/phases/addons/dns/manifests.go#L241-L245
are there missing details in the report? anything special about the cluster - how many CP / worker nodes?
pod eviction timeout
you can also technically modify this value from the default of 5m.
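for example, something like this in the kubeadm config passed to init (a sketch only; pod-eviction-timeout is a kube-controller-manager flag, forwarded via extraArgs):

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controllerManager:
  extraArgs:
    pod-eviction-timeout: "1m0s"  # assumed value, shorter than the 5m default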
I bring up the master first, then add in worker nodes, then add in master CP replicas. The CoreDNS replicas get scheduled on the initial master as it is being configured.
I'm pulling down the brightbox cloud-controller and running that deployment prior to joining the worker nodes, so it's possible that --cloud-provider=external is interfering, given the delay in the worker nodes being marked as ready.
The CNI is calico and I'm applying that prior to joining the workers.
However, from the Terraform traces it all appears to run in parallel.
The test cluster is 3 worker nodes and 3 CP nodes
@kubernetes/sig-cluster-lifecycle
technically if you just have the single primary CP for a while, both coredns pods will just land on it.
i'm not aware of a k8s mechanic that will allow, say, one of the coredns pods to auto-move to another node once more nodes are up. maybe some clever play with taints/tolerations can be done to only have one replica on the primary CP while the other replica waits for another node.
you can technically just remove those pods and the replication controller will/should reconcile the deployment on the available workers.
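e.g. (assuming the default k8s-app=kube-dns label; expect a brief DNS blip while the replicas are recreated):

kubectl -n kube-system delete pods -l k8s-app=kube-dns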
but something odd is going on here, because even if the primary CP is removed after all nodes are up, the pods should still schedule elsewhere right away without waiting for --pod-eviction-timeout.
so it's possible that --cloud-provider=external is interfering
it could be cloud provider specific and we don't have test signal for such scenarios.
in any case i don't think that's a kubeadm bug, but possibly something for sig scheduling / sig cloud-provider to investigate.
on the kubeadm side, we can consider it an issue that coredns replicas can land on the same node, yet i cannot reproduce reconciliation issues related to that - i.e. the service remains up.
The DNS service remains up, but it fails to work:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 172.30.0.10
Address 1: 172.30.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
i don't see how this relates to:
When there is more than one node in the cluster the pods should be spread across the nodes.
When the control plane node with both coredns pods on it goes down, the coredns service is rendered unusable until the other two control plane nodes decide to terminate the pods and restart the deployment - seemingly after the pod eviction timeout expires.
If one of the pods is reallocated away from the control plane node (I'm using the rollout restart as above) prior to the node going down then, of course, the coredns service redundancy works as expected.
are you sure this command is working before the primary CP node is removed?
kubectl exec -ti busybox -- nslookup kubernetes.default
locally i cannot get the same command to work on a brand new cluster. see:
https://github.com/kubernetes/kubernetes/issues/66924#issuecomment-411804435
alpine is known to use the "wrong" DNS system library.
is this busybox image alpine based?
If one of the pods is reallocated away from the control plane node (I'm using the rollout restart as above) prior to the node going down then, of course, the coredns service redundancy works as expected.
nope, cannot reproduce this.
kubectl exec -ti debian-stretch -- nslookup kubernetes.default
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.96.0.1
note: debian-stretch needs dnsutils
The command is working. Before and after are as follows:
$ kubectl apply -f https://k8s.io/examples/admin/dns/busybox.yaml
pod/busybox created
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 172.30.0.10
Address 1: 172.30.0.10 kube-dns.kube-system.svc.group.local
Name: kubernetes.default
Address 1: 172.30.0.1 kubernetes.default.svc.group.local
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 172.30.0.10
Address 1: 172.30.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
If your setup is rescheduling coredns instantly, then that's the difference.
After I kill the node (i.e. simulate a hard crash on the master node running the coredns pods) I get:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
srv-1okfc Ready master 7m25s v1.15.0
srv-24ipi NotReady master 9m22s v1.15.0
srv-93pp4 Ready <none> 8m28s v1.15.0
srv-i068b Ready <none> 8m42s v1.15.0
srv-t7si4 Ready <none> 8m30s v1.15.0
srv-y5168 Ready master 7m40s v1.15.0
$ kubectl -n kube-system get pods --selector=k8s-app=kube-dns -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
coredns-5c98db65d4-45k57 1/1 Running 0 9m12s 192.168.155.194 srv-24ipi <none> <none>
coredns-5c98db65d4-qc4fh 1/1 Running 0 9m12s 192.168.155.195 srv-24ipi <none> <none>
for several minutes.
So this is some sort of re-scheduling issue.
Is the setup you are running zone aware from the cloud provider labels (beta.kubernetes.io/instance-type, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone)?
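One way to check, assuming a standard kubectl, is to show those labels as extra columns:

kubectl get nodes -L failure-domain.beta.kubernetes.io/zone,failure-domain.beta.kubernetes.io/region,beta.kubernetes.io/instance-type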
The kube-controller-manager that takes over once the master node is killed logs the following:
I0710 03:49:59.381241 1 event.go:258] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"srv-cenne", UID:"5e6aa24c-05b6-4823-9bd0-76b43a95164e", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node srv-cenne status is now: NodeNotReady
I0710 03:55:04.733763 1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: kube-system/coredns-5c98db65d4-g4n9v
I0710 03:55:04.733853 1 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: kube-system/coredns-5c98db65d4-fsq54
This fits with the tolerations on the core-dns pods:
tolerations:
- key: CriticalAddonsOnly
  operator: Exists
- effect: NoSchedule
  key: node-role.kubernetes.io/master
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 300
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 300
So I'm now not sure why your core-dns pods are being evicted.
@NeilW - my guess is that pods are evicted due to tolerationSeconds: 300
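to double check which tolerations the running pods actually carry (the 300s not-ready/unreachable ones are typically injected by the DefaultTolerationSeconds admission plugin rather than by the kubeadm manifest), something like:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.tolerations}{"\n"}{end}'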
So this is some sort of re-scheduling issue.
indeed.
Is the setup you are running zone aware from the cloud provider labels (beta.kubernetes.io/instance-type, failure-domain.beta.kubernetes.io/region, failure-domain.beta.kubernetes.io/zone)?
it's a local cluster running the nodes as docker containers (see /kinder in this repo).
The command is working. Before and after is as follows
very odd because someone just confirmed that busybox does not work for them at all.
https://github.com/kubernetes/kubeadm/issues/1659
did you try a different image (not busybox) just as a sanity check for the before and after?
Running
$ kubectl run --generator=run-pod/v1 --image=ubuntu --rm -i --tty my-shell -- bash
If you don't see a command prompt, try pressing enter.
root@my-shell:/# apt-get update
before crashing the master node gives
Get:1 http://archive.ubuntu.com/ubuntu bionic InRelease [242 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Get:4 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
and after you get
Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
Temporary failure resolving 'archive.ubuntu.com'
If you run the rollout restart then it goes back to normal.
Remember that I'm crashing the master node here, not deleting it. It remains in the node list and goes to 'Not Ready'.
what exact method are you using to crash the node?
i just "shut down" the node and then did a kubectl delete no, right after.
actually does kubectl delete no affect the situation in your case?
EDIT: i will attempt to retest again.
but i guess the question is: if a node crashes without an operator around, DNS might fail for a while.
still seems like something that sig-scheduling/sig-node should know about.
I don't delete the node. I'm testing resilience in the face of a failure zone crash and subsequent recovery.
If you redeploy the core-dns service after workers have been added into the cluster there is no interruption when the node is crashed, as the standard zone aware anti-affinity process moves them to two separate nodes.
The way we build clusters with kubeadm (by joining in new nodes to an initial node) tends to leave the coredns pods on a single master node. I'm automating it with terraform and even that doesn't build the nodes fast enough to get a distribution first time, the majority of the time.
@NeilW i found more time to investigate this problem.
what you are looking for is podAntiAffinity to prevent the coredns Pods scheduling on the same nodes and leave one of them Pending until more nodes are available:
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system coredns-d757fdb54-4pz92 0/1 Pending 0 5m52s
kube-system coredns-d757fdb54-m5r69 1/1 Running 0 5m52s
the workaround to your problem is the following:
# call kubeadm init and let kubeadm deploy the default coredns
kubeadm init ...
# patch the coredns deployment
kubectl patch deployment coredns -n kube-system --type merge --patch "$(cat coredns-aa-patch.yaml)"
# apply pod network
...
$ cat coredns-aa-patch.yaml
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - kube-dns
            topologyKey: "kubernetes.io/hostname"
by adding more nodes, the second Pod will schedule on another node.
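you can watch this happen as nodes join, e.g.:

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide -w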
unfortunately we cannot add this as part of the default coredns deployment in kubeadm for an important reason:
documentation reference:
https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
the required anti-affinity above leaves one replica permanently Pending on a single-node cluster. instead of requiredDuringSchedulingRequiredDuringExecution, which is not supported yet, we would need something like ignoredDuringSchedulingRequiredDuringExecution: that could result in the Pods scheduling on the same node, but once more nodes are available they would automatically spread.
(but such an anti-affinity category may never be added.)
there might be other tricks to manage that currently, but i think k8s may be lacking a feature here.
other links:
https://github.com/kubernetes/kubernetes/issues/12140
https://github.com/entelo/reschedule
https://github.com/kubernetes-incubator/descheduler
https://stackoverflow.com/a/41930830
i'm closing this ticket given the provided workaround, but feel free to continue the discussion.
about:
https://github.com/kubernetes/kubeadm/issues/1657#issuecomment-509888479
So this is some sort of re-scheduling issue.
seems like so, see:
https://github.com/kubernetes/kubernetes/issues/55713
people are saying that --pod-eviction-timeout is not working as expected.
@NeilW
it seems the scheduling behavior that you are seeing is by design.
have a look at the discussion here:
https://github.com/kubernetes/kubernetes/issues/55713#issuecomment-517982549
@neolit123 Would there be any reason to use your antiAffinity workaround over this fix? https://github.com/kubernetes/kops/pull/7400/files
@bcoughlan there are multiple ways to solve the issue.
this is mostly the reason why we haven't decided on how to proceed in kubeadm.
but looking at:
https://github.com/kubernetes/kops/pull/7400/files
we don't have plans to deploy the autoscaler with kubeadm by default.
i think the best option for users right now is to just delete the core-dns pods once multiple nodes are up and this will cause them to re-sched on different nodes.
@neolit123 Thanks. At a closer look I see that the kops file also has the antiAffinity workaround - otherwise when a node is added, the replica count is bumped but the node being added still has taints which causes the coredns pod to be scheduled on the same node again.
So it looks like your workaround is needed either way.