/kind feature
When deploying a Kubernetes cluster using kubeadm we create CoreDNS with a default replica count of 2.
When performing the kubeadm init execution, the CoreDNS deployment will be fully scheduled on the only control plane node available at that time. When joining further control plane nodes or worker nodes, both CoreDNS instances will remain on the first control plane node where they were scheduled at creation time.
This is to gather information about how we would feel about performing the following changes:
A preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule based on node hostname, so the scheduler will favour nodes where there isn't a CoreDNS pod running already (a sketch of such a rule is included after this list). On its own this doesn't make any difference (also, note that it's preferred and not required: otherwise a single control plane node without workers would never succeed in creating the CoreDNS deployment, since the requirement would never be met).
The next step then would be that kubeadm, when kubeadm join has finished and the new node is registered, would perform the following check:
Fetch the CoreDNS deployment pods from the Kube API. If all pods are running on the same node, delete one pod, forcing Kubernetes to reschedule it (at this point, the preferredDuringSchedulingIgnoredDuringExecution rule mentioned before comes into play, and the rescheduled pod will be placed on the new node). Maybe this is something we don't want to do, as we would be tying kubeadm to workload-specific logic, and with the addon story coming it might not make sense. However, this is something that happens on every kubeadm-based deployment, and in an HA environment it will make CoreDNS fail temporarily if the first control plane node goes away, until those pods are rescheduled somewhere else, just because they were initially scheduled on the only control plane node available.
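For reference, a minimal sketch of what such a rule could look like in the CoreDNS pod template (the weight is illustrative and the k8s-app: kube-dns selector simply mirrors the label the existing manifest uses; this is not the actual kubeadm manifest):

  # prefer nodes that do not already run a kube-dns pod (illustrative sketch)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              k8s-app: kube-dns
          topologyKey: kubernetes.io/hostname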
What do you think? cc/ @neolit123 @rosti @fabriziopandini @yastij
google doc proposal
https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit
/priority backlog
@ereslibre this is a duplicate of:
https://github.com/kubernetes/kubeadm/issues/1657
please read the whole discussion there.
preferredDuringSchedulingIgnoredDuringExecution
this is problematic because it would leave the second replica as pending.
if someone is running the e2e test suite on a single CP node cluster it will fail.
what should be done instead is to make CoreDNS a DS that lands Pods on all CP nodes.
but this will break all users that currently patch the CoreDNS deployment.
this is problematic because it would leave the second replica as pending.
This should not happen with preferredDuringSchedulingIgnoredDuringExecution, only with requiredDuringSchedulingIgnoredDuringExecution. I can check though, but I'm fairly sure with preferred it will just work even with a single node.
I can check though, but I'm fairly sure with preferred it will just work even with a single node.
both replicas will land on the same CP node in the beginning and then the "join" process has to amend the Deployment?
Yes, the join process would "kubectl delete" one of the pods through the API; I specify the details in the body of the issue.
please add this as agenda for next week.
the future coredns operator (as kubeadm will no longer include coredns in the future, by default) will run it as a DaemonSet.
i'm really in favor of breaking the users now and deploying it as DS and adding an action required in the release notes (in e.g. 1.18).
/assign
@rajansandeep i proposed that we should chat about this in the next office hours on Wednesday.
potentially decide if we want to proceed with changes for 1.18.
@neolit123 Yes, I agree. Would like to discuss what would be the potential direction for this.
today i played with deploying CoreDNS as a DS with this manifest (based on the existing Deployment manifest). it works fine and targets CP nodes only.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .DeploymentName }}
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: {{ .ControlPlaneTaintKey }}
        effect: NoSchedule
      nodeSelector:
        beta.kubernetes.io/os: linux
        {{ .ControlPlaneTaintKey }}: ""
      containers:
      - name: coredns
        image: {{ .Image }}
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
      - name: config-volume
        configMap:
          name: coredns
          items:
          - key: Corefile
            path: Corefile
i think it's also time to deprecate kube-dns in 1.18 and apply a 3 cycle deprecation policy to it, unless we switch to an addon installer sooner than that.
my proposal is to treat the big scale clusters (e.g. 5k nodes) as second class.
if those have demands for the coredns setup today they need to enable the autoscaler and possibly do other customizations already.
for the common use case the DaemonSet that targets only CP nodes seems fine to me.
there were objections that we should not move to a DS, but i'm not seeing any good ideas on how to continue using a Deployment and to fix the issue outlined in this ticket.
i think @yastij spoke about something similar to this:
https://github.com/kubernetes/kubernetes/pull/85108#discussion_r344985437
yet if i recall correctly, from my tests it didn't work so well.
My proposal stands:
preferredDuringSchedulingIgnoredDuringExecution on CoreDNS. This ensures we are not messing with the replica number, and we only take action if all CoreDNS pods are scheduled on the same node.
I can put a PR together with a proof of concept if you want to try it.
Does it make sense for kubeadm to support an autoscaler?
Does it make sense for kubeadm to support an autoscaler?
I would twist the question to: should the default deployment with kubeadm make it harder to use an autoscaler?
Answering your question: we are not "supporting" it, but in my opinion we shouldn't make it deliberately harder to use if folks want to.
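(For context on "enable the autoscaler" mentioned above: that usually means running the cluster-proportional-autoscaler against the coredns Deployment. A rough sketch of its scaling parameters, with illustrative numbers rather than kubeadm defaults, could look like this:)

  # ConfigMap read by cluster-proportional-autoscaler when scaling the coredns Deployment
  # (the parameter values below are illustrative assumptions)
  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: dns-autoscaler
    namespace: kube-system
  data:
    linear: |-
      {"coresPerReplica":256,"nodesPerReplica":16,"min":2,"preventSinglePointFailure":true}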
https://github.com/kubernetes/kubernetes/pull/85108#discussion_r344985437
yet if i recall correctly, from my tests it didn't work so well.
i'm going to test this again, if this works we don't have to make any other changes.
i'm going to test this again, if this works we don't have to make any other changes.
I'm fine if this is where we think kubeadm's boundaries are.
Note that even with what you propose we will have a period of time in which no CoreDNS pods will be answering internal dns requests if the first control plane goes down.
Also, we should use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution, for the issues mentioned earlier during the meeting.
In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.
In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.
even if we reschedule, if both of those Nodes become NotReady there will be no service.
Note that even with what you propose we will have a period of time in which no CoreDNS pods will be answering internal dns requests if the first control plane goes down.
i guess we can experiment with an aggressive timeout for this too - i.e. considering the Pods for coredns are critical.
i think we might want to outline the different proposals and approaches in a google doc and get some votes going, otherwise this issue will not see any progress.
I wrote this patch as an alternative: https://github.com/kubernetes/kubernetes/compare/master...ereslibre:even-dns-deployment-on-join, if you think it's reasonable I can open a PR for further discussion.
so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".
so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".
That is correct, or it could even be a hidden phase.
yes, hidden phase sounds fine.
still better to enumerate the different options before PRs, though.
i played with this patch:
diff --git a/cmd/kubeadm/app/phases/addons/dns/manifests.go b/cmd/kubeadm/app/phases/addons/dns/manifests.go
index 0ff61431705..5442b29ed88 100644
--- a/cmd/kubeadm/app/phases/addons/dns/manifests.go
+++ b/cmd/kubeadm/app/phases/addons/dns/manifests.go
@@ -243,6 +243,14 @@ spec:
         operator: Exists
       - key: {{ .ControlPlaneTaintKey }}
         effect: NoSchedule
+      - key: "node.kubernetes.io/unreachable"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
+      - key: "node.kubernetes.io/not-ready"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
       nodeSelector:
         beta.kubernetes.io/os: linux
       containers:
results:
I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit
Sorry for getting into this conversation, but I was just observing similar behavior: two coredns pods are crammed onto a single (first) control plane node up until that node is powered off, then the pods are migrated after a certain timeout (about 5 minutes, I guess) and end up more evenly spread after this migration.
What I didn't follow is why coredns is treated differently compared to other kube master services, like the apiserver or proxy or etcd. What is the purpose of having 2 pods on a single node? It looks like it is the most important service since it has 2 pods by default, while actually it becomes the least fault-tolerant one because of the odd placement.
Why not just make it the same as the other services and be done with it, especially if it is that important.
/retitle redesign the CoreDNS Pod Deployment in kubeadm
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
Isn't there a simpler option to just have some separate configuration for CoreDNS that toggles the toleration so that the pods don't actually get scheduled on the control plane? By default we should keep the current behavior where the pods are scheduled on the control plane. Patching the coredns deployment every time seems painful.
running coredns on worker nodes is not a good practice, because in some cases the workers are ephemeral.
kubeadm has plans to move the coredns deployment to be managed by an operator, so potentially this can be made configurable in that process, but we are facing some different complexities there.
in the meantime there are two options: patch the Deployment that kubeadm installs (which most people are doing today), or skip the kubeadm DNS addon and deploy CoreDNS yourself.
Okay thanks for the quick response. Just wanted to add that even in the case where workers are ephemeral having 2 or more replicas should prevent any loss of DNS resolution. Also this really just depends upon the circumstances so having it configurable helps.
@neolit123
patch the Deployment that kubeadm installs (which most people are doing today)
Will the patch be persistent, or what will be done on kubeadm upgrade?
For the other discussion, also have a look at the discussion in https://github.com/coredns/deployment/pull/206
@rdxmb
kubeadm imports a coredns migration library but that works only on the Corefile in the coredns ConfigMap during upgrade.
you still have to patch the Deployment or outer ConfigMap.
the only thing that is preserved in the Deployment i believe is the number of replicas:
https://github.com/kubernetes/kubeadm/issues/1954
https://github.com/kubernetes/kubernetes/pull/85837
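For anyone wondering what "patch the Deployment" looks like in practice: since the upgrade regenerates the Deployment (preserving only the replica count), any customization has to be reapplied afterwards. A minimal, illustrative strategic-merge patch adding the anti-affinity rule discussed earlier in this thread (the rule is an assumption, not something kubeadm ships) could be:

  # coredns-patch.yaml - reapplied after every kubeadm upgrade, e.g. with:
  #   kubectl -n kube-system patch deployment coredns --patch "$(cat coredns-patch.yaml)"
  spec:
    template:
      spec:
        affinity:
          podAntiAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    k8s-app: kube-dns
                topologyKey: kubernetes.io/hostname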
@neolit123 thanks!
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
CoreDNS gets scheduled on a worker node after one of the control plane nodes goes down, although it doesn't become ready on the worker node. To fix that, the worker node has to be manually cordoned, which shouldn't be an ideal scenario. Ideally it should not have attempted to schedule on non-control-plane nodes.
Scheduling on worker nodes works fine.
Depends. Targeting only CP nodes was discussed as not ideal either. For customizing that, you could patch the coredns deployment or skip kubeadm deploying coredns and do that yourself.
Does the descheduler make sense in this case?
If we add descheduler logic for coredns (for instance, create a CRD to declare which deployments need to be descheduled), it would be nice to have. The descheduler would then handle similar things for coredns and other applications.
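If that route were explored, the descheduler already ships a strategy aimed at this situation. A minimal policy sketch, assuming the kubernetes-sigs/descheduler v1alpha1 policy format, would be:

  # RemoveDuplicates evicts pods when more than one pod of the same owner
  # (e.g. the coredns Deployment's ReplicaSet) ends up on the same node
  apiVersion: "descheduler/v1alpha1"
  kind: "DeschedulerPolicy"
  strategies:
    "RemoveDuplicates":
      enabled: true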
i think we should move the coredns deployment to the addon operator eventually:
https://github.com/kubernetes-sigs/cluster-addons/tree/master/coredns
kops is in the process of implementing it. after that we can investigate for kubeadm.