Kubeadm: redesign the CoreDNS Pod Deployment in kubeadm

Created on 22 Nov 2019 · 40 Comments · Source: kubernetes/kubeadm

Is this a BUG REPORT or FEATURE REQUEST?

/kind feature

What happened?

When deploying a Kubernetes cluster using kubeadm, we create CoreDNS with a default of 2 replicas.

When running kubeadm init, the CoreDNS Deployment will be fully scheduled on the only control plane node available at that time. When further control plane nodes or worker nodes join, both CoreDNS instances will remain on the first control plane node where they were scheduled at creation time.

What you expected to happen?

This is to gather information about how we would feel about performing the following changes:

  • Introduce a preferredDuringSchedulingIgnoredDuringExecution anti-affinity rule based on node hostname, so the scheduler will favour nodes where there's not a CoreDNS pod running already.

On its own this doesn't make any difference (also, note that it's preferred and not required: otherwise a single control plane node without workers would never manage to create the CoreDNS Deployment, since the requirement would never be met). See the sketch below.
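For illustration, a minimal sketch in Go of what such a soft anti-affinity rule could look like, built with the upstream k8s.io/api types (the package name and the coreDNSAntiAffinity helper are hypothetical; how kubeadm would wire this into its CoreDNS manifest template is not decided here):

package dns

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// coreDNSAntiAffinity (hypothetical helper) returns a soft anti-affinity rule:
// the scheduler prefers nodes that do not already run a kube-dns pod, but will
// still schedule the pod if no such node exists (single control plane node case).
func coreDNSAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						TopologyKey: "kubernetes.io/hostname",
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"k8s-app": "kube-dns"},
						},
					},
				},
			},
		},
	}
}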

The next step would then be for kubeadm, once kubeadm join has finished and the new node is registered, to perform the following check:

  • Read the CoreDNS Deployment pods from the Kube API. If all pods are running on the same node, delete one of them, forcing Kubernetes to reschedule it (at this point, the preferredDuringSchedulingIgnoredDuringExecution rule mentioned before comes into play, and the rescheduled pod will be placed on the new node). See the sketch below.
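And a rough sketch in Go of that join-time check, using client-go (the rebalanceCoreDNS function is hypothetical and not an existing kubeadm phase; it assumes the k8s-app=kube-dns label used by the kubeadm-managed CoreDNS pods and the current client-go method signatures that take a context):

package dns

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// rebalanceCoreDNS (hypothetical post-join step): if every CoreDNS pod sits on
// the same node, delete one of them so the scheduler can place the replacement
// on the freshly joined node, helped by the preferred anti-affinity rule above.
func rebalanceCoreDNS(ctx context.Context, client kubernetes.Interface) error {
	pods, err := client.CoreV1().Pods("kube-system").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=kube-dns",
	})
	if err != nil {
		return err
	}
	if len(pods.Items) < 2 {
		return nil // a single replica cannot be spread any further
	}
	firstNode := pods.Items[0].Spec.NodeName
	for _, pod := range pods.Items[1:] {
		if pod.Spec.NodeName != firstNode {
			return nil // already spread across more than one node
		}
	}
	// All replicas landed on the same node: delete one and let it be rescheduled.
	return client.CoreV1().Pods("kube-system").Delete(ctx, pods.Items[0].Name, metav1.DeleteOptions{})
}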

Maybe this is something we don't want to do, as it would tie kubeadm to workload-specific logic, and with the addon story coming it might not make sense. However, this happens on every kubeadm-based deployment, and in an HA environment it will make CoreDNS fail temporarily if the first control plane node goes away, until those pods are rescheduled somewhere else, just because they were initially scheduled on the only control plane node available.

What do you think? cc/ @neolit123 @rosti @fabriziopandini @yastij


google doc proposal
https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

kind/design kind/feature priority/backlog

Most helpful comment

I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

All 40 comments

/priority backlog

@ereslibre this is a duplicate of:
https://github.com/kubernetes/kubeadm/issues/1657

please read the whole discussion there.

preferredDuringSchedulingIgnoredDuringExecution

this is problematic because it would leave the second replica as pending.
if someone is running the e2e test suite on a single CP node cluster it will fail.

what should be done instead is to make CoreDNS a DS that lands Pods on all CP nodes.
but this will break all users that currently patch the CoreDNS deployment.

this is problematic because it would leave the second replica as pending.

This should not happen with preferredDuringSchedulingIgnoredDuringExecution; it would happen with requiredDuringSchedulingIgnoredDuringExecution. I can check though, but I'm fairly sure with preferred it will just work even with a single node.

I can check though, but I'm fairly sure with preferred it will just work even with a single node.

both replicas will land on the same CP node in the beginning and then the "join" process has to amend the Deployment?

Yes, the join process would "kubectl delete" one of the pods through the API; I specify the details in the body of the issue.

please add this as an agenda item for next week.
the future coredns operator (as kubeadm will no longer include coredns in the future, by default) will run it as a DaemonSet.

i'm really in favor of breaking the users now and deploying it as a DS, adding an "action required" note in the release notes (e.g. in 1.18).

/assign

@rajansandeep i proposed that we should chat about this in the next office hours on Wednesday.
potentially decide if we want to proceed with changes for 1.18.

@neolit123 Yes, I agree. Would like to discuss what would be the potential direction for this.

today i played with deploying CoreDNS as a DS with this manifest (based on the existing Deployment manifest). it works fine and targets CP nodes only.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: {{ .DeploymentName }}
  namespace: kube-system
  labels:
    k8s-app: kube-dns
spec:
  selector:
    matchLabels:
      k8s-app: kube-dns
  template:
    metadata:
      labels:
        k8s-app: kube-dns
    spec:
      priorityClassName: system-cluster-critical
      serviceAccountName: coredns
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: {{ .ControlPlaneTaintKey }}
        effect: NoSchedule
      nodeSelector:
        beta.kubernetes.io/os: linux
        {{ .ControlPlaneTaintKey }}: ""
      containers:
      - name: coredns
        image: {{ .Image }}
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        args: [ "-conf", "/etc/coredns/Corefile" ]
        volumeMounts:
        - name: config-volume
          mountPath: /etc/coredns
          readOnly: true
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        - containerPort: 9153
          name: metrics
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
            scheme: HTTP
          initialDelaySeconds: 60
          timeoutSeconds: 5
          successThreshold: 1
          failureThreshold: 5
        readinessProbe:
          httpGet:
            path: /ready
            port: 8181
            scheme: HTTP
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            add:
            - NET_BIND_SERVICE
            drop:
            - all
          readOnlyRootFilesystem: true
      dnsPolicy: Default
      volumes:
        - name: config-volume
          configMap:
            name: coredns
            items:
            - key: Corefile
              path: Corefile

i think it's also time to deprecate kube-dns in 1.18 and apply a 3 cycle deprecation policy to it, unless we switch to an addon installer sooner than that.

my proposal is to treat the big-scale clusters (e.g. 5k nodes) as second class.
if those have special demands for the coredns setup, they already need to enable the autoscaler and possibly do other customizations today.

for the common use case the DaemonSet that targets only CP nodes seems fine to me.

there were objections that we should not move to a DS, but i'm not seeing any good ideas on how to continue using a Deployment and to fix the issue outlined in this ticket.

i think @yastij spoke about something similar to this:
https://github.com/kubernetes/kubernetes/pull/85108#discussion_r344985437

yet if i recall correctly, from my tests it didn't work so well.

My proposal stands:

  • Set anti affinity preferredDuringSchedulingIgnoredDuringExecution on CoreDNS.
  • Default the Deployment to 2 replicas, as it is now.
  • When a node joins:

    • Check the CoreDNS pods: if all of them are running on the same node, kill half of the CoreDNS pods.

This ensures we are not messing with the replica number, and that we only take action if all CoreDNS pods are scheduled on the same node.

I can put a PR together with a proof of concept if you want to try it.

Does it make sense for kubeadm to support an autoscaler?

Does it make sense for kubeadm to support an autoscaler?

I would turn the question around: should the default deployment with kubeadm make it harder to use an autoscaler?

Answering your question: we are not "supporting" it, but in my opinion we shouldn't make it deliberately harder to use if folks want to.

https://github.com/kubernetes/kubernetes/pull/85108#discussion_r344985437
yet if i recall correctly, from my tests it didn't work so well.

i'm going to test this again, if this works we don't have to make any other changes.

i'm going to test this again, if this works we don't have to make any other changes.

I'm fine if this is where we think kubeadm's boundaries are.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods are answering internal DNS requests if the first control plane node goes down.

Also, we should use preferredDuringSchedulingIgnoredDuringExecution instead of requiredDuringSchedulingIgnoredDuringExecution, for the issues mentioned earlier during the meeting.

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

In my opinion, for a proper solution we need to force the rescheduling upon a new node join. But I'm fine if that's not the expectation of the broader group.

even if we reschedule, if both of those Nodes become NotReady there will be no service.

Note that even with what you propose, we will have a period of time in which no CoreDNS pods are answering internal DNS requests if the first control plane node goes down.

i guess we can experiment with an aggressive timeout for this too - i.e. considering the Pods for coredns are critical.

i think we might want to outline the different proposals and approaches in a google doc and get some votes going, otherwise this issue will not see any progress.

I wrote this patch as an alternative: https://github.com/kubernetes/kubernetes/compare/master...ereslibre:even-dns-deployment-on-join, if you think it's reasonable I can open a PR for further discussion.

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

so adding the phase means binding more logic to the coredns deployment, yet the idea is to move the coredns deployment outside of kubeadm long term, which means that we have to deprecate the phase or just keep it "experimental".

That is correct, or it could even be a hidden phase.

yes, hidden phase sounds fine.
still better to enumerate the different options before PRs, though.

i played with this patch:

diff --git a/cmd/kubeadm/app/phases/addons/dns/manifests.go b/cmd/kubeadm/app/phases/addons/dns/manifests.go
index 0ff61431705..5442b29ed88 100644
--- a/cmd/kubeadm/app/phases/addons/dns/manifests.go
+++ b/cmd/kubeadm/app/phases/addons/dns/manifests.go
@@ -243,6 +243,14 @@ spec:
         operator: Exists
       - key: {{ .ControlPlaneTaintKey }}
         effect: NoSchedule
+      - key: "node.kubernetes.io/unreachable"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
+      - key: "node.kubernetes.io/not-ready"
+        operator: "Exists"
+        effect: "NoExecute"
+        tolerationSeconds: 5
       nodeSelector:
         beta.kubernetes.io/os: linux
       containers:

results:

  • create a cluster with multiple nodes
  • "stop" (e.g. shutdown a VM) the primary CP node where coredns landed
  • the Node becomes NotReady after a few minutes (is there a way to control this?)
  • 5 seconds after the Node is NotReady, the coredns pods on this Node enter Terminating state
  • the Pods are scheduled on new Nodes
  • if the primary CP Node is brought back to Ready the Terminating Pods terminate and are deleted.
  • if the primary CP Node is deleted the same happens.
  • if no action is taken on the primary CP Node the Pods remain in Terminating indefinitely.
    (related to https://github.com/kubernetes/kubernetes/issues/51835?)

I created this document to follow up on the different alternatives and discuss all of them: https://docs.google.com/document/d/1shXvqFwcqH8hJF-tAYNlEnVn45QZChfTjxKzF_D3d8I/edit

Sorry for jumping into this conversation, but I was just observing similar behavior: two coredns pods are crammed onto the single (first) control plane node until that node is powered off; the pods are then migrated after a certain timeout (about 5 minutes, I guess) and end up more evenly spread after this migration.

What I didn't follow is why coredns is treated differently compared to other kube master services, like the apiserver, proxy, or etcd. What is the purpose of having 2 pods on a single node? It looks like it is the most important service, since it has 2 pods by default, while in practice it becomes the least fault-tolerant one because of the odd placement.

Why not just make it the same as the other services and be done with it, especially if it is that important?

/retitle redesign the CoreDNS Pod Deployment in kubeadm

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Isn't there a simpler option to just have some separate configuration for CoreDNS that toggles the toleration so that the pods don't actually get scheduled on the control plane? By default we should keep the current behavior where the pods are scheduled on the control plane. Patching the coredns deployment every time seems painful.

running coredns on worker nodes is not a good practice, because in some cases the workers are ephemeral.

kubeadm has plans to move the coredns deployment to be managed by an operator, so potentially this can be made configurable in that process, but we are facing some different complexities there.

in the meantime there are two options:

  • patch the Deployment that kubeadm installs (which most people are doing today).
  • skip coredns on kubeadm init and install it manually with a custom Deployment or DaemonSet.

Okay, thanks for the quick response. Just wanted to add that even in the case where workers are ephemeral, having 2 or more replicas should prevent any loss of DNS resolution. Also, this really just depends upon the circumstances, so having it configurable helps.

@neolit123

patch the Deployment that kubeadm installs (which most people are doing today)

Will the patch be persistent, or what will be done on kubeadm upgrade?

For the other discussion, also have a look at the discussion in https://github.com/coredns/deployment/pull/206

@rdxmb
kubeadm imports a coredns migration library but that works only on the Corefile in the coredns ConfigMap during upgrade.
you still have to patch the Deployment or outer ConfigMap.

the only thing that is preserved in the Deployment i believe is the number of replicas:
https://github.com/kubernetes/kubeadm/issues/1954
https://github.com/kubernetes/kubernetes/pull/85837

@neolit123 thanks!

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

CoreDNS gets scheduled on a worker node after one of the control plane nodes goes down, although it doesn't become ready on the worker node. To fix that, it has to be manually cordoned off the worker node, which shouldn't be an ideal scenario. Ideally it should not have attempted to schedule on non-control-plane nodes.

Scheduling on worker nodes works fine.

Depends. Targeting only CP nodes was discussed as not ideal either. For customizing that, you could patch the coredns deployment or skip kubeadm deploying coredns and do that yourself.

Does the descheduler make sense in this case?
If we add descheduler logic for coredns (for instance, create a CRD to declare which deployments need to be descheduled), it would be nice to have. The descheduler could then handle similar things for coredns or other applications.

i think we should move the coredns deployment to the addon operator eventually:
https://github.com/kubernetes-sigs/cluster-addons/tree/master/coredns

kops is in the process of implementing it. after that we can investigate for kubeadm.
