When updating from k8s 1.5.2 to 1.6.2 via kops 1.6.0 with KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate", kube-dns failed to start after the first node was restarted, which caused kops to exit once the validation timed out.
The cluster was missing a configmap named kube-dns in the kube-system namespace.
I recovered by creating an empty configmap with kubectl create configmap -n kube-system kube-dns, then continued the rolling update, which finished without further problems.
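For reference, the recovery amounted to roughly the following two commands (the cluster name here is a placeholder, not from the original report):
# Create an empty kube-dns configmap so the 1.6 addon's configmap volume can be mounted
kubectl create configmap -n kube-system kube-dns
# Resume the interrupted rolling update
KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" kops rolling-update cluster my.cluster.example.com --yes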
Did a new version of kube-dns get installed? Also, to confirm, this is with the released kops binary?
Here is the kube-dns deployment that was created. I used the kops darwin binary from the release page on GitHub:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"labels":{"k8s-addon":"kube-dns.addons.k8s.io","k8s-app":"kube-dns","kubernetes.io/cluster-service":"true"},"name":"kube-dns","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"kube-dns"}},"strategy":{"rollingUpdate":{"maxSurge":"10%","maxUnavailable":0}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\":\"CriticalAddonsOnly\", \"operator\":\"Exists\"}]"},"labels":{"k8s-app":"kube-dns"}},"spec":{"containers":[{"args":["--domain=cluster.local.","--dns-port=10053","--config-dir=/kube-dns-config","--v=2"],"env":[{"name":"PROMETHEUS_PORT","value":"10055"}],"image":"gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/healthcheck/kubedns","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"kubedns","ports":[{"containerPort":10053,"name":"dns-local","protocol":"UDP"},{"containerPort":10053,"name":"dns-tcp-local","protocol":"TCP"},{"containerPort":10055,"name":"metrics","protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/readiness","port":8081,"scheme":"HTTP"},"initialDelaySeconds":3,"timeoutSeconds":5},"resources":{"limits":{"memory":"170Mi"},"requests":{"cpu":"100m","memory":"70Mi"}},"volumeMounts":[{"mountPath":"/kube-dns-config","name":"kube-dns-config"}]},{"args":["-v=2","-logtostderr","-configDir=/etc/k8s/dns/dnsmasq-nanny","-restartDnsmasq=true","--","-k","--cache-size=1000","--log-facility=-","--server=/cluster.local/127.0.0.1#10053","--server=/in-addr.arpa/127.0.0.1#10053","--server=/in6.arpa/127.0.0.1#10053"],"image":"gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/healthcheck/dnsmasq","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"dnsmasq","ports":[{"containerPort":53,"name":"dns","protocol":"UDP"},{"containerPort":53,"name":"dns-tcp","protocol":"TCP"}],"resources":{"requests":{"cpu":"150m","memory":"20Mi"}},"volumeMounts":[{"mountPath":"/etc/k8s/dns/dnsmasq-nanny","name":"kube-dns-config"}]},{"args":["--v=2","--logtostderr","--probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A","--probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A"],"image":"gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/metrics","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"sidecar","ports":[{"containerPort":10054,"name":"metrics","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"20Mi"}}}],"dnsPolicy":"Default","serviceAccountName":"kube-dns","volumes":[{"configMap":{"name":"kube-dns","optional":true},"name":"kube-dns-config"}]}}}}
  creationTimestamp: 2017-03-02T14:38:47Z
  generation: 7
  labels:
    k8s-addon: kube-dns.addons.k8s.io
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: kube-dns
  namespace: kube-system
  resourceVersion: "19759179"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-dns
  uid: f1d87cfc-ff55-11e6-8f43-0a1b518ebe22
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-dir=/kube-dns-config
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /kube-dns-config
          name: kube-dns-config
      - args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/in6.arpa/127.0.0.1#10053
        image: gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/k8s/dns/dnsmasq-nanny
          name: kube-dns-config
      - args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
        image: gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-dns
      serviceAccountName: kube-dns
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-dns
        name: kube-dns-config
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2017-05-17T23:16:21Z
    lastUpdateTime: 2017-05-17T23:16:21Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 7
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2
I actually think this needs to be filed as a bug upstream in the dns repo. There is no reason we should need an empty configmap.
Thoughts?
From @justinsb on slack -
So 1.6 introduces the volume with the optional mount, and dns uses that in 1.6
We see the server is 1.6, and so we install the 1.6 version
But the kubelet nodes are still on 1.5
So if a DNS pod lands on a 1.5 node, it won't be able to mount the configmap
And then, as you say, the rolling-update with drain fails
I only noticed it the other day, and I'm not 100% sure how to solve it TBH
That's definitely what happened to me. So if an empty configmap were created before upgrading, and then deleted at the end if it is still empty, it would work. I don't know if that is something that can even be done. Maybe just put it in the upgrade notes and do it manually? (That feels a bit icky.)
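A rough sketch of that manual workaround, assuming it would live in the upgrade notes rather than in kops itself:
# Before the rolling update: make sure the configmap exists
kubectl get configmap kube-dns -n kube-system || kubectl create configmap -n kube-system kube-dns
# After the rolling update: check whether it is still empty ...
kubectl get configmap kube-dns -n kube-system -o jsonpath='{.data}'
# ... and only delete it if the previous command printed nothing
kubectl delete configmap kube-dns -n kube-system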
I had the exact same problem (missing kube-dns-config configmap) when upgrading a cluster from 1.5.3 to 1.6.2 using the KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" flag, and solved it by editing the kube-dns deployment and setting optional: true for that configmap in the volumes list (the optional flag was already present on the previous replicaset):
volumes:
- configMap:
    defaultMode: 420
    name: kube-dns
    optional: true
  name: kube-dns-config
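If you would rather not hand-edit the deployment, a JSON patch along these lines should make the same change (a sketch I have not run against this cluster; the /volumes/0 index assumes kube-dns-config is the only volume, as in the deployment above):
kubectl -n kube-system patch deployment kube-dns --type=json \
  -p '[{"op":"add","path":"/spec/template/spec/volumes/0/configMap/optional","value":true}]'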
After you changed that, did the pod get scheduled onto a 1.6 node?
Just bumped into this issue myself after upgrading from v1.5.2 to v1.6.2. @kurlzor's fix worked for me - thanks.
What I am seeing is that we are allowing the cluster to install kube-dns meant for 1.6 before the nodes are 1.6, which is incorrect.
I think I am running into a related issue: I am upgrading from 1.5.4 to 1.6.6 with the KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" feature flag, and the dns setup is very unstable, sometimes blocking the validation of the cluster. I _did_ create a kube-dns configmap before launching the rolling-update.
The behaviour I am observing is that the kube-dns-* pods keep being created and terminated and are never stable for more than a minute. Stranger yet, there seems to be a conflict between two systems, because some kube-dns-* pods have 3 containers while others have 4 (the extra one is a healthz container).
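To tell the two variants apart, you can list each pod with its container names, something like this (a diagnostic sketch, not from the original report):
kubectl -n kube-system get pods -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'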
KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" feature flag and the dns setup is very unstable, sometimes blocking the validation of the cluster
How is it unstable? I need more details about how the validation and draining are impacting the cluster. It is not installing kube-dns; that is nodeup's job. Upgrading kube-dns has been problematic.
How do we make the drain and validate more stable? What options did you use with the rolling-update?
BTW you can have the same exact problem without using DrainAndValidateRollingUpdate
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Just ran into this also
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close