Kops: kube-dns issue when upgrading using kops upgrade

Created on 18 May 2017 · 15 Comments · Source: kubernetes/kops

When upgrading from k8s 1.5.2 to 1.6.2 via kops 1.6.0 with KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate", kube-dns failed to start after the first node restarted, which caused kops to exit after the validation timeout.

It was missing a configmap named kube-dns in the kube-system namespace.

I recovered by creating an empty configmap with kubectl create configmap -n kube-system kube-dns, then continued the rolling update, which finished without further problems.
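A minimal sketch of that recovery ($CLUSTER_NAME is a placeholder; substitute your own cluster name):

kubectl create configmap kube-dns -n kube-system   # empty configmap, just so the volume can mount
kops rolling-update cluster $CLUSTER_NAME --yes    # resume the rolling update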

Labels: area/rolling-update, lifecycle/rotten

Most helpful comment

I had the exact same problem (missing kube-dns-config configmap) when upgrading a cluster from 1.5.3 to 1.6.2 with the KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" flag. I solved it by editing the kube-dns deployment and setting optional: true for that configmap in the volumes list (the optional flag was already there on the previous replicaset):

volumes:
- configMap:
    defaultMode: 420
    name: kube-dns
    optional: true
  name: kube-dns-config

All 15 comments

Did a new version of kube-dns get installed? Also, to confirm: is this with the official kops release?

Here is the kube-dns deployment that was created. I used the kops darwin binary from the release page on GitHub:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"extensions/v1beta1","kind":"Deployment","metadata":{"annotations":{},"labels":{"k8s-addon":"kube-dns.addons.k8s.io","k8s-app":"kube-dns","kubernetes.io/cluster-service":"true"},"name":"kube-dns","namespace":"kube-system"},"spec":{"selector":{"matchLabels":{"k8s-app":"kube-dns"}},"strategy":{"rollingUpdate":{"maxSurge":"10%","maxUnavailable":0}},"template":{"metadata":{"annotations":{"scheduler.alpha.kubernetes.io/critical-pod":"","scheduler.alpha.kubernetes.io/tolerations":"[{\"key\":\"CriticalAddonsOnly\", \"operator\":\"Exists\"}]"},"labels":{"k8s-app":"kube-dns"}},"spec":{"containers":[{"args":["--domain=cluster.local.","--dns-port=10053","--config-dir=/kube-dns-config","--v=2"],"env":[{"name":"PROMETHEUS_PORT","value":"10055"}],"image":"gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/healthcheck/kubedns","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"kubedns","ports":[{"containerPort":10053,"name":"dns-local","protocol":"UDP"},{"containerPort":10053,"name":"dns-tcp-local","protocol":"TCP"},{"containerPort":10055,"name":"metrics","protocol":"TCP"}],"readinessProbe":{"httpGet":{"path":"/readiness","port":8081,"scheme":"HTTP"},"initialDelaySeconds":3,"timeoutSeconds":5},"resources":{"limits":{"memory":"170Mi"},"requests":{"cpu":"100m","memory":"70Mi"}},"volumeMounts":[{"mountPath":"/kube-dns-config","name":"kube-dns-config"}]},{"args":["-v=2","-logtostderr","-configDir=/etc/k8s/dns/dnsmasq-nanny","-restartDnsmasq=true","--","-k","--cache-size=1000","--log-facility=-","--server=/cluster.local/127.0.0.1#10053","--server=/in-addr.arpa/127.0.0.1#10053","--server=/in6.arpa/127.0.0.1#10053"],"image":"gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/healthcheck/dnsmasq","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"dnsmasq","ports":[{"containerPort":53,"name":"dns","protocol":"UDP"},{"containerPort":53,"name":"dns-tcp","protocol":"TCP"}],"resources":{"requests":{"cpu":"150m","memory":"20Mi"}},"volumeMounts":[{"mountPath":"/etc/k8s/dns/dnsmasq-nanny","name":"kube-dns-config"}]},{"args":["--v=2","--logtostderr","--probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A","--probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A"],"image":"gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1","livenessProbe":{"failureThreshold":5,"httpGet":{"path":"/metrics","port":10054,"scheme":"HTTP"},"initialDelaySeconds":60,"successThreshold":1,"timeoutSeconds":5},"name":"sidecar","ports":[{"containerPort":10054,"name":"metrics","protocol":"TCP"}],"resources":{"requests":{"cpu":"10m","memory":"20Mi"}}}],"dnsPolicy":"Default","serviceAccountName":"kube-dns","volumes":[{"configMap":{"name":"kube-dns","optional":true},"name":"kube-dns-config"}]}}}}
  creationTimestamp: 2017-03-02T14:38:47Z
  generation: 7
  labels:
    k8s-addon: kube-dns.addons.k8s.io
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: kube-dns
  namespace: kube-system
  resourceVersion: "19759179"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/deployments/kube-dns
  uid: f1d87cfc-ff55-11e6-8f43-0a1b518ebe22
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: kube-dns
  strategy:
    rollingUpdate:
      maxSurge: 10%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
        scheduler.alpha.kubernetes.io/tolerations: '[{"key":"CriticalAddonsOnly",
          "operator":"Exists"}]'
      creationTimestamp: null
      labels:
        k8s-app: kube-dns
    spec:
      containers:
      - args:
        - --domain=cluster.local.
        - --dns-port=10053
        - --config-dir=/kube-dns-config
        - --v=2
        env:
        - name: PROMETHEUS_PORT
          value: "10055"
        image: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/kubedns
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: kubedns
        ports:
        - containerPort: 10053
          name: dns-local
          protocol: UDP
        - containerPort: 10053
          name: dns-tcp-local
          protocol: TCP
        - containerPort: 10055
          name: metrics
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            memory: 170Mi
          requests:
            cpu: 100m
            memory: 70Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /kube-dns-config
          name: kube-dns-config
      - args:
        - -v=2
        - -logtostderr
        - -configDir=/etc/k8s/dns/dnsmasq-nanny
        - -restartDnsmasq=true
        - --
        - -k
        - --cache-size=1000
        - --log-facility=-
        - --server=/cluster.local/127.0.0.1#10053
        - --server=/in-addr.arpa/127.0.0.1#10053
        - --server=/in6.arpa/127.0.0.1#10053
        image: gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /healthcheck/dnsmasq
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: dnsmasq
        ports:
        - containerPort: 53
          name: dns
          protocol: UDP
        - containerPort: 53
          name: dns-tcp
          protocol: TCP
        resources:
          requests:
            cpu: 150m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/k8s/dns/dnsmasq-nanny
          name: kube-dns-config
      - args:
        - --v=2
        - --logtostderr
        - --probe=kubedns,127.0.0.1:10053,kubernetes.default.svc.cluster.local,5,A
        - --probe=dnsmasq,127.0.0.1:53,kubernetes.default.svc.cluster.local,5,A
        image: gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /metrics
            port: 10054
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        name: sidecar
        ports:
        - containerPort: 10054
          name: metrics
          protocol: TCP
        resources:
          requests:
            cpu: 10m
            memory: 20Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: Default
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: kube-dns
      serviceAccountName: kube-dns
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          name: kube-dns
        name: kube-dns-config
status:
  availableReplicas: 2
  conditions:
  - lastTransitionTime: 2017-05-17T23:16:21Z
    lastUpdateTime: 2017-05-17T23:16:21Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  observedGeneration: 7
  readyReplicas: 2
  replicas: 2
  updatedReplicas: 2

I actually think this needs to be filed as a bug upstream in the dns repo. There is no reason we should need an empty configmap.

Thoughts?

From @justinsb on slack -
So 1.6 introduces the volume with the optional mount, and dns uses that in 1.6
We see the server is 1.6, and so we install the 1.6 version
But the kubelet nodes are still on 1.5
So if a DNS pod lands on a 1.5 node, it won't be able to mount the configmap
And then, as you say, the rolling-update with drain fails
I only noticed it the other day, and I'm not 100% sure how to solve it TBH

That's definitely what happened to me. So if an empty configmap were created before upgrading, and then deleted at the end if the data in it is still empty, it would work. I don't know if that is something that can even be done. Maybe just put it in the upgrade notes and do it manually? (That feels a bit icky.)
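A rough sketch of that manual sequence (hedged; $CLUSTER_NAME is a placeholder, and the flow assumes the standard kops upgrade/update/rolling-update steps):

# Pre-create the empty configmap that the 1.6 kube-dns manifest expects
kubectl get configmap kube-dns -n kube-system || kubectl create configmap kube-dns -n kube-system

export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"
kops upgrade cluster $CLUSTER_NAME --yes
kops update cluster $CLUSTER_NAME --yes
kops rolling-update cluster $CLUSTER_NAME --yes

# Afterwards, delete the configmap only if nothing ever populated it
if [ -z "$(kubectl get configmap kube-dns -n kube-system -o jsonpath='{.data}')" ]; then
  kubectl delete configmap kube-dns -n kube-system
fi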

I had the exact same problem (missing kube-dns-config configmap) when upgrading a cluster from 1.5.3 to 1.6.2 with the KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" flag. I solved it by editing the kube-dns deployment and setting optional: true for that configmap in the volumes list (the optional flag was already there on the previous replicaset):

volumes:
- configMap:
    defaultMode: 420
    name: kube-dns
    optional: true
  name: kube-dns-config
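
For reference, the same change could be applied with a patch instead of kubectl edit (an untested sketch; it assumes kube-dns-config is the first entry in the volumes list, as in the deployment shown above):

kubectl -n kube-system patch deployment kube-dns --type json \
  -p '[{"op":"add","path":"/spec/template/spec/volumes/0/configMap/optional","value":true}]'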

After you changed that, did the pod get scheduled onto the 1.6 node?

Just bumped into this issue myself after upgrading from v1.5.2 to v1.6.2. @kurlzor's fix worked for me - thanks.

What I am seeing is that we are allowing the cluster to install kube-dns meant for 1.6 before the nodes are 1.6, which is incorrect.

I think I am running into a related issue: I am upgrading from 1.5.4 to 1.6.6 with the KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" feature flag, and the dns setup is very unstable, sometimes blocking the validation of the cluster. I _did_ create a kube-dns configmap before launching the rolling-update.

The behaviour I am observing is that the kube-dns-* pods keep being created and terminated and are never stable for more than a minute. Stranger still, it seems there is a conflict between two systems, because some kube-dns-* pods have 3 containers while others have 4 (the extra one is a healthz container).
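One way to see which pods carry the extra container (a sketch assuming the standard k8s-app=kube-dns label) is to watch the pods and list their container names:

kubectl get pods -n kube-system -l k8s-app=kube-dns -w
kubectl get pods -n kube-system -l k8s-app=kube-dns \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{range .spec.containers[*]}{.name}{" "}{end}{"\n"}{end}'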

KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate" feature flag and the dns setup is very unstable, sometimes blocking the validation of the cluster

How is it unstable? We need more details about how the validation and draining are impacting the cluster. The rolling update is not what installs kube-dns; that is nodeup. Upgrading kube-dns has been problematic.

How do we make the drain and validate more stable? What options did you use with the rolling-update?
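
For what it's worth, the options that usually matter here are the per-instance intervals, for example (values are illustrative, not recommendations, and $CLUSTER_NAME is a placeholder):

kops rolling-update cluster $CLUSTER_NAME \
  --master-interval 8m --node-interval 8m --yes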

BTW, you can hit the exact same problem without using DrainAndValidateRollingUpdate.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

Just ran into this also

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
