Kops: Restarting master node causes cluster outage

Created on 16 Jan 2019 · 11 Comments · Source: kubernetes/kops

1. What kops version are you running? The command kops version will display
this information.

Version 1.11.0 (git-2c2042465)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.11.6

3. What cloud provider are you using?

AWS
4. What commands did you run? What is the simplest way to reproduce this issue?

kops rolling-update cluster --yes --instance-group=master-eu-central-1b

5. What happened after the commands executed?
The rolling update of the master node completed successfully. However, during the 2 to 3 minutes it took to provision the new node, there was a complete cluster outage.

6. What did you expect to happen?
I expected everything to continue operating as normal.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: 2017-12-19T15:09:49Z
  name: [REDACTED]
spec:
  additionalPolicies:
    node: |
      [{"Effect": "Allow","Action": ["route53:ChangeResourceRecordSets"],"Resource": ["arn:aws:route53:::hostedzone/*"]},{"Effect": "Allow","Action": ["route53:ListHostedZones","route53:ListResourceRecordSets"],"Resource": ["*"]}]
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: [REDACTED]
  docker:
    version: 18.06.1
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-eu-central-1b
      name: b
    name: main
  - etcdMembers:
    - instanceGroup: master-eu-central-1b
      name: b
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    oidcClientID: [REDACTED]
    oidcIssuerURL: https://accounts.google.com
    oidcUsernameClaim: email
  kubeDNS:
    provider: CoreDNS
  kubelet:
    anonymousAuth: false
  kubernetesApiAccess:
  - [REDACTED]
  kubernetesVersion: 1.11.6
  masterInternalName: [REDACTED]
  masterPublicName: [REDACTED]
  networkCIDR: [REDACTED]
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: [REDACTED]
  sshAccess:
  - [REDACTED]
  subnets:
  - cidr: [REDACTED]
    name: eu-central-1b
    type: Private
    zone: eu-central-1b
  - cidr: [REDACTED]
    name: utility-eu-central-1b
    type: Utility
    zone: eu-central-1b
  - cidr: [REDACTED]
    name: public-eu-central-1b
    type: Public
    zone: eu-central-1b
  topology:
    bastion:
      bastionPublicName: [REDACTED]
    dns:
      type: Public
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know?
It is my understanding that restarting the master node should have zero downtime.

Here is an example deployment of one of our apps that was impacted during the upgrade:

apiVersion: v1
kind: Service
metadata:
  name: eventd
  labels:
    app: eventd
  annotations:
    external-dns.alpha.kubernetes.io/hostname: [REDACTED]
    service.beta.kubernetes.io/aws-load-balancer-ssl-cert: [REDACTED]
    service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "443"
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: "http"
spec:
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 6000
  - name: https
    protocol: TCP
    port: 443
    targetPort: 6000
  type: LoadBalancer
  selector:
    app: eventd
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: eventd
  labels:
    app: eventd
    build: stable
spec:
  replicas: 3
  revisionHistoryLimit: 2
  minReadySeconds: 30
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: eventd
  template:
    metadata:
      labels:
        app: eventd
        build: stable
    spec:
      terminationGracePeriodSeconds: 300
      imagePullSecrets:
      - name: registrykey
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - eventd
            topologyKey: kubernetes.io/hostname
      nodeSelector:
        subnetType: private
      containers:
      - name: eventd
        image: avocet/services
        resources:
          requests:
            cpu: 1
            memory: 4G
        imagePullPolicy: Always
        command:
        - eventd
        ports:
        - containerPort: 6000
        livenessProbe:
          httpGet:
            path: /healthz
            port: 6000
          initialDelaySeconds: 90
          periodSeconds: 3
        volumeMounts:
        - name: ssl-certs
          mountPath: /etc/ssl/certs/ca-certificates.crt
          readOnly: true
      volumes:
      - name: ssl-certs
        hostPath:
          path: /etc/ssl/certs/ca-certificates.crt

All 11 comments

After doing some further investigation, it looks like all of the instances attached to the cluster's ELBs are temporarily deregistered when the master node restarts.

I can see several metrics and health checks in AWS stop reporting during this brief outage, which usually lasts around 2 to 3 minutes.

Could this be something to do with the dns-controller?
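One way to confirm the deregistration from the AWS side (a sketch, not part of the original report; the load balancer name is a placeholder, and this assumes a configured AWS CLI and a classic ELB) is to poll the ELB's registered instance health while the master rolls:

```shell
# Poll instance health on the Service's classic ELB every 5 seconds while
# the master restarts. "my-elb-name" is a placeholder; the real name is in
# the LoadBalancer Ingress hostname shown by `kubectl describe svc eventd`.
while true; do
  aws elb describe-instance-health \
    --load-balancer-name my-elb-name \
    --query 'InstanceStates[].[InstanceId,State]' \
    --output text
  echo "---"
  sleep 5
done
```

If the issue is what it looks like, the instance list goes empty (or all OutOfService) during the outage window and repopulates once nodes reconnect.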

I have seen the same thing when using CoreDNS.
It looks like CoreDNS fails to connect to the Kubernetes API, which in turn makes DNS lookups fail.

@olemarkus Unfortunately I've also experienced this with clusters just running kube-dns.

I've seen this with kube-dns too.

I didn't mean to suggest it was CoreDNS that was the culprit. We observed this with kube-dns too (which took even longer to recover from).

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

We are also experiencing a similar issue, with kops 1.12.2 and k8s 1.12.9.

From my investigation, what happens is that when the new master comes up, all nodes become NotReady for a short time, about a minute (until they reconnect to the new master). During this window the controller-manager sees no Ready nodes in the cluster and removes all current nodes from the cloud provider's load balancer.

When the nodes start reconnecting to the new master, the controller-manager re-adds them to the provider's load balancer.

Here are the events from one of the services. As you can see, there was a period with no available nodes, and after about a minute everything returned to normal:

Type     Reason                   Age   From                Message
----     ------                   ----  ----                -------
Normal   EnsuringLoadBalancer     17m   service-controller  Ensuring load balancer
Normal   EnsuredLoadBalancer      17m   service-controller  Ensured load balancer
Warning  UnAvailableLoadBalancer  15m   service-controller  There are no available nodes for LoadBalancer service default/hello-world
Normal   UpdatedLoadBalancer      14m   service-controller  Updated load balancer with new hosts

Basically this means restarting a master node is no longer safe. I'm not sure when this started, but I never experienced the problem before upgrading kops/k8s to 1.12.x.

I couldn't find related issues in the k8s GitHub repo or here, but it's definitely not desired behavior.
@justinsb @chrislovecnm we need your attention and advice here, thanks
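For anyone wanting to reproduce the observation above, the NotReady flap and the service-controller events can be watched live from a second terminal (a sketch; "hello-world" is the Service name from the events above, substitute your own):

```shell
# Watch node Ready conditions flap while the new master comes up:
kubectl get nodes --watch

# In another terminal, stream the service-controller events for the
# affected LoadBalancer Service ("hello-world" as in the events above):
kubectl get events --field-selector involvedObject.name=hello-world --watch
```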

I was able to solve this by using a load balancer for the master API (internal domain).

In the kops cluster spec, replace:

api:
  dns: {}

with:

api:
  loadBalancer:
    type: Internal
    useForInternalApi: true

Then remove both the internal and external old DNS A records from Route 53 and run terraform apply (if you use Terraform).

I guess the DNS-based approach is not safe...
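To find the old A records that need removing, something like the following works (a sketch; the hosted zone ID is a placeholder, and the actual deletion goes through `aws route53 change-resource-record-sets` with a DELETE change batch, or through your Terraform state):

```shell
# List the A records in the zone so the stale api/api.internal entries can
# be identified before deleting them. ZONEID is a placeholder.
aws route53 list-resource-record-sets \
  --hosted-zone-id ZONEID \
  --query "ResourceRecordSets[?Type=='A'].[Name]" \
  --output text
```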

@shamil RE: https://github.com/kubernetes/kops/issues/6349#issuecomment-504414022

This is the exact same behaviour we are experiencing. The only way I've been able to mitigate it so far is to pass the --cloudonly flag when doing a kops rolling-update on the master node.
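For reference, the full command with that flag (the instance group name is from my setup, adjust for yours). Note that --cloudonly performs the roll at the cloud-provider level only, skipping drain and validation against the Kubernetes API, so use it with care:

```shell
# Roll the master without talking to the (flapping) Kubernetes API:
kops rolling-update cluster --yes --cloudonly \
  --instance-group=master-eu-central-1b
```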

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

I didn't get anywhere with this, and the only solution I found was this: https://github.com/kubernetes/kops/issues/6349#issuecomment-504657476
