Kops: What is the correct way to update cluster from 1.11 to 1.12

Created on 21 May 2019  ·  20 Comments  ·  Source: kubernetes/kops

1. What kops version are you running? The command kops version, will display
this information.

Version 1.12.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.11.9
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops edit cluster to modify the kubernetesVersion from 1.11.9 to 1.12.7
and then
kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --yes
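For comparison, a minimal sketch of the same change without --cloudonly (assuming the kops state store is already configured; kops set cluster is shown as an alternative to kops edit cluster, so check that your kops version supports this field before using it):

kops set cluster cluster.spec.kubernetesVersion=1.12.7
kops update cluster --yes
# Roll masters and nodes one at a time, validating the cluster in between
kops rolling-update cluster --yes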
5. What happened after the commands executed?
Unable to connect to the server: EOF, cannot connect to API server. Found the nodes in the API ELB are in OutOfService status.
6. What did you expect to happen?
Update successfully.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  kubelet:
    anonymousAuth: false
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-state-store
  dnsZone: xxxxxx
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.9
  masterInternalName: xxxxxx
  masterPublicName: xxxxxx
  networkCIDR: 10.8.0.0/16
  networkID: vpc-xxxxxx
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.8.1.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1a
    type: Private
    zone: ap-southeast-1a
  - cidr: 10.8.3.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1b
    type: Private
    zone: ap-southeast-1b
  - cidr: 10.8.5.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1c
    type: Private
    zone: ap-southeast-1c
  - cidr: 10.8.2.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1a
    type: Utility
    zone: ap-southeast-1a
  - cidr: 10.8.4.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1b
    type: Utility
    zone: ap-southeast-1b
  - cidr: 10.8.6.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1c
    type: Utility
    zone: ap-southeast-1c
  topology:
    bastion:
      bastionPublicName: xxxxxx
    dns:
      type: Private
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
After the above rolling update failed, I tried modifying the etcd-related part of my cluster definition and moving back one patch version (1.12.7 -> 1.12.6) in order to trigger a rolling update, and it failed again.

  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: main
    version: 3.2.24
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: events
    version: 3.2.24


All 20 comments

I got the same issue.
After investigating, I saw that the problem was that the docker service was not running.
I googled and ran the command "sudo groupadd docker", and then the docker service started properly.
Please check how kops 1.12.x adds and runs that task.

@phungle This is probably not the issue. After some trial and error, I upgraded the cluster to 1.12.7. However, it only works with etcd2, not etcd3. An initial upgrade path I can conclude is:

  1. in 1.11, adopt etcd-manager and update
  2. in 1.11, stay on etcd2 by adding the etcd version with kops set cluster cluster.spec.etcdClusters[*].version=2.2.1
  3. upgrade the cluster to 1.12.7

I also tried removing the etcd version from the etcdClusters blocks, but then the cluster does not work. Setting the etcd version back to 2.2.1 brings it back.

It is really a painful and confusing experience.
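A rough sketch of that path as concrete commands, mirroring the numbered steps above; the provider and version field syntax comes from this thread, so verify it against kops set cluster in your version before running anything:

# step 1, still on Kubernetes 1.11: move etcd under etcd-manager
kops set cluster cluster.spec.etcdClusters[*].provider=manager
kops update cluster --yes
kops rolling-update cluster --yes
# step 2: pin etcd2 so the data store is not migrated to etcd3
kops set cluster cluster.spec.etcdClusters[*].version=2.2.1
kops update cluster --yes
kops rolling-update cluster --yes
# step 3: bump kubernetesVersion to 1.12.7 via kops edit cluster, then repeat update + rolling-update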

I am more worried about the error output you got after executing kops rolling-update.
Unable to connect to the server: EOF, cannot connect to API server. Found the nodes in the API ELB are in OutOfService status. That means that the backend instances (nodes) registered in your ELB in AWS are out of service.

@nzoueidi Yes, you are right. The upgrade from 1.11 to 1.12 is no doubt a disruptive process. As the doc here indicates, the master-interval is set to 1s, which rolls all the masters in one batch, so "unable to connect to the server" will always appear briefly even during a successful upgrade. The issue here is that the API server never comes back.
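When the API server never comes back, it can help to bypass the ELB and look at one master directly. A minimal sketch, assuming SSH access through the bastion, the default admin user on kope.io images, and the default kops log locations (all of which may differ in your setup; the hostnames are placeholders):

ssh -A admin@<bastion-public-name>
ssh admin@<master-private-ip>
# on the master:
sudo journalctl -u kubelet --no-pager | tail -n 50
sudo docker ps | grep -E 'etcd|apiserver'
sudo tail -n 50 /var/log/etcd.log /var/log/kube-apiserver.log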

Did anyone have a follow-up to this issue? What would be the correct way?


Hi @flmmartins, you can follow the process I mentioned above; it has been verified to work.

@hustshawn Any update on this?

I am experiencing the same issue. I can upgrade from k8s 1.11 to k8s 1.12 as long as I pin the config to etcd2. This isn't ideal since etcd2 support is deprecated in k8s 1.12 and dropped in k8s 1.13.

Has anyone successfully upgraded from k8s 1.11, etcd2 to k8s 1.12, etcd3? If so, what were the steps?

@asmith60 we've successfully done the upgrade in 3 environments. We followed the normal upgrade process but rolled the masters all at once.


The weirdest thing happened to me.
I added the Manager. To my surprise, etcd went to 3.2.24 (even though I only added the manager flag) and it worked on 1.11.

Then I decided to continue with the course of action and go to etcd 2.2.1 on 1.11, but then everything broke. It was chaos all over the place.

Then I reverted to etcd 3.x. Even though I was on 1.11, it failed.

Then I decided to jump to Kubernetes 1.13, went to 1.12, and nothing worked.

The only setup that worked for me so far was etcd 3.2.24 with 1.11.

Okay, I managed to do it. So here are my steps:

  1. Back up the 6 etcd EBS volumes using snapshots (see the sketch after these steps)

  2. Switch etcd to etcd-manager

kops set cluster cluster.spec.etcdClusters[*].provider=manager
kops update cluster --yes
# This will upgrade the masters all at once
kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s
# This will upgrade the nodes progressively
kops rolling-update cluster --yes

PS: This will change your etcd to etcd3 directly!

  3. Back up the 6 etcd EBS volumes using snapshots again

  4. Change to 1.12.9 using kops edit cluster

    • Edit your instance groups to use kope.io/k8s-1.12-debian-stretch-amd64-hvm-ebs-2019-08-16
kops edit instancegroups master-xxx
kops edit instancegroup nodes
kops update cluster --yes

  5. Upgrade the masters (will cause downtime - this was the tricky part for me... somehow doing a cloud-only upgrade on the masters alone was causing lots of issues)

kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s --yes

  6. Do the upgrade normally to 1.13 if you wish.
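For the backup steps above, a minimal sketch using the AWS CLI; the KubernetesCluster tag is the one kops normally puts on its EBS volumes and my.example.com plus the volume id are placeholders, so confirm both against your account first:

# list the etcd volumes for the cluster (main + events across 3 AZs = 6 volumes)
aws ec2 describe-volumes \
  --filters "Name=tag:KubernetesCluster,Values=my.example.com" \
  --query "Volumes[].{Id:VolumeId,Name:Tags[?Key=='Name']|[0].Value}" \
  --output table
# snapshot each volume before touching the cluster
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "etcd backup before 1.11 to 1.12 upgrade"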

The upgrade suggested by @flmmartins above worked for me, so thanks for sharing!

However, beware, if you spin your cluster down and back up again (as many people do in testing environments), you will run into the issue described in #6605. The steps in the guide here will alleviate the issue for as long as the cluster is alive. But, as soon as you spin your cluster down at the end of the day to save :moneybag: and then try to spin it up the next day, the same issue will reoccur and you'll have to repeat the steps. I am testing different versions of k8s and Kops, as well as some good 'ole scripting, to see if I can find a good way around it, and will be documenting my progress in #7414 .

Hey @austinorth,

We actually never shut down our dev cluster, and we also did this procedure in production, but this is very good to know!
One thing I noticed is that since the upgrade I'm facing issues during the execution of kops rolling-update. The cluster gets validated and is fine, but kops cannot detect it as valid after creating the new machine(s) and throws:

Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes

Then I have to run the rolling-update again and again for each machine, which is annoying. Since I'm using a non-approved version of Kubernetes, 1.13.11 (which is not sanctioned by kops yet), I thought it might be because of that, so I didn't investigate further...

Do you think this issue relates with yours? I will do further investigations once I have time and maybe open an issue.
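A possible workaround sketch for the repeated validation failures; kops validate cluster is standard, and --fail-on-validate-error should exist on kops rolling-update in recent versions, but check the --help output for your kops before relying on it:

# see what validation is actually complaining about
kops validate cluster
kubectl get nodes -o wide
# keep rolling even if a validation round fails (use with care, it removes the safety net)
kops rolling-update cluster --fail-on-validate-error=false --yes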

Hey @flmmartins,

I've never run into that myself :thinking:. That minor version difference does sound like a probable cause, though, as every time I've had issues with the kops rolling-update command, it has been because of a difference between the Kops client and the Kubernetes version. We just made the jump to Kubernetes 1.14, for the record, which went super smoothly. Definitely recommend.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Hey @austinorth I saw your issues and your problem seems to be related to applying kops with Terraform, right? That's what I could understand at least....
I will upgrade to 1.15 and would like to try shutting down the machines to save money.

@flmmartins hey! I have not tested without applying with Terraform, but the root cause of my issue and how I solved it is described here if you would like to take a look: https://github.com/kubernetes/kops/issues/7414#issuecomment-540835879 Ended up being an etcd thing.

I upgraded to 1.15.9 and now I can shut down the machines again to save cash without issue ^^

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
