Kops: What is the correct way to update cluster from 1.11 to 1.12

Created on 21 May 2019  ·  20 Comments  ·  Source: kubernetes/kops

1. What kops version are you running? The command kops version, will display
this information.

Version 1.12.1
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.11.9
3. What cloud provider are you using?
aws
4. What commands did you run? What is the simplest way to reproduce this issue?
kops edit cluster to modify the kubernetesVersion from 1.11.9 to 1.12.7
and then
kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s --yes
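For comparison, a minimal sketch of the same change without --cloudonly (assuming the kops state store is already configured; kops set cluster is shown as an alternative to kops edit cluster, so check that your kops version supports this field before using it):

kops set cluster cluster.spec.kubernetesVersion=1.12.7
kops update cluster --yes
# Roll masters and nodes one at a time, validating the cluster in between
kops rolling-update cluster --yes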
5. What happened after the commands executed?
Unable to connect to the server: EOF, cannot connect to API server. Found the nodes in the API ELB are in OutOfService status.
6. What did you expect to happen?
Update successfully.
7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: my-cluster-name
spec:
  kubelet:
    anonymousAuth: false
  api:
    loadBalancer:
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-state-store
  dnsZone: xxxxxx
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    admissionControl:
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - DefaultTolerationSeconds
    - MutatingAdmissionWebhook
    - ValidatingAdmissionWebhook
    - ResourceQuota
    - NodeRestriction
    - Priority
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.11.9
  masterInternalName: xxxxxx
  masterPublicName: xxxxxx
  networkCIDR: 10.8.0.0/16
  networkID: vpc-xxxxxx
  networking:
    weave:
      mtu: 8912
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.8.1.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1a
    type: Private
    zone: ap-southeast-1a
  - cidr: 10.8.3.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1b
    type: Private
    zone: ap-southeast-1b
  - cidr: 10.8.5.0/24
    id: subnet-xxxxxx
    name: ap-southeast-1c
    type: Private
    zone: ap-southeast-1c
  - cidr: 10.8.2.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1a
    type: Utility
    zone: ap-southeast-1a
  - cidr: 10.8.4.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1b
    type: Utility
    zone: ap-southeast-1b
  - cidr: 10.8.6.0/24
    id: subnet-xxxxxx
    name: utility-ap-southeast-1c
    type: Utility
    zone: ap-southeast-1c
  topology:
    bastion:
      bastionPublicName: xxxxxx
    dns:
      type: Private
    masters: private
    nodes: private

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?
After the above rolling update failed, I tried modifying the etcd-related part of my cluster definition and moving back one patch version (1.12.7 -> 1.12.6) in order to trigger a rolling update, and it failed again.

  etcdClusters:
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: main
    version: 3.2.24
  - enableEtcdTLS: true
    etcdMembers:
    - instanceGroup: master-ap-southeast-1a
      name: a
    - instanceGroup: master-ap-southeast-1b
      name: b
    - instanceGroup: master-ap-southeast-1c
      name: c
    name: events
    version: 3.2.24


All 20 comments

I got the same issue.
After investigating, I saw that the problem was that the docker service was not running.
I googled and ran the command "sudo groupadd docker", and then the docker service started properly.
Please check how kops 1.12.x adds and runs that task.

@phungle This is probably not the issue. After some trial and error, I upgraded the cluster to 1.12.7. However, it only works with etcd2, not etcd3. An initial upgrade path I can conclude is:

  1. in 1.11, adopt etcd-manager and update
  2. in 1.11, stay on etcd2 by adding the etcd version with kops set cluster cluster.spec.etcdClusters[*].version=2.2.1
  3. upgrade the cluster to 1.12.7

I also tried removing the etcd version from the etcdClusters blocks, but then the cluster does not work. Setting the etcd version back to 2.2.1 brings it back.

It is really a painful and confusing experience.
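A rough sketch of that path as concrete commands, mirroring the numbered steps above; the provider and version field syntax comes from this thread, so verify it against kops set cluster in your version before running anything:

# step 1, still on Kubernetes 1.11: move etcd under etcd-manager
kops set cluster cluster.spec.etcdClusters[*].provider=manager
kops update cluster --yes
kops rolling-update cluster --yes
# step 2: pin etcd2 so the data store is not migrated to etcd3
kops set cluster cluster.spec.etcdClusters[*].version=2.2.1
kops update cluster --yes
kops rolling-update cluster --yes
# step 3: bump kubernetesVersion to 1.12.7 via kops edit cluster, then repeat update + rolling-update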

I am more worried about the error output you got after executing kops rolling-update.
Unable to connect to the server: EOF, cannot connect to API server. Found the nodes in the API ELB are in OutOfService status. That means that the backend instances (nodes) registered in your ELB in AWS are out of service.

@nzoueidi Yes, you are right. The upgrade from 1.11 to 1.12 is no doubt a disruptive process. As the doc here indicates, the master-interval is set to 1s, which rolls all the masters in one batch, so "unable to connect to the server" will always appear briefly even during a successful upgrade. The issue here is that the API server never comes back.
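When the API server never comes back, it can help to bypass the ELB and look at one master directly. A minimal sketch, assuming SSH access through the bastion, the default admin user on kope.io images, and the default kops log locations (all of which may differ in your setup; the hostnames are placeholders):

ssh -A admin@<bastion-public-name>
ssh admin@<master-private-ip>
# on the master:
sudo journalctl -u kubelet --no-pager | tail -n 50
sudo docker ps | grep -E 'etcd|apiserver'
sudo tail -n 50 /var/log/etcd.log /var/log/kube-apiserver.log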

Did anyone have a follow-up to this issue? What would be the correct way?


Hi @flmmartins, you can follow the process I mentioned above; it has been verified to work.

@hustshawn Any update on this?

I am experiencing the same issue. I can upgrade from k8s 1.11 to k8s 1.12 as long as I pin the config to etcd2. This isn't ideal since etcd2 support is deprecated in k8s 1.12 and dropped in k8s 1.13.

Has anyone successfully upgraded from k8s 1.11, etcd2 to k8s 1.12, etcd3? If so, what were the steps?

@asmith60 we've successfully done the upgrade in 3 environments. We followed the normal upgrade process but rolled the masters all at once.


The weirdest thing happened to me.
I added the Manager. To my surprise, etcd went to 3.2.24 (even though I only added the manager flag) and it worked on 1.11.

Then I decided to continue with the course of action and go to etcd 2.2.1 on 1.11, but then everything broke. It was chaos all over the place.

Then I reverted to etcd 3.x. Even though I was on 1.11, it failed.

Then I decided to jump to Kubernetes 1.13, went to 1.12, and nothing worked.

The only setup that worked for me so far was etcd 3.2.24 with 1.11.

Okay, I managed to do it. So here are my steps:

  1. Back up the 6 etcd EBS volumes using snapshots (see the sketch after these steps)

  2. Switch etcd to etcd-manager

kops set cluster cluster.spec.etcdClusters[*].provider=manager
kops update cluster --yes
# This will upgrade the masters all at once
kops rolling-update cluster --cloudonly --instance-group-roles master --master-interval=1s
# This will upgrade the nodes progressively
kops rolling-update cluster --yes

PS: This will change your etcd to etcd3 directly!

  3. Back up the 6 etcd EBS volumes using snapshots again

  4. Change to 1.12.9 using kops edit cluster

    • Edit your instance groups to use kope.io/k8s-1.12-debian-stretch-amd64-hvm-ebs-2019-08-16
kops edit instancegroups master-xxx
kops edit instancegroup nodes
kops update cluster --yes

  5. Upgrade the masters (will cause downtime - this was the tricky part for me... somehow doing a cloud-only upgrade on the masters alone was causing lots of issues)

kops rolling-update cluster --cloudonly --master-interval=1s --node-interval=1s --yes

  6. Do the upgrade normally to 1.13 if you wish.
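For the backup steps above, a minimal sketch using the AWS CLI; the KubernetesCluster tag is the one kops normally puts on its EBS volumes and my.example.com plus the volume id are placeholders, so confirm both against your account first:

# list the etcd volumes for the cluster (main + events across 3 AZs = 6 volumes)
aws ec2 describe-volumes \
  --filters "Name=tag:KubernetesCluster,Values=my.example.com" \
  --query "Volumes[].{Id:VolumeId,Name:Tags[?Key=='Name']|[0].Value}" \
  --output table
# snapshot each volume before touching the cluster
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "etcd backup before 1.11 to 1.12 upgrade"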

The upgrade suggested by @flmmartins above worked for me, so thanks for sharing!

However, beware, if you spin your cluster down and back up again (as many people do in testing environments), you will run into the issue described in #6605. The steps in the guide here will alleviate the issue for as long as the cluster is alive. But, as soon as you spin your cluster down at the end of the day to save :moneybag: and then try to spin it up the next day, the same issue will reoccur and you'll have to repeat the steps. I am testing different versions of k8s and Kops, as well as some good 'ole scripting, to see if I can find a good way around it, and will be documenting my progress in #7414 .

Hey @austinorth,

We actually never shut down our dev cluster, and we also did this procedure in production, but this is very good to know!
One thing I noticed is that since the upgrade I'm facing issues during the execution of kops rolling-update. The cluster gets validated and is fine, but kops cannot detect it as valid after creating the new machine(s) and throws:

Cluster did not validate, will try again in "30s" until duration "5m0s" expires: error listing nodes

Then I have to run the rolling-update again and again for each machine, which is annoying. Since I'm using a non-approved version of Kubernetes, 1.13.11 (which is not sanctioned by kops yet), I thought it might be because of that, so I didn't investigate further...

Do you think this issue relates with yours? I will do further investigations once I have time and maybe open an issue.
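A possible workaround sketch for the repeated validation failures; kops validate cluster is standard, and --fail-on-validate-error should exist on kops rolling-update in recent versions, but check the --help output for your kops before relying on it:

# see what validation is actually complaining about
kops validate cluster
kubectl get nodes -o wide
# keep rolling even if a validation round fails (use with care, it removes the safety net)
kops rolling-update cluster --fail-on-validate-error=false --yes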

Hey @flmmartins,

I've never run into that myself :thinking:. That minor version difference does sound like a probable cause, though, as every time I've had issues with the kops rolling-update command, it has been because of a difference between the Kops client and the Kubernetes version. We just made the jump to Kubernetes 1.14, for the record, which went super smoothly. Definitely recommend.

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Hey @austinorth I saw your issues and your problem seems to be related to applying kops with Terraform, right? That's what I could understand at least....
I will upgrade to 1.15 and would like to try shutting down the machines to save money.

@flmmartins hey! I have not tested without applying with Terraform, but the root cause of my issue and how I solved it is described here if you would like to take a look: https://github.com/kubernetes/kops/issues/7414#issuecomment-540835879 Ended up being an etcd thing.

I upgraded to 1.15.9 and now I can shut down the machines again to save cash without issue ^^

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
