Rke: Rolling upgrade for k8s system components

Created on 1 Feb 2019 · 8 Comments · Source: rancher/rke

There should be an option for rke up to upgrade the kubelet iteratively, one node at a time, proceeding to the next node only when all nodes are Ready.

RKE version: 0.1.15

Docker version: (docker version, docker info preferred) 17.03.2-ce

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) bare-metal

cluster.yml file:

Steps to Reproduce:

Results:

internal, kind/enhancement

clkao

👍10

Most helpful comment

https://github.com/rancher/rke/pull/1800
This PR adds the required changes.
It upgrades the controlplane components one node at a time, and worker components such as kubelet/kube-proxy in user-configurable batches.
Users can optionally drain nodes before the upgrade.

mrajashree on 19 Feb 2020

🎉2
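
For readers looking for what this looks like in cluster.yml, here is a minimal sketch of the upgrade strategy settings that shipped around RKE v1.1.0. The field names are as I recall them from the RKE documentation and the values are illustrative assumptions, not recommendations, so check them against the docs for your RKE version before relying on them.

```yaml
# Sketch only: verify field names and defaults against the RKE docs for your version.
upgrade_strategy:
  # Controlplane nodes are upgraded one at a time.
  max_unavailable_controlplane: 1
  # Worker nodes are upgraded in batches; accepts a count or a percentage.
  max_unavailable_worker: 10%
  # Optionally drain each node before upgrading its components.
  drain: true
  node_drain_input:
    force: false
    ignore_daemonsets: true
    delete_local_data: false
    grace_period: -1   # -1 means use each pod's own termination grace period
    timeout: 60        # seconds to wait for the drain to finish
```

With drain enabled, PodDisruptionBudgets on the workloads are honored by the eviction calls, so a too-strict budget can stall the drain until the timeout is reached.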

All 8 comments

It would be great to drain a node before upgrading it, and to support PodDisruptionBudget to avoid disruption!

remche on 10 Feb 2019
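
As an illustration of the PodDisruptionBudget idea above, here is a minimal sketch of a budget that a node drain would respect. The name, namespace, label, and replica count are hypothetical, and the apiVersion assumes Kubernetes 1.21 or later (older clusters, including those current when this issue was opened, use policy/v1beta1).

```yaml
# Hypothetical example: keeps at least 2 pods of "my-app" running during evictions.
apiVersion: policy/v1          # use policy/v1beta1 on clusters older than 1.21
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb             # hypothetical name
  namespace: default
spec:
  minAvailable: 2              # evictions that would drop below this are refused
  selector:
    matchLabels:
      app: my-app              # hypothetical label; must match the workload's pods
```

A drain evicts pods through the eviction API, which is what enforces the budget; pods deleted directly bypass it.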

This will make it easier to drain a node:

https://github.com/kubernetes/kubernetes/pull/72827

Meanwhile, do you think that manually draining nodes and running rke up on a partial cluster.yml is a way to achieve a no-downtime update?

remche on 27 Feb 2019

One of the reasons to do that is https://github.com/kubernetes/kubernetes/issues/74669: the kubelet might fail to start. Because the current worker plane upgrade touches all nodes at once, it can cause multiple node failures.

Another option, as @remche suggested, is to have an additional CLI option that limits worker plane upgrades to particular nodes. People can then choose to drain before the upgrade if desired.

clkao on 27 Feb 2019

I can confirm that we've encountered kubernetes/kubernetes#74669 when trying to upgrade the k8s version with rke 0.1.17. One of the nodes failed to start the kubelet while workloads were still running on it (and were simultaneously being recreated on the surviving nodes). Ultimately, this led to loss of data on rook-ceph block volumes that weren't unmounted on the failed node. I'd very much appreciate it if someone could suggest workarounds to perform a rolling upgrade while this enhancement is still not on the horizon.

lllamnyp on 27 Apr 2019

@clkao @remche and anyone else looking for a solution:
If one specifies the ssh keys for access to nodes on a per-node basis in the cluster.yml, commenting out the key for every node that should not be touched yet:

nodes:
- address: node1
  ...
  # ssh_key_path: something
...
- address: nodeN
  ...
  ssh_key_path: something

Then run rke up, uncomment one more ssh_key_path, run rke up again, and so on, until all nodes are updated.
In this case the k8s components are not restarted on all nodes at once. Specifically, a node that is being subjected to rke up for the first time restarts all components. The nodes that have the ssh key commented out are untouched. A node that is subjected to rke up for a second or subsequent time restarts etcd, etcd-rolling-snapshots and kube-apiserver. This seems to be a workable, though hacky, way to achieve a rolling upgrade.

lllamnyp on 23 May 2019

👍1

We should add an upgrade strategy to all relevant k8s system components. Kubelet and kube-proxy are the most critical ones.

alena1108 on 22 Aug 2019

mrajashree on 19 Feb 2020

🎉2

Tested with RKE version v1.1.0-rc6.

Verified the default upgrade strategy values for the ingress addon:

Upgrade strategy = rollingUpdate
maxUnavailable = 1

Verified the default upgrade strategy values for the networking addon:

Upgrade strategy = rollingUpdate
maxUnavailable = 1

Verified the default upgrade strategy values for the DNS addon:

Upgrade strategy = rollingUpdate
maxUnavailable = 1, maxSurge = 25%

Verified the default upgrade strategy values for the metrics server addon:

Upgrade strategy = rollingUpdate
maxUnavailable = 25%, maxSurge = 25%

Verified, using kubectl commands, that changes to the maxUnavailable and maxSurge fields in the addons' upgrade strategy are applied correctly.

soumyalj on 20 Feb 2020
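
For reference, the addon defaults verified above could be written out explicitly in cluster.yml roughly as follows. This is a sketch under assumptions: the update_strategy field layout is as I recall it from the RKE documentation, and the provider/plugin choices (nginx, canal, coredns, metrics-server) are only example selections, so confirm both against the docs for your RKE version.

```yaml
# Sketch only: mirrors the default values reported in the comment above.
ingress:
  provider: nginx              # example provider
  update_strategy:
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
network:
  plugin: canal                # example plugin
  update_strategy:
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
dns:
  provider: coredns            # example provider
  update_strategy:
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 25%
monitoring:
  provider: metrics-server     # example provider
  update_strategy:
    strategy: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
```

After changing a value and running rke up, the resulting strategy can be inspected on the corresponding DaemonSet or Deployment with kubectl, as the tester describes.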
